Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

capOS Documentation

cap-os.dev documents the current capOS implementation: the implemented operating model, build and boot workflow, runnable demos, architecture, configuration surface, and security and verification boundaries.

capOS is a research operating system where kernel and userspace services are typed Cap’n Proto capabilities invoked through shared-memory rings. The manual focuses on behavior that exists or is directly reviewable in this repository; project plans, proposals, and research notes remain available as archives rather than driving the primary reading path.

The Basic Idea

capOS is an experiment in making an operating system easier to reason about. In familiar operating systems, a program’s power is spread across many mechanisms: system calls, file paths, sockets, process identity, permissions, environment variables, inherited handles, and service-specific protocols. That model is flexible, but it can be hard to answer simple questions: what can this program actually do, who gave it that power, and can that power be passed, revoked, recorded, or moved somewhere else without hidden side effects?

capOS tries a different tradeoff. A program can act only through explicit typed capabilities it already holds. The interface is the permission: instead of giving a broad handle plus a separate rights mask, capOS gives a narrower object with only the methods the caller should have. The same Cap’n Proto schema describes the kernel call, the service call, and the wire format used between processes.

If that approach works, it should make several things more natural: running small services, tools, and future AI agents with least authority, handing a resource from one program to another without accidentally duplicating it, auditing or replaying service traffic, and eventually moving services across persistence or network boundaries without inventing a second permission model. capOS is not a production OS or a Linux replacement; it is a prototype for testing whether those design choices hold together in real runnable code.

Start Here

For a printable current-system reference, use the PDF manual; planning archives and research notes remain on the website.

  • What capOS Is describes the implemented system model and the main authority boundaries.
  • Current Status lists what works today, what is partial, and what remains future work.
  • Build, Boot, and Test gives the commands used to build the ISO, boot QEMU, and run host-side validation.
  • Configuration explains operator overlays, host-user tag injection, the tools cache, and schema-aware data conversion.
  • Repository Map maps the main subsystems to source files.
  • Programming Languages describes current native Rust support and the status of Python, Go, Lua, C/C++, WASI, and POSIX adapters.
  • ABI Evolution Policy defines the compatibility rules for schema, ring, bootstrap, and runtime ABI changes.
  • First Chat Demo shows the smallest runnable resident-service chat proof and its current single-terminal limits.
  • Aurelian Frontier (proof slice) shows the current runnable multi-process slice of the Aurelian Frontier game and its QEMU proof.
  • Paperclips Terminal Demo shows a clean-room incremental terminal game running as an ordinary shell-launched process.

Site Map

  • System Architecture is the design reference for current behavior: boot, process, capability, runtime, memory, scheduling, IPC, threading, and park behavior.
  • Programming Languages summarizes implemented native Rust support and points language-specific future work back to owning proposals.
  • Security and Verification is the reviewer path: trust boundaries, validation workflow, trusted inputs, panic inventory, and DMA design.
  • Runnable Demos documents the proof paths that exercise the implemented service model.
  • Reference and Project Archives keeps planning, proposal, research, and topic-index material available below the manual sections without making it part of the manual PDF.

What capOS Is

A research kernel that boots on x86_64 QEMU. The rest of this page is about why it looks the way it does — the specific design bets behind the code — not a feature inventory. For the feature-by-feature matrix, see Current Status.

What Makes capOS Different

capOS is a research vehicle for a few specific design bets. Each is unusual on its own; the combination is the point.

  • Everything is a typed capability. System resources are accessed through Cap’n Proto interfaces defined in schema/capos.capnp. There is no ambient authority — no global path namespace, no open-by-name, no implicit inherit. A process can only invoke objects present in its local capability table. See Capability Model and the schema/repo map.
  • The interface IS the permission. Instead of a parallel READ/WRITE/EXEC rights bitmask (Zircon, seL4), attenuation is a narrower capability: a wrapper CapObject exposing fewer methods, or an Endpoint client facet that cannot RECV/RETURN. The kernel just dispatches; policy lives in interfaces. See Capability Model, IPC and Endpoints, and the prior-art notes on Zircon and seL4.
  • Identity metadata is not authority. In prose, a user is the human-facing actor, a principal is identity metadata, an account is planned durable local record state, and policy/resource profiles select bundles and quotas. Sessions receive capabilities; none of those labels become kernel subjects or bypass cap-table authority. See the local users backlog, User Identity and Policy, and Resource Accounting and Quotas.
  • io_uring-style shared-memory ring for every call. Every process owns a submission/completion queue page. Userspace writes SQEs with a normal memory store; the kernel processes them through cap_enter. New operations are SQE opcodes (CALL, RECV, RETURN, RELEASE, NOP), not new syscalls. The remaining syscall surface is cap_enter and exit; the accepted threading contract keeps current-thread exit as a ThreadControl capability operation. See Capability Ring, Userspace Runtime, and In-Process Threading.
  • Release is transport, not an application method. Dropping the last owned handle in capos-rt queues one local CAP_OP_RELEASE; acquiring or dropping a runtime ring client flushes the queue, and long-running code can call Runtime::flush_releases() explicitly. No close() method on every interface, no mutable table self-reference during dispatch. See Userspace Runtime and Capability Ring.
  • Capability transfer is first-class. Copy and move descriptors ride sideband on CALL/RETURN SQEs. Move reserves the sender slot until the receiver accepts and preflight checks pass, then commits or rolls back atomically — no lost, duplicated, or half-inserted authority. See Authority Accounting and IPC and Endpoints.
  • Cap’n Proto wire format end-to-end. The same encoding describes the boot manifest, runtime method calls, and future persistence/remote transparency. The debug tap records fixed, bounded SQE/CQE metadata today; authorized payload capture, replay, audit, and migration remain future transport work. See Manifest and Service Startup, Error Handling, and Storage and Naming.
  • Host-testable pure logic. Cap-table, frame-bitmap, ELF parser, frame ledger, lazy buffers, small ABI constants, and the ring model live in capos-lib, capos-abi, and capos-config, and run under cargo test-lib, Miri, Loom, Kani, and proptest without any kernel scaffolding. Kernel glue stays thin. See Verification Workflow and Repository Map.
  • Schema-first boot. system.cue is compiled to a Cap’n Proto SystemManifest embedded as the single Limine boot module. The kernel validates only the kernel-owned boot boundary and launches initConfig.init; mkmanifest and init validate the service graph under initConfig.services as structured data, not shell scripts or baked environment variables. See Boot Flow, Manifest and Service Startup, and Build, Boot, and Test.

Execution Model

Each process owns an address space, a local capability table, a mapped capability-ring page, and a read-only CapSet page that enumerates its bootstrap handles. The kernel enters Ring 3 with iretq and returns through cap_enter or the timer. Ordinary capability calls progress only via cap_enter; timer-side polling handles non-CALL ring work and call targets that are explicitly safe for interrupt dispatch. Details in Process Model, Capability Ring, In-Process Threading, and Scheduling.

Boot Flow

The kernel receives exactly one Limine module — a Cap’n Proto SystemManifest compiled from system.cue — validates the kernel-owned boot boundary, loads only initConfig.init.binary, builds that process’s bootstrap capability table and CapSet page from initConfig.init.caps, and starts the scheduler. The default manifest now boots the standalone init ELF, and init validates the service graph before spawning the foreground capos-shell, the remote-session CapSet gateway, and the resident demo services. The shell mints an anonymous UserSession when it starts and the user runs login or setup as ordinary shell commands to upgrade to an operator session. Focused shell-led manifests such as system-smoke.cue and system-shell.cue still boot capos-shell directly as initConfig.init until the run-target/init-policy cleanup migrates them. Full walkthrough in Boot Flow and Manifest and Service Startup.

Authority Boundaries

Authority is carried by cap-table hold edges with generation-tagged CapIds. Ring 0 ↔ Ring 3, capability table ↔ kernel object, endpoint IPC, copy/move transfer, manifest/boot-package, and process spawn are the boundaries reviewers care about; each one fails closed at hostile input. See Trust Boundaries for the boundary table and Authority Accounting for the transfer and quota invariants.

What capOS Is Not

A POSIX clone, a microkernel-shaped Linux replacement, or a production OS. It is a place to try the above choices and see which ones survive contact with real workloads. See Build, Boot, and Test to run it.

Current Status

This page describes current repository behavior, not the full long-term design.

Current Snapshot

capOS boots on x86_64 QEMU, starts a standalone init process from the default manifest, and runs the native shell plus resident demo services through typed capabilities. The current operator path starts as an anonymous shell session; login prompts for username> and hidden password>, validates the selected bootstrap account through SessionManager and CredentialStore, and upgrades to a broker-issued operator bundle. Default password-authenticated local operator sessions do not expire by wall-clock timestamp; they remain intended to end through logout, terminal/connection/process-tree close, or administrator revocation. Manifests can still set a non-default operator lifetime for focused expiry proofs. setup can create a volatile local operator credential and then follows the same login upgrade path.

The implemented baseline includes isolated processes, user-mode ELF loading, shared-memory capability rings, endpoint IPC, copy/move capability transfer, thread and park primitives, init-owned service spawning, local shell login, focused Telnet shell demo, resident chat/adventure services, and the Paperclips terminal demo. The current selected milestone is GCE Self-Hosted Web UI: the next visible path is serving the remote-session Web UI through the Phase C userspace network stack and proving private GCE reachability before any public endpoint. Installable System is the completed previous selected milestone for the bounded local/QEMU contract: data-region mount, config-overlay merge, generation/rollback machinery, integrated bootable disk, install, first-boot provision, update/rollback, and structural proposal/body wording reconcile have landed. Device Driver Foundation is also complete; its GCP-first provider rollup has live operator-access, selected NIC raw-frame, selected storage I/O, and gVNIC portability evidence. Durable multi-account credential storage, broader account policy, production SSH/WebShell ingress, public GCE ingress/TLS, AWS/Azure providers, broader storage variants, high-throughput NIC work, direct-remapping DMA, and persistence beyond the landed installable data-region and generation paths remain future work.

Recent Status Notes

Update 2026-06-11 19:21 UTC: the GCE Self-Hosted Web UI local readiness wave is reconciled here. Since the local Web UI L4 proof, the following landed as local QEMU/cloudboot or no-spend harness evidence only: legacy GCE virtio-net Web UI serving (make run-cloud-gce-legacy-virtio-webui-serving proves a host HTTP peer fetching the byte-verified UI bundle over the kernel-brokered legacy virtio 0.9 runtime that backs the typed Nic cap), a browser-facing hardening set proved on the L4 gate (single public-origin policy, IAP-aware SameSite cookie policy, JSON content-type guard, security response headers with a strict CSP, GFE-range-pinned X-Forwarded-Proto trust, the public /healthz health-check contract, and in-guest login peer-gate/failure-backoff hardening), and a no-spend provider-harness fixture set (the private-proof --preflight-only mode, private and public proof-evidence validators, the public ingress resource plan gate, the journal-driven teardown engine, and the provider-command allowlist gate), each fixture gate driving recording stub provider CLIs only, with no real provider invocation or mutation on any path. None of this proves private GCE reachability, public exposure, TLS custody, or production readiness; the live private and public proofs stay on hold.

The selected GCE Self-Hosted Web UI evidence ladder is:

Update 2026-06-08 10:21 UTC: local bounded ICMPv4 Echo Reply diagnostics have landed for the Phase C userspace network stack. make run-cloud-prod-icmp-echo-reply first reruns the served TcpListenAuthority local proof, then boots the ICMP manifest, acquires the QEMU SLIRP DHCP lease 10.0.2.15/24, proves same-subnet ARP and ICMP Echo Request / Echo Reply preservation for identifier 0x04d2, sequence 9, and a 23-byte payload, and rejects bad-checksum, invalid-code, truncated, address-family, oversized-payload, and oversized-frame controls. This answers bounded local ping for diagnostics only; it is not Web UI readiness and does not open public ICMP or change GCE firewall posture.

Update 2026-06-08 08:24 UTC: Phase C slice 7c-ii(b) has a local serve-from-userspace proof, and the legacy kernel socket-path grant is retired for non-qemu production manifests. make run-cloud-prod-userspace-network-stack-smoltcp boots the non-qemu cloudboot manifest, starts a userspace smoltcp network-stack service, grants an application client only Console plus a served TcpListenAuthority, and completes one hostfwd TCP request/response through served TcpListener/TcpSocket caps while preserving host_physical_user_visible=0. Non-qemu manifests that request kernel network_manager or tcp_listen_authority now fail closed instead of reaching kernel/src/virtio_stub.rs; remaining kernel socket grants are qemu-only fixtures. DHCP/IPv4 configuration, Web UI L4, private GCE reachability, public ingress/TLS, and kernel smoltcp/virtio-net cleanup remain separate tasks. The current GCE Self-Hosted Web UI evidence ladder lives in the 2026-06-11 19:21 UTC update above.

Update 2026-06-07 18:20 UTC (through commit 12b8334a, committed 2026-06-07 18:19 UTC): Installable System is closed for the bounded local/QEMU contract. The closeout reconciles the proposal, backlog, proposal index, roadmap, and status page to the landed data-region, overlay, generation, install, provision, and update/rollback proofs while preserving the RAM-only Namespace caveat and leaving secure boot/signing, production release authority, public ingress, broader provider support, full userspace smoltcp/L4 readiness, and full durable account policy as future work. The selected milestone is now GCE Self-Hosted Web UI.

Update 2026-06-07 08:23 UTC (commit ef8d98c2): Device Driver Foundation production-authority closeout is recorded by ddf-production-authority-closeout. The closeout ties together the landed provider-driver, interrupt, audit, and DMA-policy prerequisites and keeps public ingress, AWS/Azure support, direct-remapping hardware, device-autonomous MSI-X, high-throughput NIC, and userspace smoltcp/L4 readiness as future follow-up work.

Update 2026-06-07 05:26 UTC: the GCP-first usable-instance provider rollup is closed by cloud-usable-instance-provider-nic-storage. The rollup cites real GCE serial-console operator access (1779868872-2424), live legacy virtio-net raw-frame provider-nic-bound (1780412056-e1cb), live NVMe Persistent Disk brokered READ (1780806087-bf69), and separate live gVNIC raw-frame / typed-Nic portability evidence (1780794927-1aa9, 1780796615-decc). This is not a public L4 ingress, SSH/WebShell, AWS/Azure, high-throughput NIC, broader storage, direct-remapping DMA, or production cloud-image release claim.

Update 2026-05-23 16:51 UTC (commit c86374f8): make run-ddf-provider-consumer now has a stable userspace virtio-net provider closeout proof line. The line ties together selected queue 1 TX descriptor/avail/doorbell/used-ring/CQ ownership across the full QEMU TX queue depth, bounded queue 0 RX synthetic-token CQ identity, selected TX/RX MSI-X/LAPIC wait/ack/EOI, selected-route reset/reassignment, teardown, stale-handle blocking, and explicit no-silent-provider-fallback boundaries. This remains local bounded provider evidence over manager-owned bounce buffers: live hardware RX used-ring ownership, full virtio-net ownership, direct DMA/IOMMU, cloud NIC/storage readiness, and virtio block/storage drivers remain future work.

Update 2026-05-23 13:36 UTC (commit e248d42b): make run-ddf-provider-consumer now exercises selected userspace virtio-net TX CQ ownership across the full eight-entry TX queue depth used by the smoke. Eight manager-owned bounce buffers can be live before the first completion, the ninth allocation fails closed, wrong-order completion of descriptor 7 preserves descriptor 0, CQ identity is delivered and acknowledged in order for descriptors 0 through 7, release drains seven incomplete descriptors as teardown-only, and provider TX release retires seven delivered but unacknowledged CQ events. This remains bounded selected-queue evidence over bounce buffers: live hardware RX used-ring ownership, direct DMA/IOMMU, full userspace virtio-net ownership, and cloud NIC/storage readiness remain open.

Update 2026-05-12 16:40 UTC: make run-ddf-provider-consumer now exercises the selected userspace virtio-net TX path through bounded queue 1 descriptor/avail publication, exactly one selected notify doorbell, and a runtime-visible tx_interrupt completion event tied to the used-ring handoff. The selected submit path validates the live DMABuffer record, scrubs the bounce page, consumes the live no-write notify_mmio policy, publishes the stored descriptor/avail entry, and rings the notify doorbell only after those gates pass. DMABuffer.completeDescriptor then observes the real TX used-ring entry for the stored software descriptor generation, clears the manager in-flight record, and delivers a bounded selected-tx-used-ring-completeDescriptor event to a live tx_interrupt.wait for the same route. This is still selected-route proof coverage rather than full userspace virtio-net ownership: arbitrary doorbells, production NIC or storage migration, cloud readiness, hardware IRQ ownership, hardware acknowledgement/mask/unmask, direct DMA, IOMMU programming, broader CQ ownership, and grantable device ownership remain open.

Update 2026-05-11 14:39 UTC, commit f04a14f4: make run-ddf-provider-consumer now turns the selected userspace virtio-net TX doorbell gate into an explicit staged claimed-notify-offset admission proof. The selected queue 1 provider entry reports accepted notify-offset policy, blocked wrong-queue policy, blocked wrong-offset policy, and no_doorbell=true after descriptor authority validation and submit scrub. The same smoke submits queue 0 first and proves it remains neutral rather than selected/backend doorbell-capable. This remains bounded manager-owned bounce-buffer evidence only: no virtio-net notify BAR handle is granted, no notify register is written, no real virtio-net descriptor ring is mutated, and production userspace NIC or cloud readiness is not claimed.

Update 2026-05-11 12:01 UTC: make run-ddf-provider-consumer now extends the bounded provider-visible submit effect with a selected provider-owned queue entry. Accepted DMABuffer.submitDescriptor still validates descriptor authority and scrubs the manager-owned bounce page first, then writes queue magic, queue id, tail, descriptor id, submitted length, and flags before the submit marker. The focused smoke maps the buffer after completion and proves the queue entry and marker remain visible outside the completed byte range; it also rejects submits shorter than the full 24-byte provider-effect footprint with zero in-flight accounting and no provider mutation. This is bounded bounce-buffer evidence only: no hardware descriptor ring or CQ is published, no direct DMA or IOMMU/remapping is enabled, host physical/IOVA values stay hidden, no MMIO doorbell is written, and provider-driver IRQ consumption plus cloud NIC/storage readiness remain future work.

Update 2026-05-11 11:22 UTC: make run-ddf-provider-consumer now proves a bounded descriptor-ring-equivalent provider side effect after DMABuffer.submitDescriptor authority validation. The accepted submit path scrubs the manager-owned bounce page, writes a provider-visible shadow descriptor entry with magic, queue, descriptor id, submitted length, and flags, and then writes the existing submit marker before the in-flight record is committed. The same process maps the buffer after DMABuffer.completeDescriptor and proves the shadow descriptor and marker remain visible outside the completed byte range. A follow-up boundary check rejects submits shorter than the 24-byte provider-effect footprint as dmabuffer-provider-effect-too-short, preserving zero in-flight accounting and blocking side effects. This is still bounded bounce-buffer evidence: no hardware descriptor ring or CQ is published, no direct DMA or IOMMU/remapping is enabled, host physical/IOVA values stay hidden, arbitrary MMIO doorbells remain blocked, and provider-driver IRQ consumption plus cloud NIC/storage readiness remain future work.

Update 2026-05-11 10:46 UTC: make run-ddf-provider-consumer now proves the first bounded provider-visible DMA side effect in the four-cap provider consumer. On accepted DMABuffer.submitDescriptor, the manager-owned bounce page is scrubbed, then a submit marker is written only after descriptor authority validation succeeds. The same process maps the buffer after DMABuffer.completeDescriptor and proves the completion pattern is limited to the completed byte range while the submit marker outside that range remains visible. This is still bounded bounce-buffer evidence: no descriptor ring or CQ is published, no direct DMA or IOMMU/remapping is enabled, host physical/IOVA values stay hidden, arbitrary MMIO doorbells remain blocked, and provider-driver IRQ consumption plus cloud NIC/storage readiness remain future work.

Update 2026-05-11 10:06 UTC: commit c52064c0 extends the same make run-ddf-provider-consumer four-cap provider-consumer smoke beyond bounded DMABuffer submit/complete accounting. The service now also calls brokered DeviceMmio.read32, the existing claimed-register DeviceMmio.write32 path, and brokered readback before DeviceMmio.unmap. The interrupt half proves one async Interrupt.wait completes as delivered after route unmask, a second async wait stays pending for a kernel turn, and Interrupt.mask completes that second waiter as cancelled. This remains bounded provider-authority composition evidence only: DMA is still the manager-owned bounce-buffer path, MMIO writes remain limited to the claimed register policy with no arbitrary doorbell, IRQ behavior is bounded route-generation-checked waiter delivery/cancellation, and production NIC/storage migration, IOMMU/remapping, descriptor-ring mutation, completion-queue publication, provider-driver interrupt consumption, and cloud readiness remain future work.

Update 2026-05-11 09:25 UTC: make run-ddf-provider-consumer now extends the same four-cap provider-consumer smoke across the bounded DMABuffer submit/complete descriptor-accounting path. After allocating and unmapping one manager-owned bounce buffer, the service calls typed DMABuffer.submitDescriptor and DMABuffer.completeDescriptor, asserts manager-inflight-recorded then manager-inflight-completed, and checks DMAPool.info reports live_inflight=1 after submit and live_inflight=0 after completion before freeing the buffer. This remains bounded provider-authority composition evidence only: no descriptor ring is mutated, no CQ is published, direct DMA stays blocked, host physical/IOVA exposure stays hidden, arbitrary MMIO writes and doorbells remain blocked, and production NIC/block migration remains future work.

Update 2026-05-11 03:09 UTC: commit 9c0a5183 carries the manager-owned fixed bounce-buffer DMAPool budget ledger into DMAPool.allocateBuffer. The manifest-granted three-slot pool now attaches a device-manager-owned budget policy with three live buffers/pages, 12288 bytes, four queues, eight descriptors per queue, one in-flight descriptor per live slot, zero MMIO mappings/bytes, and zero interrupt holds. With all three fixed slots live, a fourth valid 4096-byte allocation returns no result cap and reports result=dmapool-budget-exceeded, reason=over-buffer-budget, sideEffect=side-effect-blocked, and bufferPresent=false before slot selection, frame allocation, generation allocation, cap minting, or manager ledger mutation. Imported live virtio-net proof records continue to use the kernel-owned device_dma:virtio-net budget policy. This remains the bounded manager-owned bounce-buffer path: direct DMA stays blocked, host physical addresses and IOVAs stay hidden, descriptor rings and completion queues are not mutated, and IOMMU/remapping plus production driver consumption remain future work.

Update 2026-05-11 00:20 UTC: make run-ddf-provider-consumer now boots one focused service that receives console, DMAPool, DeviceMmio, and Interrupt in the same CapSet. The smoke uses existing typed runtime clients and unchanged ABI surfaces: it validates DMAPool.info, allocates one manager-owned bounce-buffer DMABuffer, maps and unmaps it through the existing userspace bounce-buffer path, frees it with scrub-before-frame-free evidence, maps and unmaps a boot-preseeded DeviceMmio BAR page read-only, and exercises bounded Interrupt.wait, acknowledge, unmask, and mask on the manager-attached route. The harness asserts the four-cap service spawn, one manager-grant-source acquire/release for each authority family, and the stable provider-consumer proof line with authorities=dmapool,device_mmio,interrupt, no direct DMA, no host physical/IOVA exposure, no arbitrary MMIO write or doorbell, no real interrupt delivery requirement, and no production NIC/block migration. This is bounded composition evidence only; it does not add schema/generated/runtime changes, IOMMU programming, provider-driver interrupt consumption, or production storage/network driver migration.

Update 2026-05-10 20:06 UTC: manifest-granted DeviceMmio.write32(offset, value) now performs a bounded kernel-side volatile MMIO write after validating the active manager-attached handle, owner/state, region and policy binding, pure DeviceMmioOperation::Write authority, a dword-aligned in-BAR range, and the single PCI MSI-X metadata-derived provider claim, including BDF, BAR, BAR base, offset, and value. The effect uses only the boot-preseeded kernel MMIO mapping cache already used by DeviceMmio.read32; it does not install a post-userspace kernel mapping, and the userspace BAR VMA remains read-only. The focused smoke uses the claimed virtio-rng MSI-X entry-0 vector-control mask dword, reports side_effect=mmio-write-performed and register_write=performed, then reads the same value back through brokered read32 and the read-only userspace VMA. It also attempts an unclaimed message-address dword write and reads back the original value unchanged. Invalid range and unclaimed-register paths remain typed side-effect-blocked results, while stale or released handles fail closed before any write and do not return a write32 result payload. This does not add writable userspace MMIO, arbitrary register writes, doorbells, host physical/IOVA exposure, IOMMU programming, or a production provider-driver consumer.

Update 2026-05-10 19:29 UTC: DMABuffer.completeDescriptor now produces a bounded userspace-visible completion effect on the manager-owned bounce-buffer page. On the existing valid matching manager-inflight-completed path, after active owner/epoch/slot validation and submitted-length checks pass, the manager writes a deterministic byte pattern into the first completionLength bytes of that slot’s bounce page before clearing the in-flight record. The focused DMAPool smoke maps the same slot after completion and proves byte 0 and the last completed byte match the pattern while the next byte remains unchanged. Invalid, stale, no-inflight, mismatched, length-exceeded, mapped-live, and after-free paths keep their fail-closed labels and do not write. This is bounce-buffer completion data only: no direct DMA, descriptor ring mutation, CQ publication, host physical/IOVA exposure, IOMMU programming, or production driver consumer is added.

Update 2026-05-10 16:47 UTC: the manifest-granted Interrupt.wait path now has a fixed-table deferred waiter object for the current manager-attached route. A masked wait still fails closed synchronously as stale-pending-irq-masked / route-masked / side-effect-blocked, but an unmasked wait now returns pending. The focused interrupt smoke submits that wait nonblocking, drives the kernel once, observes it remains pending, calls Interrupt.mask, then finishes the original wait as interrupt-waiter-cancelled / route-masked / waiter-completed-no-irq with wake_blocked=false, matching source/route generations, unchanged delivery counts, and no real IRQ delivery. This is no-IRQ cancellation waiter behavior only; hardware acknowledgement, MSI/MSI-X programming, IRQ-delivery userspace waiters, and production interrupt dispatch remain future work.

Update 2026-05-10 15:33 UTC: the manifest-granted DeviceMmio.map path now maps a boot-preseeded MMIO BAR page into the caller’s userspace address space as read-only, user-accessible, no-execute, no-cache PTEs. Accepted requests return userspace-mmio-bar-mapped, boot-preseeded-read-only-bar-page, user-vma-mapped, and a nonzero page-aligned userspace address; writable, executable, unknown-protection, zero-size, unaligned, out-of-BAR, overflow, and duplicate active map requests return typed no-side-effect results. The focused smoke reads the same QEMU BAR value through the returned userspace address and brokered DeviceMmio.read32, then exercises explicit DeviceMmio.unmap, a no-op second unmap, remap after unmap, and stale unmap failure after cap release. Release, drop, driver-crash, and reset-disable cleanup revoke any borrowed user VMA before the manager record is detached. This does not add writable MMIO, doorbells, volatile register writes, host physical/IOVA exposure, post-userspace kernel MMIO mappings, IOMMU programming, or a production provider-driver consumer.

Update 2026-05-10 14:12 UTC: DMABuffer.submitDescriptor / DMABuffer.completeDescriptor now keep in-flight descriptor identity on each live DMABuffer slot instead of one pool-global descriptor. The focused DMAPool grant smoke proves slot 0 and slot 1 can both be in flight, duplicate submit on the same slot still fails closed, mismatched completion preserves both live slot descriptors, completing slot 0 decrements aggregate DMAPool.info live_inflight from 2 to 1, explicit freeBuffer of the remaining in-flight slot fails closed, and cap release of an in-flight slot drains only that slot while preserving another slot’s in-flight accounting. This remains bounded manager accounting: no descriptor ring is mutated, no CQ entry is published, no direct DMA is attempted, no IOVA or host physical address is exposed, and no IOMMU programming or production driver consumer is added.

Update 2026-05-10 13:45 UTC: commit 3bbeb3d4 makes DMABuffer.unmap explicitly remove the manager-owned bounce-buffer userspace VMA for the calling process without freeing or scrubbing the bounce page, detaching the DMABuffer record, changing DMAPool.info live buffer/page/in-flight accounting, or touching real DMA state. The typed result reports userspace-bounce-buffer-unmapped / single-page-bounce-buffer / user-vma-unmapped when a live mapping is removed, and dmabuffer-mapping-absent / no-user-mapping with no side effect on a second unmap. The focused smoke maps slot 0 read-only, rejects a writable remap while the read-only mapping remains live, unmaps it, proves the second unmap is a typed no-op, remaps the same VMA writable, and verifies stale unmap after freeBuffer fails closed like info, map, submitDescriptor, and completeDescriptor. While VMA teardown is in progress, the cap records an in-progress mapping state so concurrent map/free/release paths fail closed instead of observing an absent mapping before the page-table unmap and TLB wait complete.

Update 2026-05-10 12:49 UTC: commit 28e16431 extends the manifest-granted DMAPool bounce-buffer allocator from two live result caps to three fixed manager-owned slots. The focused smoke now proves slot 0, slot 1, and slot 2 can be live at the same time, DMAPool.info reports live_buffers=3 live_pages=3 live_bytes=12288, a fourth allocation fails closed as dmapool-already-attached / active-buffer-attached, and freeing slot 0 while slots 1 and 2 remain live leaves two committed/resident/unswappable pages. The same smoke reallocates slot 0 with a fresh generation while slots 1 and 2 remain live, verifies the old cap stays revoked, explicitly frees the reused slot 0 and slot 2 buffers, and releases slot 1 while one bounded descriptor submission remains in flight so parent-first DMAPool release completes only after the last live result buffer detaches and drains manager-owned accounting. This is a fixed three-slot bounce-buffer allocator; it still does not expose direct DMA, IOVA or host physical addresses, descriptor-ring mutation, CQ publication, IOMMU programming, hostile isolation coverage, or a production driver consumer.

Update 2026-05-10 11:44 UTC: commit 75beeeb8 extends the manifest-granted DMAPool bounce-buffer allocator from one live result cap to two fixed manager-owned slots. The focused smoke now proves slot 0 and slot 1 can be live at the same time, DMAPool.info reports live_buffers=2 live_pages=2 live_bytes=8192, a third allocation fails closed as dmapool-already-attached / active-buffer-attached, and freeing slot 0 while slot 1 remains live leaves one committed/resident/unswappable page. The same smoke reallocates slot 0 with a fresh generation while slot 1 remains live, verifies the old cap stays revoked, and releases the parent DMAPool before the remaining buffers so the staged pool detach completes only after the final DMABuffer release drains bounded in-flight accounting. This is a fixed two-slot bounce-buffer allocator; it still does not expose direct DMA, IOVA or host physical addresses, descriptor-ring mutation, CQ publication, IOMMU programming, hostile isolation coverage, or a production driver consumer.

Update 2026-05-10 10:56 UTC: commit 9659763e adds typed DeviceMmio.write32(offset, value) admission on the manifest-granted DeviceMmio cap. The kernel validates the active manager-attached handle, owner/state, region and policy binding, DeviceMmioOperation::Write authority, and 32-bit aligned in-BAR range before returning admission-check-only, real-mmio-write-not-programmed, and side-effect-blocked with register_write=blocked. The focused smoke asserts accepted-shaped admission, unaligned/out-of-BAR/overflow mmio-write32-range-invalid denials, stale-after-release failure, and two sequential grant-cycle runs. This is an admission proof only; no volatile register write, userspace BAR mapping, doorbell, host physical exposure, IOMMU programming, or production driver consumer is added.

Update 2026-05-10 10:20 UTC: commit 3777a50d gives the manifest-granted proof-buffer DMABuffer descriptor path an explicit single in-flight descriptor identity. After one valid submitDescriptor, the focused grant smoke now proves a duplicate submit for the same queue/descriptor returns dmabuffer-descriptor-already-inflight with side-effect-blocked, and a valid-shaped completion for a different live descriptor returns dmabuffer-inflight-descriptor-mismatch with side-effect-blocked. Both refusals preserve DMAPool.info live_inflight=1; the matching completion still restores it to 0. This is still bounded manager accounting only, not descriptor-ring mutation, CQ publication, direct DMA, IOVA export, or a production driver consumer.

Update 2026-05-10 05:03 UTC: unsupported-protection DeviceMmio.map requests now stay on the typed admission result path. The focused grant smoke decodes range_result=mmio-map-prot-invalid, range_reason=unsupported-map-prot, range_side_effect=side-effect-blocked, addr=0, and unchanged manager identity fields instead of treating executable or missing-read protections as a capability exception. This remains admission-only evidence; no real BAR mapping, register access, doorbell write, or host physical address exposure is added.

Update 2026-05-10 09:36 UTC: a harness-hardening follow-up, commit 7dfa1d65, strengthens make run-dmapool-grant around that bounded userspace bounce-buffer map. The current focused smoke maps the first slot generation read-only, proves the zeroed page is readable, and asserts a same-cap writable remap attempt fails while that read-only mapping remains live. It then writes and reads a marker through the second slot generation’s read-write mapping while preserving the existing typed partial-range, protection, and free/release cleanup assertions. This is a stronger proof of the existing mapping permission contract, not new direct DMA, IOVA, host physical, descriptor-ring, CQ, IOMMU, hostile-isolation, or production-driver authority.

Update 2026-05-10 08:53 UTC: the manifest-granted DMABuffer.map path now maps the single manager-owned bounce-buffer page into the caller’s userspace VMA. Accepted readable full-page requests return userspace-bounce-buffer-mapped, single-page-bounce-buffer, and user-vma-mapped with a nonzero page-aligned userspace address while still reporting real_dma_mapping=not-programmed, direct_dma=blocked, and host_physical_user_visible=false. The feature slice proved the mapped page could be reached from userspace, kept typed partial-range/protection denials at addr=0, and proved DMABuffer.freeBuffer / cap release revoke the user mapping before the bounce page is scrubbed and freed. This is userspace access to the kernel-managed bounce buffer only; it does not expose IOVA, host physical addresses, direct DMA, descriptor-ring mutation, CQ publication, IOMMU programming, or a production driver consumer.

Update 2026-05-10 06:37 UTC: the manifest-granted DMAPool.allocateBuffer and DMABuffer.freeBuffer path now uses production bounce-buffer labels for the single-page userspace allocation/free authority. A valid 4096-byte request still mints exactly one same-session DMABuffer result cap, but the cap surfaces now report userspace_dmapool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, record_pool=userspace-bounce-buffer-live, and free_buffer=bounce-buffer-page; zero-live records report zero-live-dmapool-bounce-buffer, and oversized requests fail as size-exceeds-bounce-buffer. The backing frame remains ledger-owned, resident, unswappable, scrubbed before frame free, and hidden from userspace. That slice advanced allocation/free authority; the later DMABuffer.map slice above adds userspace bounce-buffer VMA access while descriptor methods still only update bounded manager accounting, and no IOVA, host physical address, CQ publication, descriptor-ring mutation, IOMMU programming, or production driver consumer is added.

Update 2026-05-10 04:40 UTC: invalid-size DMAPool.allocateBuffer requests now use the same typed no-result-cap rejection shape as duplicate-active requests. Zero-size and over-bounce-buffer calls return result=dmapool-allocation-request-invalid, the exact request reason, side_effect=side-effect-blocked, buffer_present=false, no result cap, and no page mutation instead of relying on a capability exception string.

Update 2026-05-10 04:27 UTC: the manifest-granted Interrupt.mask and Interrupt.unmask methods perform bounded route-state control over the manager-attached dispatch slot. unmask changes claimed-masked to driver-unmasked, mask changes it back to claimed-masked, both preserve delivery counts, and release masks retained manager-grant routes before detaching. The 2026-05-10 16:47 UTC status above supersedes the original post-unmask wait placeholder with deferred no-IRQ cancellation behavior.

Update 2026-05-10 03:23 UTC: DMAPool.allocateBuffer now reports the duplicate-active bounded proof-buffer rejection as typed result data instead of requiring userspace to infer the label from a capability exception string. When the first proof DMABuffer is still live, the second valid-size allocation returns no result cap and reports result=dmapool-already-attached, reason=active-buffer-attached, side_effect=side-effect-blocked, and buffer_present=false; the smoke then proves DMAPool.info still reports one live 4096-byte proof frame. This remains bounded proof-buffer guard evidence, not real multi-buffer DMA allocation or production userspace DMA authority.

Update 2026-05-10 02:52 UTC: the focused DMAPool grant smoke now proves active duplicate allocation is refused while the single proof buffer is still attached. The smoke calls DMAPool.allocateBuffer a second time after the first result DMABuffer becomes live, requires the failure path, then re-reads DMAPool.info to prove the record still reports exactly one live 4096-byte proof frame. This is a bounded proof-buffer guard only; it does not add real multi-buffer allocation, DMA mappings, IOVA/physical exposure, or production driver authority.

Update 2026-05-10 02:21 UTC: DMAPool.info now reports the live manager record accounting for the bounded proof-buffer path. The manifest-granted DMAPool starts as zero-live-dmapool-proof, moves to synthetic-live-dmapool-proof with one live 4096-byte page while the manager-attached DMABuffer result cap is active, and returns to zero-live after typed DMABuffer.freeBuffer scrubs/releases the proof frame. The focused grant smoke asserts the live and after-free accounting lines through the typed runtime client. This remains bounded proof-buffer accounting only; it does not implement real device-visible DMA mappings, IOVA/physical address exposure, production descriptor side effects, or production page lifecycle.

Update 2026-05-09 23:52 UTC: the manifest-granted Interrupt skeleton now also has bounded Interrupt.mask and Interrupt.unmask admission methods. They validate the current manager-attached route through the existing Mask/Unmask authority paths and return typed no-side-effect labels while proving route state and delivery counts stay unchanged. This does not implement real route mask/unmask mutation, hardware acknowledgement, blocking userspace waiters, MSI/MSI-X table programming, or real interrupt delivery.

Update 2026-05-09 23:21 UTC: the manifest-granted Interrupt skeleton now also has a bounded Interrupt.acknowledge admission method. It validates the current manager-attached route through the existing Acknowledge authority path and returns typed no-side-effect labels without acknowledging hardware, waking waiters, or changing delivery counts. This does not implement real interrupt acknowledgement, blocking userspace waiters, real mask/unmask route mutation, or real interrupt delivery.

Update 2026-05-09 19:18 UTC: the manifest-granted Interrupt skeleton now has a bounded Interrupt.wait admission method. It validates the current manager-attached route, delegates to the shared pending-IRQ token validator, and returns typed masked-route labels without waking a waiter or advancing delivery counts. This does not implement blocking userspace waiters, hardware acknowledgement, real route mask/unmask mutation, or real interrupt delivery.

Update 2026-04-28 22:02 UTC: normal shell client @... grants reject explicit badge N selector syntax and preserve delegated client endpoint identity when the selector is omitted. Low-level and hostile-path tests still carry explicit selector fixtures.

Update 2026-04-29 05:59 UTC: the focused chat manifest now routes the kernel singleton chat_endpoint through init to the resident chat server, and the focused chat shell no longer receives a manifest-forwarded chat service export. Its normal chat authority comes from the broker-issued operator shell bundle, matching the default and remote shell paths while the resident bot keeps its manifest service grant.

Update 2026-04-29 07:35 UTC: Session-Bound Invocation Context core gates are landed. Implemented pieces include the process-session invariant, endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, field-granular disclosure gating, session expiry for broker-issued shell bundle caps, guest bundle narrowing, chat session-keyed membership, and Aurelian player state keyed by live endpoint caller-session metadata.

Update 2026-05-01 08:47 UTC: default password-authenticated local operator sessions now mint with no wall-clock expiration. Short-expiry operator proofs remain available by setting a non-default sessionLifetimes.operatorMs in the manifest.

Update 2026-04-29 09:44 UTC: later Gate 4 cleanup put terminal output behind live caller-session dispatch, bound shell-serviced stdio bridge waits to opaque live caller-session metadata, removed remaining badge-facing service-common handler APIs from normal chat paths, and widened non-adventure endpoint caller-session opaque references to 128 bits while preserving the scoped_ref ABI field as the low half.

Update 2026-04-29 10:20 UTC: non-adventure endpoint caller-session references now use an entropy-backed boot secret and HMAC-SHA256 over a non-reused endpoint service-scope id plus kernel session id. The ABI layout is unchanged, but scoped_ref is no longer value-compatible with the old unkeyed hash. References rotate on reboot and endpoint object replacement.

Update 2026-04-29 21:40 UTC: Gate 4 of the Session-Bound Invocation Context milestone is implemented and verified on mainline. Commit faeff80 at 2026-04-29 21:39 UTC records the final closeout: normal chat, adventure, terminal, and stdio paths no longer expose caller-selected receiver identity, focused make run-adventure passed with session-bound Adventure/chat service grants, and focused make docs passed after docs PDF render hardening. The paper/status alignment now records the session-bound shared-service evidence as landed. Remaining work in this area is future stable service-audit identity across upgrades, not additional shared-service migration.

Implemented

Visible Milestone Proofs

  • First Packet: commit b56a5c1 at 2026-04-24 15:37 UTC.
  • First HTTP: commit a4f1722 at 2026-04-24 16:47 UTC.
  • The Unprivileged Stranger: commit d4016ab at 2026-04-22 16:35 UTC.
  • Native Cap Shell: commit f554e88 at 2026-04-23 08:41 UTC.
  • Boot to Shell: commit e5adafb at 2026-04-23 13:39 UTC.
  • The Revocable Read: commit 7f19af2 at 2026-04-23 16:15 UTC.
  • First Chat: commit 2cd85a8 at 2026-04-24 00:13 UTC.
  • Local MUD: commit add7f9b at 2026-04-24 01:40 UTC.
  • Verified Core: commit d43b691 at 2026-04-23 22:09 UTC.
  • Ring as Black Box: commit da5f5e9 at 2026-04-24 03:13 UTC.
  • First AP Scheduler: commit d88bca7 at 2026-04-25 11:31 UTC.
  • Telnet Shell Demo: commit 2834bfc at 2026-04-25 20:25 UTC. Demo scope: plaintext, loopback-only research demo proving the TerminalSession/SessionManager/AuthorityBroker/ RestrictedShellLauncher boundary over a real TCP socket; not a shippable Telnet service. Production remote shell tracks the SSH Shell Gateway in SSH Shell Gateway.
  • Multi-Process SMP Concurrency: commit 3fb89923 at 2026-04-30 09:45 UTC.
  • Default Run Telnet Wiring Retired: commit 367117be at 2026-05-01 16:54 UTC removes default host-local Telnet gateway forwarding and the default manifest service. Current make run starts the foreground shell, chat service, and remote-session CapSet gateway, and forwards only the remote CapSet endpoint. The plaintext demo was later retired with qemu-only kernel TCP listener removal; make run-telnet now exits before QEMU with a retirement diagnostic. Earlier commit 7a155f4 at 2026-04-26 21:02 UTC moves Telnet IAC filtering into the kernel socket terminal (best-effort silent swallow for the BSD/netkit clients we test against; no WONT/DONT replies) so a normal Telnet client lands at the shell prompt without a userspace pre-handoff recv, and refactors the gateway to loop accept/handoff/launch_shell/wait per connection so repeated host Telnet connections succeed.
  • Service Object Routing/Lifecycle: commit a4655f0 at 2026-04-28 14:10 UTC; make run-service-object-routing proves trusted service-object minting, receiver-cookie dispatch, payload-spoof rejection, copy/move IPC transfer, nested spawn delegation, generation-checked receiver cookies, close/revoke rejection, and stale-cookie rejection after record reuse. This is now historical low-level coverage: the implemented Session-Bound Invocation Context baseline gives normal workload processes one immutable session context, and endpoint subject disclosure is private by default.
  • Session Context Invariant: commit 3edee90 at 2026-04-28 16:26 UTC adds make run-session-context, proving every spawned process has one immutable session context, raw child spawns inherit the caller context, a copied UserSession cap cannot relabel invocation context, and a broker-issued launcher can select a validated child context while mismatched profile requests fail closed. Commit 3469c27 at 2026-04-28 16:54 UTC extends the proof so expired guest-session bundle refreshes fail closed at the broker. Commit 687511a at 2026-04-28 17:43 UTC adds privacy-preserving endpoint caller-session metadata and stale normal endpoint rejection: endpoint servers receive only a service-scoped opaque caller-session reference, epoch, and live/stale flags by default, spoofed user/session/role payload labels do not affect the delivered invocation context, and calls after the process session expires fail before transfer preparation or enqueue. Commit f0cb74b at 2026-04-28 18:38 UTC adds session-aware cap transfer scopes: same-session-only caps cannot cross into another session, explicitly shareable caps may cross and then invoke under the receiver session, and service-regrant-only caps require a trusted fixed-session broker/launcher path. Commit 0f92d77 at 2026-04-28 19:33 UTC adds explicit endpoint subject disclosure gating: request without scope and scope without request expose no subject fields, request plus matching scope exposes only allowed fields, and broader requests are narrowed. Commit dc7ece4 at 2026-04-28 20:06 UTC migrates the chat demo to session-keyed membership: chat member state is keyed by the endpoint caller-session reference, the focused chat manifest no longer assigns static chat badges, and make run-chat proves operator-session chat clients plus rejected delegated endpoint relabeling. A follow-up review fix keeps join handles as request data and uses service-assigned visible member labels.
  • 2026-04-28 20:48 UTC narrows guest shell bundles: guest sessions require an explicit manifest guest seed, guest bundles receive no default chat/adventure service endpoint caps, and guest launcher policy comes from resource-profile launcherProfile rather than the full manifest binary list.

Boot and Kernel Baseline

  • Limine boots the x86_64 kernel in QEMU.
  • The kernel initializes dual UART output, GDT, IDT, LAPIC, syscall MSRs, memory management, page tables, heap allocation, and the global capability registry. The legacy PIC/PIT path remains as a fallback when LAPIC timer setup or PIT-based calibration is unavailable.
  • User page-table map, unmap, and protect operations are routed through a TLB shootdown helper keyed by address-space CPU residency. Remote targets get pending full-TLB flush generations plus vector-49 IPIs, and the sender waits for observed target completion after ring dispatch releases address-space, cap-table, and scratch locks. Deferred queue slots are reserved before page-table mutation, and drains flush the current CPU before waiting. Delayed maskable interrupt delivery is covered by syscall-entry and flush-before-user-return hooks. Scheduler CR3 handoff marks the current CPU resident, including AP cpu=1 during the first AP scheduler-owner proof, so remote shootdown targets become active when an address space has run on more than one CPU.
  • AP cpu=1 can own scheduler/user execution under -smp 2: APs register their PerCpu records, program LAPIC timers from the BSP calibration, update AP TSS.RSP0 during context switches, and enter the scheduler from the AP idle loop when AP timer setup succeeds. This proof keeps one scheduler owner; when AP cpu=1 is online with a programmed timer, the BSP stays in kernel idle so the process-wide capability ring is not executed concurrently.
  • The Multi-Process SMP Concurrency milestone is complete at commit 3fb89923 (2026-04-30 09:45 UTC). The scheduler tree includes a narrow reschedule-IPI wake path for halted scheduler-owner loops, and make run-smp-process-scale builds capos-smp-process-scale.iso from system-smp-process-scale.cue, runs repeated -smp 1, -smp 2, and best-effort 4-vCPU QEMU cases, parses compact verified timing lines, stores raw serial logs under target/smp-process-scale/<timestamp>/, and enforces the 1.6x median speedup threshold when KVM-backed evidence is available. The accepted run in target/smp-process-scale/cycle-balanced-default/ recorded 1.608x 1-to-2 speedup. A later capos-bench nested-QEMU/KVM rerun on GCE n2-highcpu-8 at commit 0d89a91b (2026-04-30 11:09 UTC) pinned QEMU to host CPUs 0,1,2,3 and recorded 1.873x capOS 1-to-2 speedup; the matching Linux guest baseline under the same CPU pinning recorded 1.934x. The same run recorded capOS smp4=1111 scaled cycles, or 1.475x from the 1-vCPU baseline but slower than the 2-vCPU median, while Linux recorded 3.774x 1-to-4 speedup; capOS therefore still claims only the 1-to-2 milestone gate. The closeout also reran ordinary run-smoke and run-spawn under -smp 2, with logs in target/smp2-smokes/, covering the default manifest, ring, thread lifecycle, park cleanup, generic child waits, and process exit.
  • The kernel creates its own page tables with per-section permissions and keeps the higher-half direct map for physical memory access.
  • SMEP/SMAP are enabled when the QEMU CPU advertises support.

Code: kernel/src/main.rs, kernel/src/arch/x86_64/, kernel/src/mem/.

Validation: cargo build --features qemu, make run-smoke.

Process and Userspace Runtime

  • Processes have isolated address spaces, one or more internal Thread records with per-thread kernel stacks and saved CPU context, CapSet bootstrap pages, capability rings, and local capability tables.
  • ELF loading supports static no_std userspace binaries and TLS setup.
  • capos-rt owns the userspace entry path, allocator initialization, ring-client access, typed clients, result-cap parsing, and owned-handle release.
  • capos-rt is the only source owner for the userspace _start, panic, global allocator, raw syscall, and capos_rt_main handoff surfaces; a source check guards this split.
  • targets/x86_64-unknown-capos.json defines the capOS userspace target for booted init, demos, shell, and capos-rt runtime builds; the kernel default remains x86_64-unknown-none.
  • The 7.1.0 in-process threading contract defines the split between process-owned address-space/capability state and thread-owned execution state, plus thread/kernel-stack quotas and generation-checked waiter identity. 7.2.0 moved saved context, kernel stack, FS base, and block state into Thread records; 7.2.1 schedules and wakes generation-checked ThreadRef values; 7.2.2 adds process-local ThreadSpawner and ThreadHandle caps plus ThreadControl.exitThread for create, join, detach, self-join rejection, exit-code observation, and last-thread process exit; and 7.2.3 adds private ParkSpace wait/wake with timeout, wake, and reserved waiter completion semantics. SharedParkSpace park-words remain future work.

Code: kernel/src/spawn.rs, kernel/src/process.rs, capos-rt/src/, init/src/main.rs, demos/, shell/, targets/x86_64-unknown-capos.json, tools/check-userspace-runtime-surface.sh.

Design: In-Process Threading, Park Authority.

Validation: tools/check-userspace-runtime-surface.sh, make capos-rt-check, make init-capos-build, make demos-capos-build, make shell-capos-build, make capos-rt-capos-build, make run-smoke, make run-spawn.

Programming Language Support

  • Native capOS Rust is the only implemented booted Rust language path. It uses #![no_std], alloc, capos-rt, static ELF binaries, and the targets/x86_64-unknown-capos.json custom target.
  • Native C boots through the libcapos C-substrate (Phase 0; make run-c-hello exercises Console + Timer + EntropySource + a 4 KiB anonymous VM roundtrip) and through the POSIX adapter (Phase P1.2 Phase B; the historical make run-posix-dns-smoke resolved example.com over the qemu-only kernel UdpSocket cap via QEMU slirp DNS at 10.0.2.3:53, but that target is retired after kernel socket-owner removal; make run-posix-pipe-smoke and make run-posix-spawn-smoke exercise pipe, fork-for-exec, direct posix_spawn, minimal file actions, read, and waitpid over ProcessSpawner / Pipe; the Console-backed stdio proof landed at commit aa6a56d7 (2026-05-13 11:03 UTC) and make run-posix-stdio-smoke exercises write(1, ...) and write(2, ...) over a granted Console while proving read(0, ...) stays closed without a stdin grant; the file/directory fd closeout landed at commit f97d9833 (2026-05-23 06:23 UTC) and make run-posix-file exercises open(), write(), lseek(), read(), opendir(), readdir(), and closedir() over a granted root Directory; make run-posix-printf exercises the focused printf/string subset: formatted output, string/mem, numeric conversion, and ctype helpers; make run-posix-signal-time exercises Timer-backed time, nanosleep, and sleep plus the documented fail-closed signal-delivery stubs). Both bypass WASI – they are static ELF binaries linked against libcapos.a and, for POSIX smokes, libcapos_posix.a. POSIX posix_spawn() accepts argv/envp for source compatibility but does not deliver them until LaunchParameters / environment support lands. Broader C/libcapos surface and full POSIX adapter scope remain future design.
  • Sandboxed wasm32-wasi is the first booted WASI-hosted language path. Phase W.5 (filesystem; capos-wasm/src/wasi/fs.rs) closed and is exercised by make run-wasi-fs: the wasm-host installs the manifest-granted root Directory cap as a preopened fd, the WASI payload writes and reads back a file through path_open / fd_write / fd_close / re-open / fd_read, and the preopen sandbox refuses absolute paths and parent-escape .. segments. The WASI host adapter closed Phase W.4 at commit b0f6939f (2026-05-07 20:09 UTC); Phase W.3 closed at commit ca41ecc1 (2026-05-07 18:29 UTC; the surrounding W.3 narrative stamps from 2026-05-07 18:25 UTC predate the feat commit by a few minutes); Phase W.2 closed at commit 7bfcb1d8 (2026-05-07 10:53 UTC): the wasm-host userspace binary (capos-wasm/ standalone crate over vendored wasmi 1.0.9) hosts WebAssembly modules whose wasi_snapshot_preview1 imports are backed by typed capOS capabilities (Console + Timer + BootPackage, the per-instance argv text grant from W.3, the 2026-05-13 bounded environment text grant through initConfig.init.wasiEnv, and the optional W.4 EntropySource cap looked up from the per-instance CapSet under the well-known name random). make run-wasi-hello-rust, make run-wasi-hello-c, make run-wasi-cli-args, and make run-wasi-env are the regression smokes; make run-wasi-random is the W.4 granted gate (the payload reads N=64 bytes through random_get and prints [wasi-random] entropy_bytes=64 entropy_bound_ok=true) and make run-wasi-random-ungranted is the matching refusal gate (the same payload observes ERRNO_NOSYS = 52 from the closed-fail branch when the manifest withholds the grant). A 2026-05-13 authority-free compatibility slice adds make run-wasi-stdio-fd, whose direct-import payload proves clock_res_get(MONOTONIC), sched_yield, fd_fdstat_get(1/2), and fd_seek(1/2) no longer return ERRNO_NOSYS; make run-wasi-env proves one granted environment value reaches a WASI payload through environ_get / environ_sizes_get; make run-wasi-preview1-refusals remains the storage/socket fail-closed gate for path_open, fd_read, sock_send, and sock_recv. Wall-clock support stays deferred until capOS has a typed WallClock/RealTimeClock cap; clock_time_get(CLOCKID_REALTIME) keeps returning ERRNO_NOSYS until that cap lands.
  • Rust std, C++, Go, Python, JavaScript/TypeScript, and full POSIX shell/utilities are not implemented as supported capOS runtime paths.
  • Lua has a Phase 0 in-tree capability-aware Lua-subset interpreter under demos/lua-smoke/ (gated by make run-lua-smoke); it validates the long-term capability-userdata host API design but is NOT a PUC Lua dialect-compatible runner. Dialect compatibility waits on the future C/libcapos PUC port.
  • The planned compatibility story is split by adapter type rather than one generic “compatibility layer”: native runtime adapters for languages such as Rust and Go, capability-native bindings over Cap’n Proto interfaces, POSIX compatibility adapters over scoped file/socket/process caps, and WASI host adapters backed by capabilities.

Design: Programming Languages, Userspace Runtime, Userspace Binaries, Go Runtime, Lua Scripting.

Validation: current native Rust validation uses tools/check-userspace-runtime-surface.sh, custom-target userspace builds, and the runtime QEMU smokes listed above. Native C/POSIX validation is through the focused make run-c-* and make run-posix-* smokes named in this section, including make run-posix-file for the File/Directory fd surface. WASI filesystem validation uses make run-wasi-fs (Phase W.5 preopened-directory round-trip and sandbox proof).

Capability Ring and IPC

  • The shared ring ABI supports CALL, RECV, RETURN, RELEASE, CANCEL, NOP, and compact ParkSpace PARK/UNPARK transport operations.
  • cap_enter processes submissions and can block until completions arrive or a timeout expires.
  • Endpoints route ring-native IPC between processes.
  • Direct IPC handoff lets a blocked receiver run before unrelated round-robin work after a matching CALL arrives.
  • Transport errors and application exceptions are surfaced through CQEs and typed runtime client errors.
  • Ordinary capability implementation errors, revoked ordinary/endpoint use, live endpoint target errors after endpoint identification, and endpoint RETURN application failures use serialized CapException payloads when a caller result buffer can safely receive one. No-payload application failures report CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED; malformed transport metadata and unsafe result-buffer paths remain transport errors.
  • Endpoint RETURN can propagate a serialized CapException from a userspace endpoint server to the original cross-process caller.
  • debug_tap builds export metadata-only ringtap: records for observed SQEs and posted CQEs on the QEMU/debug UART. The format is fixed, bounded, and deliberately records payload_len = 0 until a separate payload-capture authority lands.
  • tools/ringtap-viewer/ parses ringtap: logs into SQE/CQE summaries and can decode authorized Cap’n Proto payloads for CapException, TerminalSession.readLine params, and ProcessHandle.wait results when future tap output includes payload_schema and payload_hex fields.
  • make run-ringtap-failing-call boots the default shell smoke with debug_tap, drives the known typed-call method-99 launcher failure, runs the viewer over the captured kernel log, and leaves offline inspection logs in target/ringtap-failing-call-*.log.

Code: capos-config/src/ring.rs, kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, kernel/src/debug_tap.rs, capos-rt/src/ring.rs, capos-rt/src/client.rs, tools/ringtap-failing-call-smoke.sh, tools/ringtap-viewer/.

Validation: cargo test-ring-loom, make run-smoke, make run-spawn, make run-smoke CARGO_FLAGS='--features debug_tap', cd tools/ringtap-viewer && cargo test, make run-ringtap-failing-call.

Capabilities

Implemented kernel capabilities include:

  • Console for debug UART output.
  • TerminalSession for the separate session UART with line input/output, bounded readLine, visible/hidden echo, structured cancellation, and a single move-only foreground holder.
  • BootPackage for read-only, chunked boot manifest reads from init.
  • FrameAllocator for typed MemoryObject frame ownership grants.
  • MemoryObject for owned physical frame ranges, caller-local map/unmap/protect, and final backing release after cap/mapping teardown.
  • Endpoint for IPC rendezvous.
  • VirtualMemory for anonymous user page map, unmap, and protect operations.
  • Timer for monotonic tick/time reads and bounded sleep completions through the capability ring.
  • ThreadControl for runtime-owned FS-base get/set and current-thread exitThread on the current thread.
  • ThreadSpawner and ThreadHandle for process-local in-process thread creation, one-shot join, exit-code observation, detach-on-release, and retained-status cleanup.
  • ParkSpace for process-local private park wait/wake on 32-bit userspace words, with per-thread blocking and reserved waiter CQE credits.
  • ProcessSpawner and ProcessHandle for init-driven child process creation and wait semantics.
  • Retired NetworkManager, TcpListener, and TcpSocket qemu-only kernel socket capabilities. Their entry points now fail closed; the active TCP/UDP socket authority shape is the Phase C userspace network-stack path.

MemoryObject holders and anonymous VirtualMemory mappings charge the same per-process ResourceLedger::frame_grant_pages quota. Mapping a held MemoryObject records borrowed address-space pages and reserves mapping quota until unmap so backing frames cannot stay pinned after the cap charge is released.

Code: kernel/src/cap/console.rs, kernel/src/cap/terminal_session.rs, kernel/src/cap/boot_package.rs, kernel/src/cap/frame_alloc.rs, kernel/src/cap/endpoint.rs, kernel/src/cap/virtual_memory.rs, kernel/src/cap/timer.rs, kernel/src/cap/thread_control.rs, kernel/src/cap/thread_handle.rs, kernel/src/cap/process_spawner.rs, kernel/src/cap/network.rs.

Validation: make run-smoke, make run-memoryobject-shared, make run-spawn, make run-shell, make run-terminal, make run-net, cargo test-lib.

Capability Transfer and Release

  • IPC CALL and RETURN support sideband transfer descriptors.
  • Copy and move transfer are implemented.
  • Move transfer reserves the sender slot until destination insertion and commit.
  • Transfer result caps carry interface ids to userspace.
  • CAP_OP_RELEASE removes local capability-table slots. Runtime owned-handle drop queues one local release, and Runtime::flush_releases() forces queued releases when code cannot wait for the next ring-client acquisition/drop.

Code: kernel/src/cap/transfer.rs, kernel/src/cap/ring.rs, capos-lib/src/cap_table.rs, capos-rt/src/ring.rs.

Validation: cargo test-lib, make run-smoke.

Manifest Tooling and Smokes

  • tools/mkmanifest turns system.cue into a Cap’n Proto boot manifest.
  • The build uses repo-pinned Cap’n Proto and CUE tool paths through the Makefile; direct mkmanifest invocation also rejects missing, unpinned, or version-mismatched CUE compilers. mkmanifest cue-to-capnp extends the same pinned-tool policy to general CUE-authored data messages: it exports CUE as JSON, validates CAPOS_CAPNP, and delegates arbitrary specified schema-rooted struct serialization to capnp convert json:binary.
  • Default scripted QEMU smoke still uses the focused shell-led system-smoke.cue path: anonymous session on boot, login prompting for username before hidden password entry, generic failed-auth output on a wrong password, successful operator login, broker upgrade to the operator bundle, child terminal isolation, stale-handle release, single-capos-shell init boot, and clean halt. The default operator-facing system.cue path is init-owned and is exercised by make run.
  • system.cue is now the default init-owned manifest. The kernel starts only the first init service, and init starts capos-shell, the remote-session CapSet gateway, and the default demo services from the manifest service graph. The shell receives terminal/creds/sessions/audit/broker caps and mints its own anonymous session.
  • system-shell.cue is the focused anonymous-shell proof (no verifier), which exercises the shell in its anonymous bundle and asserts that the anonymous launcher rejects spawns because its allowlist is empty.
  • system-chat.cue is the focused First Chat prototype proof. It starts a resident Chat endpoint service on the kernel singleton chat_endpoint, a resident bot participant, and the shell; make run-chat drives run "chat-client" with explicit StdIO plus the broker-issued chat endpoint grant, sends one line, and checks that the bot reply is printed by the foreground client.
  • system-adventure.cue is the focused adventure prototype proof. It keeps adventure out of shell builtins and drives run "adventure-client" through explicit StdIO, adventure, and chat endpoint grants. See the Aurelian Frontier (proof slice) page for the current mission, commands, and transcript coverage.
  • system-paperclips.cue is the focused clean-room Paperclips-style terminal demo proof. As of commit 532207c1 (2026-04-30 20:54 UTC), it boots Paperclips server services plus a terminal client. The server owns generated content, game state, regular timer cadence, unlock checks, game-rule mutation, and proof-command gating; the terminal client receives explicit StdIO plus a PaperclipsGame endpoint and renders server-mode help from the server’s structured command specs. Commit e9ae4e97 (2026-04-30 22:02 UTC) adds structured plain-status snapshots, so server-mode plain status is rendered from the server’s structured PaperclipsStatusSnapshot. Commit 32462e9f (2026-04-30 22:32 UTC) adds the structured project-list follow-up: server-provided project entries for terminal-rendered plain projects, while project <id> remains a raw text request that mutates server-owned game state. make run-paperclips first proves that normal server authority rejects run <ms> fast-forward plus rejection of a forged proof_accelerator: @timer grant, then relaunches against the proof server endpoint with the focused manifest’s explicit proof_accelerator cap for transcript acceleration. The accelerated proof drives one-at-a-time manual production, locked-purchase and insufficient-funds refusal output, bulk-manual rejection, high-price zero-demand sale refusal, no-wire manual production refusal, explicit sales, immediate repeat-sale cooldown refusal, repeatable marketing, autoclipper unlock, real-time automation, generated Cap’n Proto content loading, first project completion, scaled business-phase production, the design-search and forecast-engine project chain, survey-drones, the visible == autonomous phase == transition, then representative autonomous drone/factory scaling with local-matter conversion and additional clip production, mesh-coordination, seed-probes, the visible == cosmic phase == transition, one bounded probe interval with cosmic matter conversion, probe replication, and additional production, then asserts a compact status --json machine-readable status line and verifies final-conversion remains locked before clean process exit through the native shell. Active schema, content, rules, and smoke sources use clean-room Strategy internals, and host tests reject explicit zero-count purchases without mutating state. Host tests cover the one-real-time-hour non-completion property under a generous normal-play creativity upper bound.
  • demos/service-common/ holds the shared caller-session endpoint loop and chat actor bootstrap/polling helpers used by the chat/adventure resident services, chat bot, and adventure NPC processes. New shared endpoint loop code uses EndpointUserData; the old badge-named user-data alias remains only for compatibility while peer branches migrate. Shared event queues remain deferred until another service has queue needs matching chat history/inbox behavior.
  • system-spawn.cue remains the focused ProcessSpawner smoke for endpoint, IPC, VirtualMemory, Timer, ThreadControl, FrameAllocator cleanup, and hostile spawn inputs. make run-spawn asserts that the kernel boot-launches only the standalone init, that init validates BootPackage metadata, and that the init-owned manifest executor spawns and waits for every focused child service, including the timer-smoke monotonic now/sleep proof, timer-flood per-process Timer sleep quota proof, runtime-fs-base runtime-owned FS-base proof, single-thread-runtime VirtualMemory plus Timer runtime checkpoint, and thread-lifecycle in-process thread/park proof.

Code: tools/mkmanifest/, system.cue, system-chat.cue, system-adventure.cue, system-paperclips.cue, system-spawn.cue, demos/, capos-rt/.

Validation: cargo test-mkmanifest, make generated-code-check, make run-smoke, make run-chat, make run-adventure, make run-paperclips, make run-spawn.

Partially Implemented

Login Boot and Init-Owned Spawn

Default make run now uses the init-owned default manifest. The kernel validates the kernel-owned boot boundary, boot-launches standalone init, and leaves the service graph plus login/session/broker flow in userspace. Init starts the foreground capos-shell service, resident demo services, and the host-local remote-session CapSet gateway; make run forwards host-local TCP to guest port 2327 for the remote CapSet path only. The foreground shell mints its own anonymous UserSession on boot; login and setup commands drive CredentialStore/SessionManager/AuthorityBroker to upgrade the session in place. Local password login is username-aware on the ordinary foreground shell path, while durable multi-account credential storage remains future work. The plaintext Telnet gateway was only a focused make run-telnet / system-telnet.cue research demo. That target is retired after qemu-only kernel TCP listener removal, and the gateway demo, its manifest, and the kernel SocketTerminalSession shim are removed; use the in-guest login smokes for current shell coverage and rebuild any socket-backed terminal proof on the Phase C userspace network stack before using it as validation.

The focused init-owned spawn path remains under make run-spawn. There the kernel boot-launches init with Console, BootPackage, and ProcessSpawner. Parent endpoint facets used for later service-sourced imports are returned by ProcessSpawner during child spawn, not granted at boot. init performs metadata-only manifest validation, resolves kernel and service cap sources, spawns children through ProcessSpawner, records exports, waits for children, and reports failures through Console output. The QEMU target now asserts the single-init boot markers, the three-cap init bundle, BootPackage validation, child exit records, manifest child waits, spawn-loop completion, and clean halt.

Measurement startup now follows the same boundary. make run-measure uses a focused system-measure.cue manifest where the kernel boots standalone init with Console, BootPackage, and ProcessSpawner, and init spawns ring-nop with Console, FrameAllocator, the measurement-only NullCap, and the measurement-only ParkBench cap through ProcessSpawner grants. It also spawns thread-lifecycle with ThreadControl, ThreadSpawner, ParkSpace, and a measurement marker cap. The demos print compact versus generic park-shaped failed-wait/empty-wake cycle averages plus real ParkSpace blocked/resume cycle averages before the measure-feature kernel prints segmented dispatch counts, total cycles, and averages for SQE processing, validation, cap lookup, capnp decode, method body dispatch, CQE posting, and waiter wake/check. Kernel bootstrap now loads only initConfig.init and validates only the kernel-owned manifest boundary; mkmanifest and init own initConfig.services graph validation for focused BootPackage executor manifests.

SSH Shell Gateway

The SSH Shell Gateway proof targets are implemented, covering the authority prerequisites and fixture authentication path that precede an encrypted SSH transport. Bounded QEMU smokes exist for:

  • Host-key fixture signing (make run-ssh-host-key): a development-only non-production SshHostKey cap returns public metadata, signs bounded fixture exchange hashes for QEMU proof, fails wrong-algorithm requests closed, and does not leak the private host-key seed.
  • Authorized-key lookup (make run-ssh-authorized-key): a manifest-seeded AuthorizedKeyStore cap accepts configured ssh-ed25519 public keys mapped to seed-account principals, denies unknown, disabled, and unsupported-algorithm keys, and does not leak private key material.
  • Public-key session minting (make run-ssh-public-key-session, make run-ssh-public-key-auth): SessionManager.sshPublicKey rechecks configured key records, verifies a bounded ssh-ed25519 signature over fixture authentication bytes, mints a publicKey UserSession only after the signature succeeds, and logs stable audit reason codes for each denial path without leaking principal or profile metadata. UserSession.auditContext fails closed after logout through the same ensure_session_live guard as info().
  • Unsupported feature policy (make run-ssh-feature-policy): a capos-config::ssh_policy surface classifies password auth, exec requests, SFTP, direct-tcpip, agent/X11 forwarding, env import, and multiple session or shell channels into stable audit reason codes; all denied paths produce event=session result=denied reason=policy audit records.
  • Restricted shell launcher (make run-restricted-shell-launcher): a manifest-declared RestrictedShellLauncher cap launches only capos-shell, injects supplied terminal/session caps plus child-local stdio, rejects session/profile mismatch and kernel-sourced or dangerous pass-through grant attempts, and strips hidden process-supervision result caps.
  • Bounded terminal-host proof (retired): make run-ssh-gateway-terminal-host wired scoped TcpListenAuthority listen, authorized-key lookup, public-key UserSession minting, broker profile matching, socket-to-TerminalSession conversion, and restricted shell launch over a host-local plain TCP connection. It sat on the qemu-only kernel socket owner and the kernel SocketTerminalSession, both of which are retired with the userspace network-stack migration; the smoke now exits with a retirement diagnostic and a future terminal host must target the userspace network stack.

Encrypted SSH packet transport, OpenSSH-compatible key exchange and channel handling, full SSH userauth transcript validation, channel binding, TerminalSessionFromByteStream terminal-factory wiring, a terminal host over the userspace network stack, and a production OpenSSH harness remain open. The landed proofs use development/fixture key material; they are not a production SSH service and are not safe for non-loopback deployment.

Design: SSH Shell Gateway.

Code: kernel/src/cap/ssh_host_key.rs, kernel/src/cap/authorized_key_store.rs, kernel/src/cap/restricted_launcher.rs, capos-config/src/ (ssh_policy), demos/ssh-*/, tools/qemu-ssh-*-smoke.sh.

Validation: make run-ssh-host-key, make run-ssh-authorized-key, make run-ssh-public-key-session, make run-ssh-public-key-auth, make run-ssh-feature-policy.

Hardware and Networking

The hardware bring-up path has bounded ACPI RSDP/RSDT/XSDT, MADT, MCFG, DMAR, and IVRS diagnostics plus reusable PCI config-space access through legacy I/O ports and Q35 PCIe ECAM, and the x86 path programs masked MADT-backed I/O APIC routes for legacy IRQs while honoring source overrides. IOMMU reporting is policy-only: malformed DMAR/IVRS structures fail closed, DMAR DRHD include-all or single-hop PCI endpoint device-scope metadata can mark retained DMA-capable PCI functions as IOMMU-attached/covered; bridge and multi-hop scopes remain diagnostic-only until PCI topology traversal exists, and include-all fallback fails closed when retained DMAR coverage metadata is capped. Direct DMA remains blocked with zero trusted domains, and every retained DMA-capable prototype function requires bounce buffering. The current staged domain-policy proof also reports that future claimed DMA-capable devices use a device-manager-owned per-device domain or trusted sharing group, exported device addresses are IOVA-only, host physical addresses are not user-visible, remapping tables are not programmed, and production userspace hardware authority is still blocked. That blocked-direct-DMA admission decision now runs through the host-tested capos-lib::device_authority helper used for device-authority validation, so the PCI proof line and diagnostics mirrors share the same fail-closed labels for absent, malformed, unsupported, or retained-capped remapping metadata. Active device-manager DMAPool policy records also carry a software remapping-domain ledger staging record. The QEMU lifecycle/imported-live proofs bind it to the active record and matching handle with diagnostics-only static ACPI/PCI coverage, remapping_domain_owner=device-manager, remapping_domain_ready=false, remapping_tables=not-programmed, iova_export=disabled-future-only, direct_dma=blocked, and host_physical_user_visible=0; no remapping tables, direct-DMA trusted domains, host physical addresses, or IOVAs are exposed. Bounded manifest grants now exist for DMAPool, DeviceMmio, and Interrupt: DeviceMmio exposes bounded .info, read-only userspace .map / .unmap over boot-preseeded BAR pages, brokered read-only .read32 backed by the same boot-preseeded 64-page kernel mapping cache, and bounded brokered claimed-register .write32; Interrupt exposes bounded .info plus admission-only .wait, .acknowledge, .mask, and .unmask, and DMAPool reports conservative .info status and can mint eight fixed manager-attached bounce-buffer DMABuffer result caps via request-shaped allocateBuffer. Valid one-page bounce-buffer requests report requested bytes, allocated bytes, page count, and request labels; zero-size and over-bounce-buffer requests fail closed as dmapool-allocation-request-invalid before result-cap or page mutation. The same DMAPool.info result now exposes the attached manager record’s owner/pool labels, live buffer/page/byte counts, in-flight submissions, and committed/resident/unswappable/scrub-before-release flags. The bounded manifest smoke proves zero-live accounting before allocation, one, two, three, and eight live 4096-byte bounce pages while result DMABuffers are active, a ninth-allocation full-pool rejection, restoration to the existing four-buffer descriptor working set after freeing slots 4 through 7, three-live accounting after freeing one descriptor-test slot, slot-0 reuse while the other three descriptor-test slots remain live, and zero-live again after all typed DMABuffer releases complete. That DMABuffer now supports typed .freeBuffer, single-page userspace bounce-buffer .map and .unmap, and bounded manager-accounted .submitDescriptor / .completeDescriptor. The .map path validates the live bounce-buffer epoch, accepts readable full-page requests, maps the manager-owned bounce page into the caller’s userspace address space, returns userspace-bounce-buffer-mapped, single-page-bounce-buffer, user-vma-mapped, and a nonzero userspace address, rejects zero-size, partial-page/out-of-range, and executable/unknown protections with typed range/protection labels, and still reports real_dma_mapping=not-programmed, direct DMA blocked, and no host physical address exposure. The .unmap path validates the live buffer record first, removes only that cap-owned borrowed VMA for the caller process, reports a typed no-op when no mapping is present, and leaves page lifetime plus pool and descriptor accounting unchanged. The descriptor paths carry request labels plus the bounded proof counts (queue_count=4, descriptor_count=8, buffer_bytes=4096): valid submits return manager-inflight-recorded and raise the attached DMAPool.info live_inflight count to 1, valid completions return manager-inflight-completed and restore it to 0, and valid completions with no outstanding submission return dmabuffer-no-inflight-submission with side-effect-blocked. The manager record also tracks the single live descriptor identity: duplicate submits for that live queue/descriptor return dmabuffer-descriptor-already-inflight, and valid-shaped completions for a different descriptor return dmabuffer-inflight-descriptor-mismatch; both paths leave live_inflight=1 until the matching completion arrives. Out-of-range queues/descriptors, zero submit lengths, submit lengths beyond the bounce buffer, and completion lengths beyond the bounce buffer still fail closed as dmabuffer-descriptor-request-invalid without mutating the counter. The default manager-accounting descriptor path also preflights result serialization before mutating accounting, and cap-table release drains bounded in-flight accounting before detaching so a removed userspace cap cannot strand the bounce buffer. The selected provider-TX exception for make run-ddf-provider-consumer is narrower and runtime-visible: queue 1 submits may publish the selected eight-entry TX queue depth, descriptors 0..7, into the existing kernel-owned virtio-net TX ring after the same DMABuffer authority, bounce-scrub, and live notify_mmio policy gates; that selected path then rings exactly one notify doorbell per accepted provider descriptor and lets DMABuffer.completeDescriptor consume the stored software descriptor generation from the real TX used ring before clearing each manager in-flight record. Live tx_interrupt.wait calls over that selected route can observe the ordered bounded completion events, and provider tx_interrupt release proves bounded teardown by draining seven incomplete descriptor handoffs or retiring seven delivered-but-unacked completion events with no pending provider waiters. Wrong queue, stale DMABuffer, stale notify policy, inflight publication, duplicate completion, and stale tx_interrupt issue paths still fail closed before their guarded side effects. DMABuffer.freeBuffer, cap release, driver-crash, reset-disable, and drop cleanup still revoke any remaining user mapping before scrub/free. None of these methods program direct DMA, publish arbitrary CQ entries, transfer full virtio-net ownership, or expose host physical addresses. Allocations beyond the eight fixed bounce-buffer slots, DMA map/submit/complete side effects outside the selected provider-TX proof, writable userspace BAR mappings, arbitrary MMIO writes and doorbells, unbrokered register access, blocking IRQ wait beyond the bounded selected-route completion waiter, real hardware acknowledgement, hardware IRQ ownership, hardware mask/unmask, hardware MSI/MSI-X programming, and general IRQ delivery remain blocked; parent-first DMAPool release defers until all live DMABuffer slots release, or successful DMABuffer driver-crash/reset-disable cleanup frees the remaining bounce pages and completes the staged zero-live pool detach. make run-hardware-grant-cycle proves sequential DeviceMmio/Interrupt skeleton grants can release and reacquire fresh DeviceMmio mapping generations while the Interrupt grant retains its source generation and refreshes only the route generation, and its read-only HardwareAuditLog.snapshot check decodes those two-cycle audit records through the current volatile unsigned audit surface. make run-hardware-audit-interrupt-waiter also decodes recent boot-time DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable lifecycle records through that typed volatile snapshot path, and its cursor snapshot requests from the first older retained DeviceMmio lifecycle sequence to decode rows outside the default latest 16-record tail. make run-hardware-audit also proves below-oldest cursor clamping and past-end empty cursor metadata on the overflowed volatile ring, and the QEMU-only local-ring proof now checks those same cursor edges without mutating live audit records. Unsafe retained metadata fails closed, and prototype devices remain kernel-owned bounce-buffer-only. The device-manager DMAPool attachment path now stores that explicit bounce-buffer policy on the attached pool record and the QEMU lifecycle/imported-live proofs read it through the active manager record and matching DmaPoolHandle; no new cross-manager lock was added, so the existing PCI_DEVICE_MANAGER before DEVICE_INTERRUPT_ROUTES order remains unchanged. PCI memory-BAR subregions are validated and mapped through a shared kernel helper before in-kernel drivers use device MMIO, and PCI capability walking reports non-programming MSI/MSI-X metadata for the QEMU virtio-net function. make run-pci-nvme now applies the same metadata-only PCI path to a QEMU NVMe controller: class/subclass/programming-interface, memory BAR, capability, and MSI-X metadata are visible, while userspace device authority, DMAPool, DeviceMmio, Interrupt, controller init, admin queues, I/O queues, MMIO doorbells, and direct DMA remain not started or blocked. make run-diagnostics now boots a feature-gated COM1 early-boot diagnostics prompt before capability, scheduler, timer, manifest, or userspace startup, with bounded commands for status, CPU, memory, ACPI, PCI, IRQ, timers, devices, logs, reboot placeholder, and halt. The ACPI and PCI diagnostics commands now also print bounded MADT/MCFG/DMAR/IVRS record details, PCI function/config-header summaries, BAR summaries, capability counts, MSI/MSI-X summaries, and bounded PCI DMA-attachment policy counters/details when present; devices reports PCI totals plus network/storage/display/bridge class counts and mirrors the current DMA-domain policy without owner identity: direct DMA is blocked, trusted-domain and ready-domain counts are zero, remapping tables are not programmed, future exported device addresses are IOVA-only, userspace device authority is not started, and prototype devices remain kernel-owned bounce-buffer-only. The current runtime-state diagnostics slice also attaches QEMU virtio-net plus the second virtio-rng proof device before the prompt and reports virtio_net=ready, bounded RX/TX virtqueue ring state, MSI-X route/vector/counter state, live buffer state, the kernel-owned DMA owner/pool ledger, and device interrupt route aggregates plus per-route delivery counters. Future driver extensions, production teardown/lifecycle diagnostics, IOMMU remapping table programming, production DMAPool, and userspace driver authority remain planned. The QEMU virtio-net path has a make run-net boot target, modern virtio PCI transport discovery for the common, notify, ISR, and device-specific MMIO regions, feature negotiation, and RX/TX split-virtqueue initialization, a TX descriptor completion proof, minimal Ethernet ARP resolution, and ICMP echo validation against the QEMU user-mode gateway. It no longer wraps the virtio driver in a kernel smoltcp interface or performs a kernel TCP HTTP GET; the remaining run-net evidence is a lower-layer QEMU fixture, and TCP/UDP socket proof lives under the Phase C userspace network-stack gates. QEMU currently exposes a transitional 1af4:1000 virtio-net function with modern vendor capabilities; capOS accepts that shape only through the modern capability layout and now selects a usable MSI-X capability for config/RX/TX table entries, records kernel-owned MSI-X sources for config/RX/TX in the device interrupt dispatch table, programs those entries through the typed PCI MSI-X table helper using a bounded first-fit LAPIC device MSI vector pool, lets the in-kernel virtio-net owner claim and unmask only its routes, assigns the virtio common/config and queue MSI-X vector fields, and keeps descriptor, ARP, and ICMP fixture evidence in make run-net after the kernel L4 owner is retired. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates. A masked lifecycle probe on the unused virtio-net MSI-X table entry proves claimed-route reassignment, stale old-route rejection, old-vector unregistered delivery, reassigned-vector masked delivery, unsupported-vector delivery, and release before the live routes are registered. The QEMU virtio-rng second-device metadata path also exercises the device-manager ownership model, the claimed MSI-X route handoff, and a bounded teardown-trigger contract for cap release, process exit, driver crash, reset/disable escalation, interrupt waiter, future DeviceMmio, and future DMAPool trigger labels through the same claim/transfer/revoke/release transaction. It also proves a bounded manager-owned DeviceMmio record lifecycle: active-owner attach, stale and owner-mismatch rejection, duplicate attach rejection, active RingDoorbell validation through capos-lib::device_authority, binding to the first decoded PCI memory BAR region from the tested PciDevice, region_source=pci-decoded-memory-bar, region_bound_to_manager=true, bar_present=true, bar_memory=true, bar_base, bar_length, fail-closed wrong-BDF, wrong-BAR, and zero-length region metadata as devicemmio-region-invalid, no invalid mapping created, negative side-effect blocking, stale-after-revoke rejection as devicemmio-stale-handle with stale-owner-generation and side-effect-blocked, the RevokingHandles -> MmioRevoked transition blocking while attached, bounded detach, and no userspace handle, real BAR mapping, or doorbell write. For the bounded userspace DeviceMmio.map path, the shared pure capos-lib::device_authority validator now accepts only page-aligned in-BAR read-only requests and denies writable, executable, unknown-protection, zero-size, unaligned, offset-size overflow, and out-of-BAR requests before the kernel maps anything. Accepted map requests install a caller-owned borrowed read-only userspace VMA over boot-preseeded BAR pages only, and DeviceMmio.unmap removes that borrowed VMA after checking the active manager record and caller address space. Duplicate map, second unmap, stale release, drop, driver-crash, and reset-disable paths all preserve fail-closed VMA revocation and no-side-effect labels. For the brokered userspace DeviceMmio.read32 path, the same authority layer validates active handle/state/policy and in-BAR dword-aligned offsets before a single kernel-side volatile read from the boot-preseeded cache. The cap call path never installs new kernel mappings, returns typed mmio-read32-range-invalid denials for unaligned, overflowing, and out-of-BAR offsets, reports register_read=performed only for accepted reads, and fails closed after cap release. For the brokered userspace DeviceMmio.write32 path, the kernel runs the same manager-attached identity, policy, and range checks before one volatile dword write through the boot-preseeded kernel MMIO mapping cache, and it accepts only the single PCI MSI-X metadata-derived provider-scoped masked vector-control claim. The focused proof uses that idempotent dword on the virtio-rng BAR, confirms it through both read32 and the read-only userspace VMA, and proves an unclaimed message-address dword write leaves the original value unchanged. Unclaimed, unaligned, overflowing, and out-of-BAR calls report typed blocked labels before any write; stale or released handles fail closed before a write and do not return a write32 result payload. The same path proves a bounded zero-live DMAPool record lifecycle: active-owner attach, stale and owner-mismatch rejection, duplicate attach rejection, generation invalidation on revocation, DmaMappingsRemoved blocking while attached, teardown detach, and terminal release. It now also issues and records a bounded manager-attached DMA buffer handle under that attached pool, validates active SubmitDescriptor through the pure DMA-buffer validator, and records stale-after-revoke (stale-owner-generation), freed-buffer (freed), and reused-slot (stale-slot-generation) rejection with side-effect-blocked; the attached buffer record now also blocks zero-live pool teardown as dmapool-buffer-attached, rejects a stale same-slot proof-scoped FreeBuffer as dmabuffer-stale-handle with stale-slot-generation and side-effect-blocked while preserving the manager-owned buffer record, then validates an active FreeBuffer and manager-owned buffer-record detach as ok, after which the existing DMAPool detach succeeds. This still exposes no userspace handle and attempts no real DMA. The pending IRQ token path now also delegates the source/route-generation, masked, unregistered, invalid-owner, and malformed identity decision to a host-tested capos-lib::device_authority validator after snapshotting the live dispatch slot, while preserving the existing stale-pending-irq-* QEMU labels. This remains validator/adapter evidence, not production userspace interrupt waiter authority. The virtio-net smoke now also derives an imported live-accounting DMAPool record from the authoritative kernel-owned DMA ledger, records live buffer/page count, live bytes, in-flight submissions, committed/resident/unswappable flags, and scrub-before-release policy, and proves both teardown detach and DmaMappingsRemoved fail closed while that ledger is live. The live proof now consumes the device_dma teardown-evidence API, records the expected authoritative-ledger-live block with matching imported live accounting, and defers completion because this slice has no authoritative zero-live/scrubbed evidence for the live virtio-net ledger and does not attempt real DMA teardown, scrub, DmaMappingsRemoved, terminal Dead, or release for the live virtio-net record. A separate scratch-ledger proof reaches authoritative-ledger-zero-live only after both quiesce and scrub markers are set, without touching live virtio-net DMA. Another scratch proof covers stale DMA page handles by generation-tagging same-phys reuse and rejecting stale, wrong-queue, wrong-label, and duplicate-free attempts without mutating the active ledger. Userspace DMAPool/DeviceMmio/Interrupt authority, real lifecycle hook plumbing, real page quiesce/scrub/release cleanup, and broader driver interrupt dispatch remain planned. The kernel negotiates VIRTIO_F_VERSION_1 plus MAC when safe and VIRTIO_NET_F_MRG_RXBUF for QEMU’s merged-buffer virtio-net header, maps the virtio MMIO regions after kernel paging is live, allocates kernel-owned DMA pages for RX/TX descriptor, available, used rings, RX packet buffers, and one-shot TX buffers, submits a descriptor proof frame, sends an ARP request from 10.0.2.15 to 10.0.2.2, and observes the ARP reply in make run-net. Those current DMA pages now pass through a bounded kernel-owned device_dma pool ledger that proves live pool bytes, page counts, page-rounded MMIO mapping bytes, config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX descriptor submission/completion accounting while no userspace DMA/MMIO/interrupt handles are exposed. The net smoke also proves the current kernel-owned budget/OOM policy with a scratch ledger: page and byte allocation over budget, overlarge queue depth, duplicate and over-budget MMIO holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and descriptor submission beyond queue depth all fail closed while the live virtio-net ledger still validates normally, and the live device-manager record proof is derived from that same ledger without zeroing the copied record as a stand-in for real cleanup. The device-manager DMAPool record now also carries that budget profile, and the lifecycle/imported-live proofs read it through the active manager record plus matching DmaPoolHandle before checking zero-live accounting or live aggregate in-flight accounting against derived total budgets. The same scratch-proof pattern now covers zero-live teardown evidence and stale DMA page handles. Production userspace DMAPool, DeviceMmio, and Interrupt handles, production userspace DMA-buffer handles, real page cleanup/reuse, real DeviceMmio mapping objects, cache attributes/write policy enforcement, hostile stale-MMIO/DMA smokes, S.11.2 hostile smokes, and real doorbell writes remain unavailable; the current malformed-region and manager-attached buffer proofs are only bounded fail-closed metadata evidence in the manager proof path. The same smoke sends an IPv4 ICMP echo request to 10.0.2.2, validates the echo reply identifier, sequence, payload, IPv4/ICMP checksums, and addresses, and prints an icmp echo ok proof line. The former kernel smoltcp TCP HTTP smoke, scheduler-polled smoltcp runtime, Phase B NetworkManager/TcpListener/TcpSocket qemu-only cap objects, socket-backed Telnet terminal handoff, and POSIX DNS UdpSocket smoke are retired. The kernel no longer depends on smoltcp; qemu-only kernel TCP/UDP socket entry points fail closed; and the corresponding Make targets exit before QEMU with retirement diagnostics. Phase C (Networking Part 3) moves TCP/IP behavior into a userspace network stack process and keeps the kernel production surface focused on DMAPool/DeviceMmio/Interrupt device capabilities. The local serve-from-userspace proof now boots a non-qemu cloudboot manifest where a userspace smoltcp service grants an application client a TcpListenAuthority and serves TcpListener/TcpSocket caps for one hostfwd TCP round trip. A later local DHCP/IPv4 proof now lands the first lease/default-route/ARP configuration evidence on that userspace stack. Local bounded ICMPv4 Echo Reply diagnostics are also proved through a local cloudboot manifest, but remain diagnostic-only and outside the Web UI readiness ladder. For the selected GCE Self-Hosted Web UI milestone, the evidence order is local served TcpListenAuthority, local DHCP/IPv4, local Web UI L4, private GCE reachability, then the separately authorized public ingress/TLS proof. The legacy kernel socket owner no longer accepts non-qemu production manifest grants; qemu-only fixtures keep their explicit kernel socket sources until the broader Phase C exit cleanup removes that path.

Code: kernel/src/acpi.rs, kernel/src/diagnostics.rs, kernel/src/pci.rs, kernel/src/device_interrupt.rs, kernel/src/device_manager/, kernel/src/device_dma.rs, kernel/src/virtio.rs, kernel/src/cap/network.rs, kernel/src/cap/ring.rs, kernel/src/sched.rs, kernel/src/mem/paging.rs, kernel/src/arch/x86_64/pci_config.rs, Makefile, tools/qemu-diagnostics-smoke.sh, tools/qemu-iommu-acpi-smoke.sh, tools/qemu-net-smoke.sh, tools/qemu-net-harness.sh.

Validation: make run-diagnostics, make run-iommu-acpi, make run-net, make qemu-net-harness.

Security and Verification Track

The repo has Miri, proptest, fuzz, Loom, Kani, generated-code, dependency policy, trusted-build-input, panic-surface, and DMA-isolation work. CI now runs a bounded Kani gate for capos-lib bitmap, cap-table stale-handle, transfer preflight, transfer rollback split between source-visible rollback and destination-ledger restoration, and frame-grant accounting invariants. The heavier prepare-copy to provisional-destination seam proof passed in the high-memory make kani-lib-full Cloud Build gate, but coverage is not complete for every trust boundary.

References: Trusted Build Inputs, Panic Surface Inventory, DMA Isolation, and Security and Verification Proposal.

Future Work

Future architecture includes service restart policy, capability-scoped system monitoring, notification objects, promise pipelining, service-facing SharedBuffer APIs on top of the MemoryObject substrate, scheduling-context donation, session quotas, SMP, storage and naming, userspace networking, cloud boot support, user identity, policy enforcement, multi-front-end terminal hosts, richer native command surfaces, and broader language/runtime support.

Design references:

Changelog

A curated record of capOS’s shipped milestones: the significant, externally visible capabilities the system has demonstrated.

Each entry documents one landed milestone with the evidence that backs it – a shipped feature with measured behavior, a security finding closed with its fix and verification commands, a scaling proof with its data, or a benchmark with its host caveats – named, dated to the commit it landed at, and reproducible.

2026-06-09

Remote-session Web UI server-side session hardening – Review C high closed

  • The capOS-served Web UI (remote-session-web-ui) no longer derives its capos_remote_session cookie from the accept counter. It now mints an opaque, high-entropy server-side session id – a one-way SHA-256 (domain-separated, base64url) over the kernel-CSPRNG backend SessionInfo.session_id – and a per-session double-submit CSRF token from the same seed under a distinct label. The raw backend id never crosses to the browser (the digest is one-way). Landed at 91743ed4.
  • Server-side enforcement added before request dispatch: token rotation on login/re-login, cookie expiry + fail-closed rejection on logout and on a replayed rotated-out id, idle (30 min) and absolute (12 h) lifetime bounds via absolute monotonic deadlines, Host (DNS-rebinding) and Origin validation, and a required X-CSRF-Token double-submit on state-changing requests. The session cookie is Secure when X-Forwarded-Proto: https reports HTTPS ingress; the plaintext loopback proof stays explicitly non-Secure. This matches the committed operator-bundle/host-bridge CSRF contract; no schema/kernel/ABI change.
  • Evidence: make run-cloud-prod-remote-session-web-ui-l4 (local QEMU/cloudboot) now drives stale-token, CSRF (missing/mismatch), Origin (missing/cross-site), Host, idle/absolute expiry, cookie-attribute, and login/re-login rotation denial gates, each failing closed before any backend-held capability call (report.json sessionHardening: all gates true, tokenLen 43). Local proof only; not private GCE reachability, public ingress, or TLS.

2026-06-04

Userspace TCP over the capability NIC – TcpListener/TcpSocket round trip

  • A userspace process now completes a full TcpListener/TcpSocket round trip over smoltcp driven entirely through capabilities: frames cross the Nic capability, the kernel-owned keep-armed sustained-receive RX pool (Nic.receivePoll @4) feeds smoltcp’s RX token across the multi-frame TCP handshake, and no host-physical or device-usable address is exposed to userspace. This is the first userspace TCP (connection-oriented, multi-frame) path in capOS, landed at 002c5927 (Phase C slice 7c-iii).
  • It rests on the sustained-receive ABI that lifted the prior single-frame Nic.receive blocker (slice 7d, Nic.receivePoll @4, kernel-owned bounce pool with per-recycle scrub + slot-generation bump, no per-frame device reset). Evidence: the run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip QEMU proof exercises listen/accept/echo over the userspace stack.
  • Remaining for the full Phase C userspace L4 stack: the cap/network.rs production-contract relocation (parent task cloud-prod-userspace-network-stack-smoltcp-local-proof), after which the TLS-client handshake, self-hosted Web-UI L4, and IPv6-TCP tracks unblock.

2026-06-02

Real-GCE virtio-net NIC bind – the GCE Polling Path track closes

  • The billable real-GCE proof passed: a real e2-small instance (europe-west3-a, image capos-test-1780412056-e1cb, source commit 1fb65683) booted the production non-qemu cloud kernel from the legacy datapath manifest, the kernel-brokered legacy polled path bound the live GCE virtio 0.9 NIC (00:04.0, 1af4:1000), and the run passed the tools/cloudboot/run-test.sh --require-provider-nic-proof gate. Run 1780412056-e1cb; teardown_status=complete, no leaked sandbox resources.
  • This is the first real-hardware attestation of the legacy bind. Every stage ran end to end on GCE: candidate select over PIO BAR0 (iobase=0xc040), I/O + bus-master enable (command=0x0107), real GCE device MAC read (src_mac=42:01:0a:c8:00:12), NET_F_MAC negotiation (device_features=0x204399a7), full 4096-entry vring materialization (rx_queue_size=4096 tx_queue_size=4096 rx_vring_pages=28 tx_vring_pages=28 – the ~110 KiB/28-page contiguous frame::alloc_contiguous per queue that QEMU cannot emulate, since QEMU caps queue size at 1024), a broadcast DHCP DISCOVER TX, and a real device->host RX DMA within the TSC-governed wall-clock budget (rx_used_len=532 ethertype=0x0800 IPv4, rx_clock_usable=true, rx_iters=1). Marker: cloudboot-evidence: provider-nic-bound 0000.00.04.0-vendor.1af4-dev.1000-iobar.0-iobase.c040-usedidx.1-usedid.0-usedlen.532-ethertype.0800-txusedidx.1-srcmac.42010ac80012 from cap::provider_nic_bind_proof::report_real_completion_legacy.
  • The bind reached the device only after three distinct real-hardware premise conflicts were closed by prior local slices, each found by a bounded billable run: modern-only candidate select vs the legacy device (5b), the QEMU-SLIRP-only RX stimulus vs GCE anti-spoofing (5c, real-MAC DHCP DISCOVER + accept-any wall-clock RX), and the device’s 4096-entry queue exceeding the prior 1024 bound (5d, MAX_LEGACY_QUEUE_SIZE raised to the spec max 32768).
  • Honest scope: this is a kernel-brokered, polling-only data-path attestation (userspace_driver_authority=kernel-brokered-legacy-polled, interrupt_model=polled-no-msix, device_autonomous_raise=not-claimed, direct_dma=blocked, host_physical_user_visible=0). It is not a claim of userspace-driver authority, device-autonomous MSI-X delivery, an L4 socket round-trip (raw-frame reachability per the slice-5 Option-A decision; L4 is networking-proposal Phase C), or cloud storage readiness. It retires the cloud-gcp-virtio-net-nic-driver blocker.
  • Reproduce: build the cloudboot image with make capos-cloudboot-image MANIFEST_SOURCE=system-cloud-provider-virtio-net-legacy-datapath.cue, confirm make run-cloud-provider-nic-bound-legacy (and make run-cloud-provider-nic-bound-legacy-large-queue) green on the build commit, then tools/cloudboot/run-test.sh --require-provider-nic-proof (BILLABLE; operator-authorized 2026-05-27, commit 2aaeaa53).

2026-05-30

Device Driver Foundation – production bind-stack qemu-gate dissolution

  • Umbrella cloud-prod-ddf-bindstack-qemu-gate-dissolution closed at commit fdc8eb66. The production (non-qemu) cloud kernel’s device-authority surface is now always-built code fronted by fail-closed runtime capability probes, graduated off the overloaded qemu gate and the per-proof cloud_*_proof feature modules it previously hid behind, while iommu.rs stays gated and brokered bounce-buffer-only DMA (no host-physical/IOVA export) is preserved. Landed as six reviewed slices:
    • 29a76850 – RX MSI-X Interrupt.wait waiter-wakeup determinism: the provider-consumer flake was a synthetic-dispatch ordering race, fixed by gating injection on the owner being parked in cap_enter (sched::thread_blocked_on_cap_enter); 28/28 make run-ddf-provider-consumer (baseline ~18% flake).
    • ef2548b3 – grant-source de-specialization: the prod {dmapool,devicemmio,interrupt}_grant_source statics stage an arbitrary enumerated function through one stage_with_class entry point taking a ProdGrantClass descriptor (cap::prod_grant_source_class), bit-identical.
    • b7d30ec3 – MSI-X program/attach/arm/unmask + kernel-injected-dispatch wait graduated into always-built cap::interrupt_programmed / device_interrupt::wait_kernel_injected_dispatch.
    • 82c2ed53 / b2168e05 / ad6da6ce – the device_manager backend port: always-built ProductionDeviceTable device-record/handle backend, per-record bounce-buffer DMA-pool backend, and interrupt-route backend (parent cloud-prod-device-manager-backend-port).
    • fdc8eb66 – split test-harness affordances off the qemu feature.
    • Reproduction: make run-cloud-devicemmio-grant, make run-cloud-dmapool-grant, make run-cloud-interrupt-grant, make run-cloud-provider-cap-waiter, make run-cloud-provider-nvme-readonly-bind, make run-ddf-provider-consumer, make run-net. Remaining DDF work – userspace virtio-net RX/multiqueue and NVMe I/O-queue provider readiness, plus live cloud bind – is tracked by the separate provider-parent tasks, not this umbrella.

2026-05-23

Device Driver Foundation – userspace virtio-net provider closeout

  • Commit c86374f8 (2026-05-23 16:51 UTC) closes the first local bounded userspace virtio-net provider-driver proof for Task 6. The provider-consumer smoke now asserts one stable closeout line tying together selected queue 1 TX descriptor/avail/doorbell/used-ring/CQ ownership across the full QEMU TX queue depth, bounded queue 0 RX synthetic-token CQ identity, selected TX/RX MSI-X/LAPIC wait/ack/EOI, selected-route mask/unmask/reset/reassignment, teardown, stale-handle blocking, and no silent provider fallback. Reproduction: make run-ddf-provider-consumer, make run-net. This remains bounded local QEMU provider evidence over manager-owned bounce buffers; live hardware RX used-ring ownership, full virtio-net ownership, direct DMA/IOMMU, cloud NIC/storage readiness, and virtio block/storage drivers remain separate work.

Device Driver Foundation – provider TX full-depth CQ ownership

  • Commit e248d42b (2026-05-23 13:36 UTC) extends selected userspace virtio-net TX CQ ownership from the prior four-outstanding window to the full eight-entry TX queue depth used by QEMU. The smoke now proves eight live manager-owned bounce buffers, descriptor/avail publication and notify doorbells for descriptors 0 through 7, wrong-order completion fail-closed at descriptor 7, in-order CQ identity delivery/ack for all eight descriptors, ninth allocation rejection without pool expansion, teardown-only drain for seven incomplete descriptors, and release retirement of seven delivered but unacknowledged CQ events. Reproduction: make run-ddf-provider-consumer, make run-net.

Device Driver Foundation – provider RX wait/ack dispatch-token proof

  • The provider-consumer smoke now promotes the provider RX interrupt grant’s wait/ack path beyond the blocked skeleton for one selected RX dispatch token. rx_interrupt.wait can pend, stay unpromoted by generic route delivery-count advancement, and wake only after a selected RX MSI-X/LAPIC dispatch validates the live RX issue, selected RX source, source generation, route generation, virtio-net owner, and driver-unmasked route state. The paired rx_interrupt.acknowledge accounts exactly one bounded RX hardware-dispatch ack for the delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release wait/ack attempts remain fail-closed. Reproduction: make run-ddf-provider-consumer. RX descriptor publication, RX CQ identity, real hardware IRQ acknowledgement/deferred EOI, direct DMA/IOMMU, full virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open.

POSIX Adapter – File/Directory fd closeout

  • Commit f97d9833 (2026-05-23 06:23 UTC): the posix-file-directory-client-capos-rt task closes the v0 File/Directory fd surface on top of the existing Storage Phase 3 RAM-backed Directory cap. libcapos-posix now implements lseek() over the per-fd file position and readdir() as a lazy Directory.list snapshot, while preserving the existing pipe, UDP, Console, and TerminalSession fd paths. The new make run-posix-file proof boots a live C process that creates a file through open(), writes, seeks, reads, lists the root directory with opendir() / readdir(), closes both handles, and asserts relative paths still fail closed. Remaining P1.4 dash-port work is the printf/string subset, signal/time stubs, identity stubs, dash vendoring/patching, multi-TU C build, and run-posix-shell-smoke.

Device Driver Foundation – provider TX release retires three unacked CQ events

  • The provider-consumer smoke now extends the selected TX release-retirement path to three delivered but unacknowledged bounded provider TX CQ events in one live issue. The smoke completes descriptors 0, 1, and 2, consumes all three through tx_interrupt.wait, skips all acknowledgements, proves the stale-bound in-flight descriptor remains in fixed DMABuffer slot 3, and asserts provider tx_interrupt release retires three pending provider completion acks without hardware acknowledgement. The claim remains bounded CQ teardown evidence only; deferred EOI, hardware acknowledgement, hardware IRQ ownership, direct DMA/IOMMU, full CQ ownership, full userspace virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open. Reproduction: make run-ddf-provider-consumer.

Device Driver Foundation – provider TX release retires unacked CQ event

  • Commit 11eeab2e: the provider-consumer smoke now proves release-time retirement for a delivered but unacknowledged bounded provider TX CQ event. The smoke drives a selected TX completion through DMABuffer.completeDescriptor and tx_interrupt.wait, deliberately skips tx_interrupt.acknowledge, releases the provider tx_interrupt cap, and asserts the release proof records one pending provider completion ack retired from the ledger. The stale post-release acknowledge path remains revoked; the completed buffer can still be freed normally; and deferred EOI, hardware acknowledgement, hardware IRQ ownership, direct DMA/IOMMU, full CQ ownership, full userspace virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open. Reproduction: make run-ddf-provider-consumer.

Device Driver Foundation – provider RX descriptor boundary proof

  • Commit 2bd5add5: the provider-consumer smoke now records an explicit provider RX queue 0 descriptor boundary while the live RX interrupt issue is active. The proof keeps the existing DMABuffer.submitDescriptor(queue=0) / completeDescriptor(queue=0) path as neutral bounce-buffer accounting and asserts that RX ring publication, provider CQ publication, provider IRQ delivery, hardware acknowledgement, and direct DMA remain blocked while kernel RX cohabitation is unresolved. Reproduction: make run-ddf-provider-consumer. Honest caveat: RX descriptor publication, RX CQ identity, RX waiter delivery, direct DMA/IOMMU, full virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open.

2026-05-17

Kernel – scheduler/IPC recoverable-panic surface closed

  • The scheduler and IPC hot-path .expect()/.unwrap() sites that could panic on stale run-queue or thread-metadata invariants are now hardened across the seven hot-path functions. block_current_on_cap_enter logs and returns false (yielding u64::MAX at the syscall boundary); next_start_context logs and returns None so the caller’s retry loop (kernel_idle_entry, start_current_cpu, start_ap) selects another thread; and schedule(), exit_current(), exit_current_thread, and capos_block_current_syscall() drop the scheduler lock and crate::hcf() on the dispatch/exit/block paths that have no caller-side recovery, matching the canonical last-process-exited halt. retain_endpoint_queue (kernel/src/cap/endpoint.rs) breaks with a diagnostic kprintln on a queue-length mismatch instead of panicking on pop_front(). The PhysFrame::from_start_address(cr3_phys) panics on the exit and syscall-entry paths are intentionally retained: corrupted CR3 is genuine memory-state corruption, not transient queue inconsistency. An explorer audit confirmed no recoverable panic surface remains (2026-05-17 00:41 UTC).

Kernel – resource quota fields fully wired

  • All three sub-items from the prior “partially wired” resource-quota finding are now enforced (closed 2026-05-16 19:19 UTC). The per-process carrier (capos_config::ResourceProfile on Process::resource_profile) lands the profile at spawn from SessionMetadata::profile via RamAccountStore. ringScratchLimitBytes sizes the per-process input/output/reply scratch buffers and rejects oversize CALLs with CAP_ERR_INVALID_REQUEST; replyScratchLimitBytes clamps the exception reply-scratch buffer to the profile ceiling (closed 2026-05-16 20:52 UTC), fixing the #175 asymmetry that produced spurious CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED on small-ring processes; endpointQueueLimit and inFlightCallLimit carry the owner profile’s values into Endpoint::try_new, clamped by the kernel ceilings MAX_QUEUED_CALLS=32 / MAX_IN_FLIGHT_CALLS=32. Scope caveat: the endpoint-scoped bounds are per-endpoint relative to the owner profile, not a strict per-process counter across all endpoints the owner holds. ResourceProfileRecord @13 is tombstoned as retired13. Reproduction: make run-ring-scratch-limit, make run-reply-scratch-limit, make run-endpoint-queue-limit, make run-in-flight-call-limit.

Device Driver Foundation – Hardware-audit userspace service (durable-audit Step 2b)

  • Commits 037256ce (initial slice) and the remediation follow-up on the same ddf-audit-userspace-service branch: durable-audit Step 2b of 4 delivers demos/hardware-audit-service, which polls HardwareAuditLog.drain with the cursor protocol from Step 2a, accumulates records in memory, and serves a typed HardwareAuditReader.snapshot over a kernel-allocated Endpoint retagged for the consumer. The reader cursor is fail-closed: an expectedSequence outside {0, current drain cursor, any retained record's sequence} – or a malformed-capnp param payload – is rejected with a typed InvalidArgument exception, mirroring the kernel-side HardwareAuditLog.drain rejection so a stale or forged cursor cannot silently skip or repeat records. The schema docstring for HardwareAuditReader now states this contract explicitly. Reproduction: make run-ddf-audit-service-smoke proves boot-record accumulation, a service-handoff snapshot, a release-triggered follow-up snapshot, signatureStatus = "unsigned", and the negative cursor-mismatch rejection (matched on both the service-side snapshot-rejected exception_type=invalid-argument and the consumer-side cursor-mismatch rejected ok exception_type=invalid-argument markers). Steps 3 (segment signing with key management routed through the cryptography proposal) and 4 (durable Store-backed persistence with a defined rotation contract) remain open follow-ons.

2026-05-16

POSIX Adapter – P1.4 Slices 3 and 4: functional file I/O end-to-end

  • P1.4 Slice 3 (FdBacking File/Directory/Terminal variants and the make run-posix-file-backing-smoke proof of Terminal routing) landed at ae58f936 (closing merge 4c70a03d). P1.4 Slice 4 (absolute-path resolver, functional open()/opendir() over the bootstrap-granted root Directory cap, per-fd file position tracked across read()/write(), and the make run-posix-open-smoke proof of create+write+close, then re-open for read, plus relative-path rejection) landed at 94b29177 (closing merge de4235f9). closedir() releases the local slot only. This is the first non-shell POSIX subsystem to reach functional parity: open(path, flags)read/writeclose works end-to-end through libcapos-posix on top of the Storage Phase 3 RAM-backed Directory authority. Reproduction: make run-posix-file-backing-smoke (Slice 3), make run-posix-open-smoke (Slice 4), and the existing make run-posix-stdio-smoke regression remains green. Honest caveat: the remaining v0 dash port work is stdio adoption (Slice 5), env vector (Slice 6), printf/string subset (Slice 7), signal/time/identity stubs (Slices 8-10), and dash vendoring + smoke (Slices 11-13).

Scheduler – Phase F remote-CPU nohz activation via reschedule IPI

  • Commit 8c1601ac: the Phase F auto-nohz preflight no longer requires the lease’s target CPU to be the current CPU for the namedRing = none compute-lease shape. When the single-CPU allowedCpuMask targets a different scheduler CPU, the kernel parks a bounded remote-activation request in the target CPU’s per-CPU slot and sends a reschedule-style IPI; the target CPU drains the request from its IPI handler (timer-handler backstop) and re-runs the full disqualification check locally under try_lock before arming its own one-shot deadline – remote activation is never trusted blind. Reproduction: make run-scheduler-cpu-isolation-lease.

Device Driver Foundation – virtio-net provider RX bootstrap-grant skeleton

  • Commit b710d4fd: the provider RX path now mirrors the provider TX bootstrap-grant authority at the skeleton level over the selected virtio-net RX MSI-X route. Adds validate_provider_rx_interrupt_route as the receive-queue counterpart of the TX route validator (same admission shape, same active/resetting state gate, same interrupt_owner_for_device_owner mapping; only the PciMsixInterruptRole::RxQueue role tag differs) plus matching cap::interrupt_grant_source init/build/release entry points. Skeleton only: live RX DMA, completion delivery, and hostile-smoke coverage remain open.

Device Driver Foundation – provider RX selected-route MSI-X control

  • Commits 5ea850c3, 1d2be684, and 9f3f8a8c: the provider RX rx_interrupt cap validates the live RX issue and selected virtio-net RX route before bounded mask/unmask of the selected RX MSI-X table vector-control bit and route state. The provider-consumer smoke asserts vector-control readback, delivery-count preservation, stale methods after release, and release-while-masked cleanup back to driver-unmasked; cleanup failure leaves the live issue uncleared so future RX cap issuance stays blocked on uncertain route state. RX wait/ack, descriptors, provider CQ identity, hardware acknowledgement, deferred EOI, full RX ownership, direct DMA/IOMMU, cloud readiness, and production userspace driver readiness remain open.

SSH session – explicit UserSession.logout failure-path proof

  • Commit 9e7328e6: test(ssh-public-key-session) proves the explicit UserSession.logout failure path, closing part of the open REVIEW_FINDINGS Low item.

Storage Phase 2 closed in docs

  • Storage & Naming Phase 2 (schema BlockDevice/File/Directory interfaces) was marked done in docs/tasks/done/2026/ (commit 0551941c) after Phase 3 slices 1–3 shipped on 2026-05-14. A separate task-state reconciliation sweep (ad280bf2, closing merge 189d4af2) realigned docs/tasks/ directory state with docs/tasks/README.md ground truth.

Production Provenance Milestone

  • Landed across 1feee12b..6f775925 (2026-05-16). All GitHub Actions steps pinned to immutable commit SHAs, Rust nightly pinned to an exact date, and OVMF/qemu-system-x86/xorriso pinned to exact apt versions in the QEMU smoke apt-install. Each CI run publishes a build-provenance artifact and PRs run an advisory cross-run compare against the base-branch artifact. Reproduction: make build-provenance produces the provenance artifact locally; CI artifacts appear as build-provenance-<sha> per qemu-smoke run. Full pin inventory and bump procedure: docs/trusted-build-inputs.md. Honest caveats: the PR compare step is advisory-only (not PR-blocking); URL-based download-and-verify for OVMF and other pre-built tool binaries (Option B) remains future hardening.

2026-05-14

Device Driver Foundation – IOMMU VT-d remapping closed across A1/A2/B/C

  • The QEMU Intel IOMMU remapping milestone closed across four reviewed slices in a single day: A1 active-programmed legacy-mode table (3a60a401), A2 hardware-DMA translation proof through virtio-rng with an observed VT-d fault on an unmapped IOVA (dfedf574), B register-based context-cache / IOTLB invalidation with an ordered scrub-after-invalidate revocation cycle (24eb587e), and C two-phase revocation with hostile stale-handle / stale-completion smokes (closing merge 274ff63f, follow-up 873eef56). Reproduction: make run-iommu-remapping asserts table program proof, invalidation proof, hostile stale-handle proof, and hostile stale-completion proof all as proof_result=ok (QEMU 8.2.2). Honest caveat: QEMU-only evidence (hostile_hardware_isolation=not-claimed); the live virtio-net DMA path still uses bounce buffers, and production userspace-driver IOMMU authority remains open.

Device Driver Foundation – virtio-net provider four-outstanding TX window

  • The userspace virtio-net provider TX path reached a four-outstanding completion-queue window with full provider descriptor/avail publication, one notify doorbell per descriptor, real IRQ-dispatch-backed tx_interrupt wait/ack/mask/unmask, per-event provider CQ identity, and a four-slot bounce-buffer pool. The provider notify doorbell write moved off the broker into the provider’s own scoped notify-MMIO cap (ef979d17), and the harness now proves a general write32 verb fails closed on that same notify cap (95a65e99). Reproduction: make run-ddf-provider-consumer, make run-net.

Scheduler – per-CPU CPL0 kernel idle thread closed

  • The user-mode idle process was replaced with a per-CPU CPL0 kernel idle thread across Increments 1a..2e (final merge 2bba8d11). All four idle dispatch sites route through the CPL0 idle context: schedule() timer path, capos_block_current_syscall (block path), exit_current and exit_current_thread (two exit paths). Each scheduler CPU slot owns a dedicated CPL0 idle kernel stack and CpuContext; the synthetic idle Process record is retained only so the idle ThreadRef resolves through the scheduler’s ThreadRef-centric bookkeeping. A TLB-flush drain in kernel_timer_interrupt_handler closes the CPL0 idle-residency gap. Reproduction: make run-scheduler-cpu-isolation-lease asserts idle_path=cooperative-cpl0 for the boot/AP loop and idle_path=cpl0-dispatch-{timer,block,exit} for the dispatch sites.

Scheduler – per-CPU nohz tick suppression and SQPOLL-driven activation

  • Phase F got its first real nohz increments: per-CPU periodic-tick suppression for the single-runnable window (commit 9e269e31) and SQPOLL-driven auto-nohz activation for ring-coupled leases (commit 8edd2314), built on per-CPU CPL0 idle-thread context infrastructure (commit 5ac6b08f). Generic full-nohz and timeout-based auto-revoke remain future work.

Storage – Phase 2 schema + Phase 3 slices 1–3

  • Phase 2 added schema-only BlockDevice / File / Directory / DirEntry interfaces (commit 4c0d940c). Phase 3 then delivered three kernel CapObject slices the same day: a RAM-backed File cap with read/write/stat/truncate/sync/close and 64 KiB per-call inline-payload bound (slice 1, d06dff6b); a RAM-backed Directory cap with subtree bring-up + grant path + QEMU smoke (slice 2, b11ec9e4); and a content-addressed RAM-backed Store + name->hash Namespace pair (slice 3, 804a3f41). Reproduction: make run-file-server-smoke, make run-directory-server-smoke, make run-store-namespace-smoke. This unblocks POSIX adapter Phase P1.4 (dash port) and WASI host adapter Phase W.5 (filesystem surface), both previously blocked on the cap shape.

Remote session – self-served web UI is the default boot

  • Commit 5594c9ef wires the capOS-served remote-session web UI into the default operator boot (cue/defaults/defaults.cue, scoped loopback listener cap, make run host-port line). The Rust backend owns connection and session state; the browser receives view models and redacted transcript rows. Reproduction: make run, or the focused make run-default-web-ui.

2026-05-13

Device Driver Foundation – provider TX three-outstanding + MSI-X mask + ack

  • Provider TX CQ window expanded from two to three outstanding descriptors (d6458381, 6ee96d29), descriptor-issue-bound completions (f894ee1a, bc3280be), MSI-X mask/unmask control with atomic rollback on failure (b5af7335, 3c8dc627, ca9beb73, be506075), real IRQ-dispatch wake of provider TX waiters (203461da), and hardware dispatch acknowledgement accounting (caaa388c, bc11c1aa). Provider CQ teardown gained an end-to-end quiesce/retire proof (d8760182, dff94930, 235ed1e8). Reproduction: make run-ddf-provider-consumer.

Scheduler – SQPOLL producer-wake progress

  • Scheduler added the bounded SQPOLL producer-wake increment (commit 0dbb5542) with preserved wake-result state across stop (commit dbf8e0ff) on the Phase F prerequisite path that the 2026-05-14 SQPOLL-driven auto-nohz activation builds on.

Kernel – endpoint / park-waiter rollback hardening

  • IPC rollback paths hardened: endpoint pending-recv rollback (e46b52dc), preserved endpoint recv rollback capacity (db454b59), kept endpoint recovery live across revocation (a1ccbda1), preserved park wake status during retry (d37edf54), and drained private park waiters after unmap-retry (1e0ce242). Bounded recovery surface; no new authority.

Language adapters – POSIX stdio, WASI env, libcapos C pipe + entropy

  • POSIX adapter routed stdio writes to the Console cap (aa6a56d7, d442a3b7) and exposed posix_spawn file actions (b8fb3131). WASI added bounded per-instance environment grants (5f5028e7, 987e7814), promoted stdio-compatibility Preview 1 imports (1a79037b), and gained an unauthorized-import refusal harness (c803565f, 756b5ba8, 1b53acd8). Native libcapos shipped a C pipe smoke (b6c2d4bb) and an EntropySource.fill wrapper (b1f7a3c1, 6b3b7425).

Remote session – Paperclips Path B + Tauri scaffolding

  • Paperclips launch wiring Path B (worker + gateway + bridge) landed (701522b9), with host chat RPC facade over DTO (7cf4cf2c) and DTO schema for Path A (e159adb8). A remote-session Tauri wrapper scaffold (5691ec2a) plus capability policy hardening and preflight (eff47eb3, b41ba656, ff58cacf) prepare the desktop wrapper path.

Build hygiene – build-provenance compare harness

  • New build-provenance comparison make target (07722584, a34cb441, b3ed23cf, 779f3ce0) records runner identity in provenance (00272130) so two runners can be compared offline.
  • Substantial documentation refresh across ~40 proposals and architecture docs: cross-links sharpened, last_reviewed stamps refreshed, and proposals updated against shipped state (POSIX P1.1/P1.2/P1.3, Scheduler Phase D/E close and Phase F SQPOLL, error-handling against current ExceptionType + CapException, networking proposal against transitional kernel state, userspace-binaries Parts 4/5, scheduler Phase F.5 cross-link). The 29-page research index landed (18fbaf35). No behavior change.

2026-05-12

Scheduler – Phase F nohz scaffolding + CpuIsolationLease

  • Phase F infrastructure landed across the day: nohz telemetry (aeb0f4d2), the clockevent/deadline substrate (268b44c2), CpuIsolationLease scaffold (e9ab9e46, 97f958a7), bounded SQPOLL ring mode (6dcbb69a, 07578bec, cdbb45be), and housekeeping/ deferred-work placement (c7580873). These are the prerequisite chain the 05-14 SQPOLL-driven auto-nohz activation depends on.

Device Driver Foundation – provider TX completion delivery + IRQ regrant

  • Provider TX gained completion-event delivery (334818af), hardened interrupt delivery (dfebd411), serialized wait posting against IRQ race (bbe2fea7), and reset-disable lifetime hardening (50c6c8cd). Provider TX interrupt teardown event proof (9205af3e, d8760182), interrupt reset reassignment proof (8aee7d42), and a disabled Intel QEMU IOMMU scaffold with MMIO status diagnostics (02688941, 41314553, 277dbd26) prepared the 05-14 IOMMU Slice A1.

Userspace runtime – libcapos fail-closed on C-runtime threading

  • libcapos now fails closed on C-runtime threading attempts (57ad4bfa), closing the remediation entry tracked in docs/proposals/libcapos-c-substrate-proposal.md.

Remote session – local UI bridge hardening

  • Local UI bridge was hardened (cab6f791) with refreshed evidence (21a6a16a); login auth denial is now treated as a login failure rather than a transport error (58d198f3, b8486eac).

2026-05-11

Device Driver Foundation – provider/consumer split + DmaPool manager-authoritative

  • The selected virtio-net userspace provider path proof landed (9ca39ff8, e6fb4c91): provider-consumer authority smoke (3d59cb8b), provider shadow descriptor side effect (db5c4995, c91b1477), selected provider metadata gated to the TX queue, MMIO/IRQ smoke (c52064c0, d1c6cece), and DMA accounting extension (40a77ae0). DmaPool lifecycle became manager-authoritative (96e1107e): blocks mapped DMABuffer manager-free (149ef53e), keeps DMABuffer unmap live in the manager (d6b7a292), validates cap identity before stale side effects (4f404f44), and propagates budget checks through release paths (877ed956, 01c078f5, 45903e5e). The provider notify-MMIO no-write cap grant landed (78c627a9, d177ffa1, b8f7b83f) with stale admission proof (0ec23fe6, a5a722de, fefd5267, 3ed4418f) and submit-path carry (54b1499f).

Scheduler – one SQ consumer + session-logout hooks + context bind/revoke

  • One SQ consumer owner is now enforced (c427f3d9, b503b640) – the Phase F prerequisite that ring-coupled nohz depends on – with matching scheduling-context ring/SQ owner proof stabilizations (e5ec448c, 811a4976). Scheduling contexts can donate over endpoints (d2eab605, 6343ec84) with block/settle around donation (93706ecb, d202ca31, 599f4649), bind/revoke generation (3b6e1bb5), identity-aliasing fix (fe4e340b), notification cells (c0e4470c), and exit-cleanup budget preservation (f30a9bb3, 8b00c480, 3725c87e, eaeb1071). Session-logout cleanup hook marks bound contexts stale (594f1353), propagates shell exit to session logout (0d9b3d90, 0e9c8dd1), proves logout stale contexts fail closed (59dca4e8), donated logout skip policy (49de54f2), and blocks stale-snapshot budget refresh (9d1cd80b).

Device Driver Foundation – device-manager kernel refactor

  • The device-manager kernel module was split into focused submodules (99c37592, 734383f9, af539f6c, 9c0a5183, 98dddb72, bfdb78a0), reducing the proof-stack footprint and exposing shared authority admission helpers ahead of Slice A1’s per-domain remapping work. Process-exit teardown smokes for DMA/MMIO and Interrupt landed (b452f18e, 746c1742), and a stage for the DDF IOMMU remapping-domain ledger went in (636edfb2).

Remote session – observable self-served UI proof

  • Self-served remote-session web UI gained an observable proof (e4ab7b41, 0eb68aa8, 971d8ce8, 65fe4bf7, 28db3277, 505b553c, f0254b02), closing the read-side gate before its 05-14 default-boot promotion.

2026-05-10

Scheduler Phase D closed

  • 2026-05-10 19:39 UTC, mainline commit 77caafc0 (closeout 1a08ec23): Phase D is closed. Accepted WFQ slice covers SchedulingPolicyCap weight/latency-class authority, per-thread weighted vruntime and per-enqueue virtual_finish_ns, per-CPU WFQ run queues, bounded steal/migration invariants, fairness/interactive/ weight-change smokes, and the controlled Task 6 thread-scale gate.

    Phase D thread-scale benchmark (five runs, KVM, physical-core logical CPUs 0,1,2,3, blocking parent join, 262,144 blocks / 16 MiB, work_rounds=64):

    ComparisoncapOS Phase D WFQLinux pthreadcapOS gate
    1->2 work1.809x1.996x>= 1.6x
    1->2 total1.774x1.995x>= 1.6x
    1->4 work3.088x3.974xdiagnostic target >= 2.5x
    1->4 total2.700x3.850xdiagnostic baseline > 1.538x

    The 1->2 work/total rows passed the harness-enforced gate; the 1->4 rows were manually accepted from recorded diagnostics. Raw artifacts: target/thread-scale/20260510T193200Z/ and target/linux-thread-scale/20260510T194600Z/. Reproduction in tools/qemu-thread-scale-harness.sh (via make run-thread-scale) and the matching make run-linux-thread-scale-baseline.

    Bottleneck analysis. Linux pthread scaled near-linearly on the same physical CPU set, so the workload shape is sound and the remaining 1->4 gap is a capOS scheduler/runtime cost. The dominant contributors visible in measure-mode and post-thread-scale review are:

    1. Global Scheduler lock contention. Per-CPU WFQ run queues exist, but several scheduler decisions (cross-CPU wake targeting, direct-target stale cleanup, queue reservation accounting) still funnel through one Scheduler mutex. Total-time scaling regresses faster than work-window scaling because exit/join/block/schedule paths spend disproportionate time inside that critical section.
    2. Process-wide capability ring under one SQ consumer. A multi-thread process has one ring endpoint owned by one SQ consumer at a time. Completions, waker resolution, and direct IPC all serialize there, even when scheduler dispatch is per-CPU.
    3. Temporary four-owner scheduler-CPU assumption. The selected scheduler topology is currently hardcoded to four owner CPUs; the boot-time CPU set is not yet discovered, so workloads larger than four cores cannot be admitted at all.
    4. Periodic-tick service tax. Non-isolated CPUs still pay the periodic timer-tick cost on every scheduler tick, even when no thread is ready to run; nohz suppression today only fires inside the narrow single-runnable window and the SQPOLL-coupled lease.

    Planned architecture changes that should improve SMP / threading scalability. The roadmap response is concrete:

    • Scheduler Phase F.5: Full-SMP 16/32-core scalability (docs/backlog/scheduler-evolution.md, cross-linked from docs/proposals/smp-proposal.md). Replaces the four-owner assumption with dynamic CPU topology discovery, adds the x2APIC backend needed for higher APIC ids, shrinks scheduler shared-state serialization so local pick/requeue can avoid the global lock, and adds topology-aware placement plus an observable migration policy. A 1/2/4/8/16/32-worker hardware benchmark suite against a matching native-Linux baseline is the gate.
    • Ring v2: per-thread capability ring ownership (docs/proposals/ring-v2-smp-proposal.md). Completions route by ThreadRef -> RingEndpoint, removing the single-SQ-consumer bottleneck and unblocking concurrent scheduler-owned work on more than one CPU per process. Needs TLB shootdown + cross-CPU cleanup review.
    • Generic full-nohz + generic SQPOLL nohz (docs/architecture/scheduling.md Phase F follow-ons). Extends the bounded SQPOLL-driven activation (2026-05-14, commit 8edd2314) to arbitrary rings and threads, retiring the periodic-tick service tax on non-isolated CPUs and enabling realtime islands.
    • EEVDF policy evaluation (deferred behind Phase F). Tracked as a follow-on dispatcher policy, not a Phase D blocker.
    • SchedulingContext over endpoints (Phase E, see 2026-05-14 entry above). Already-landed; lets a server inherit a caller’s reservation, reducing IPC scheduling round-trips for the same workload class.

Device Driver Foundation – DDF authority surface broadened

  • DDF capability surface broadened in one day: DeviceMmio.map returns a read-only userspace BAR VMA with the unmap cleanup paths exercised across release/drop/driver-crash/ reset-disable; DeviceMmio.write32 exposes typed admission; DMAPool.allocateBuffer reaches a three-slot bounce-buffer pool with manager-owned generation; DMABuffer accounting tracks per-slot in-flight descriptor identity with live_inflight aggregation; and Interrupt.mask / unmask perform bounded manager-mediated route-state control. Typed denials for invalid protections (mmio-map-prot-invalid) and invalid allocate-buffer requests (dmapool-allocation-request-invalid) now route through the no-result-cap admission path. Reproduction: make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, make run-hardware-grant-cycle. This remains bounded manager accounting and admission proof: no direct DMA, doorbells, IOVA exposure, IOMMU programming, or production driver consumer.

2026-05-09

Device Driver Foundation – Interrupt/DMABuffer admission + audit hardening

  • The Interrupt skeleton gained typed admission across acknowledge, wait, mask, and unmask (admission-check-only, *-not-attempted, side-effect-blocked), with the pending-IRQ token validator factored into capos-lib::device_authority. DMABuffer.submitDescriptor / .completeDescriptor follow the same admission pattern with per-slot in-flight accounting and typed freeBuffer. The manifest-granted Interrupt retains its claimed MSI-X source across cap releases for sequential grant-source reuse (commit 681e48ac).
  • HardwareAuditLog.snapshot exposes the volatile snapshot contract as typed result metadata (bounded-volatile-ring-drop-oldest, volatile-only, unsigned, production-admission-policy-not-implemented) plus typed truncation labels (commit e4cea6ff), a startSequence cursor for the retained ring, and cursor edge-case metadata for the below-oldest / past-end cases. Hardware audit smokes assert grant-source acquire/ release identity for DeviceMmio, Interrupt, DmaPool, and DmaBuffer; cap-audit assertion sets are now exact-count anchored so boot-time virtio-rng proof records cannot satisfy the grant audit checks.
  • Reproduction: make run-interrupt-grant, make run-devicemmio-grant, make run-dmapool-grant, make run-hardware-audit, make run-hardware-grant-cycle. Bounded manager accounting and read-side audit only; production userspace-driver authority remains open.

Device Driver Foundation – DMAPool parent-first release ordering

  • Commit 29b4dde5: make run-dmapool-grant releases the parent DMAPool before the result DMABuffer. Parent release stages a pending detach while the proof buffer is still attached; the DMABuffer release then frees the proof page and completes the staged zero-live pool detach as the single DmaPool cap-op-release audit. DMABuffer driver-crash and reset-disable run-net proofs also complete any pending parent release instead of orphaning it.

2026-05-08

Device Driver Foundation – DMAPool grant + hardware audit cap

  • KernelCapSource::DmaPool now grants the bounded single-proof-buffer allocation path (commit f95c6cf8): capos-rt exposes typed DmaPoolClient::allocate_buffer_wait and DmaBufferClient::info_wait, and make run-dmapool-grant proves the manager-attached buffer lifecycle plus the matching dmapool / dmabuffer audit records. An earlier info-only grant variant routed through kernel/src/cap/dmapool_grant_source.rs and reused the virtio-rng ManagerGrantSource device handle.
  • KernelCapSource::HardwareAuditLog exposes a read-only HardwareAuditLog.snapshot cap backed by a bounded volatile drop-oldest ring in kernel/src/cap/hardware_audit.rs, so userspace observes hardware-cap audit records without parsing COM1 text. Reproduction: make run-hardware-audit.
  • Interrupt waiter teardown trigger now routes through the stale-safe detach helper used by cap release / driver-crash / reset-disable cleanup (commit aeef8b41). make run-net asserts the interrupt waiter hook proof line and an exact-one cap-audit: cap=interrupt event=interrupt-waiter count.
  • Hostile-smoke gate hardened: tools/qemu-net-smoke.sh anchors the remaining proof-line assertions in the S.11.2 hostile-smoke gate with exact-count guards + anchored suffix assertions.

Cloud boot – GCP imported-image serial boot recorded

  • Cloudboot run 1778230874-715a (2026-05-08 09:06 UTC) against source 3951e275: make cloudboot-test built the 10 GiB GCE-compatible disk tarball, uploaded it to the staging bucket, created a temporary GCE image + e2-small instance, and observed the capos kernel starting serial landmark on poll attempt 2. Serial evidence shows SeaBIOS booting from Google Persistent Disk virtio-scsi (10240 MiB), 2 vCPU / 2 GiB RAM discovery, Google RSDT/MADT tables, fail-closed IOMMU policy (no MCFG/DMAR/IVRS), masked I/O APIC routing, AP online, manifest load, init start, and shell spawn. The harness copied artifacts to target/cloudboot-evidence/run-1778230874-715a/ before deleting the temporary instance, image, and staged tarball.

2026-05-07

WASI Host Adapter Phase W.2 – C and Rust hello-wasi smokes closed

  • Commit 7bfcb1d8: WASI host adapter Phase W.2 closed. Both Rust (wasm32-wasip1) and C (wasm32-wasi) hello, wasi payloads run inside the wasmi interpreter under the wasm-host capOS process and print through the host’s granted Console cap via the Preview 1 fd_write(1, ...) surface. Closed in four sub-slices: (1) the wasm-host userspace binary + system-wasm-host.cue + make run-wasm-host empty-module instantiation, carrying a one-time userspace ABI bump for wasmi’s ~3 MiB BSS; (2) the Preview 1 stdout-only import resolver in capos-wasm/src/wasi/preview1.rs (46 imports, args_get/environ_get empty, clock_time_get backed by Timer, proc_exit via capos_rt::syscall::exit, fd_write 4 KiB iov-total + 1 KiB per-call ceiling through Console; everything else returns ERRNO_NOSYS = 52); (3) the Rust demos/wasi-hello-rust/ crate with system-wasi-hello-rust.cue + the manifest-supplied payload reader in capos-wasm/src/payload.rs; (4) the C demos/wasi-hello-c/ smoke built directly against system clang-18 + wasi-libc (no libcapos/POSIX work needed – the wasm-host payload-load path from sub-slice 3 carries the C .wasm payload unchanged). Reproduction: make run-wasi-hello-rust, make run-wasi-hello-c, and make run-wasm-host for the empty-module regression. Phase W.3 (per-instance CapSet + LaunchParameters) is the next selectable phase.

2026-05-03

System Configuration Slice 3 closed

  • 2026-05-03 21:54 UTC, commit a50f610d: the System Configuration and Operator Extensibility track’s Slice 3 closed. Every owned focused-proof manifest in the inventory declares its own CUE package and imports capos.local/cue/defaults; the manifest decoder rejects unknown document-root fields with typed Error::UnknownField (pinned by system_manifest_rejects_unknown_root_field and system_manifest_accepts_only_known_root_fields); and the operator overlay worked example covers every defaults-package extension hook (MOTD, console password verifier, additional authorized SSH keys, additional seed accounts, additional resource profiles, additional binaries, additional services), verified by a 1808-byte manifest.bin delta when the worked-example overlay is dropped at repo root and make manifest is rerun in package mode. Reproduction: cargo test-config (348 tests), make manifest, make run, and the per-manifest make run-* targets named in the Slice-3 inventory table. Residual successor scope: system-measure.cue migration is owned by docs/backlog/scheduler-evolution.md; system-paperclips.cue and system-adventure.cue are demo-owned.

2026-05-02

Thread-Scale Honest Scaling Proof

  • 2026-05-02 21:38 UTC, against main commit 374f8556: the formal capOS+Linux thread-scale evidence pair was collected on the benchmark VM as the gate before Phase D. Both runs pinned to physical-core logical CPUs 0,1,2,3 on a 4-core/8-thread n2-highcpu-8 host with KVM, five runs per case, same repaired benchmark shape (blocking parent join, 262,144 blocks / 16 MiB, work_rounds=64).

    ComparisoncapOSLinux pthreadcapOS gate
    1->2 work1.883x1.988x>= 1.6x
    1->2 total1.787x1.987x>= 1.6x
    1->4 work1.566x3.963x>= 1.6x (diagnostic)
    1->4 total1.538x3.858x>= 1.6x (diagnostic)

    The 1->2 gates passed against the then-current single-global-queue scheduler. The 1->4 rows are the bottleneck-attribution diagnostic that justified Phase D’s fair-share enqueue policy: Linux scaled near-linearly on the same physical CPU set, so the workload shape was sound and the gap was a capOS scheduler bottleneck. Phase D later reduced the gap (see 2026-05-10 entry above for the post-Phase D result and the Bottleneck analysis + planned-architecture-changes block).

    Raw artifacts: target/thread-scale/20260502T213544Z/ and target/linux-thread-scale/20260502T213445Z/. Reproduction: make run-thread-scale (with CAPOS_THREAD_SCALE_RUNS=5 etc.) and make run-linux-thread-scale-baseline. Host: internal benchmark VM in single GCP zone, n2-highcpu-8, nested virtualization, kernel Linux 6.17.0-1012-gcp x86_64, CPU Intel(R) Xeon(R) CPU @ 2.80GHz, qemu-system-x86_64 8.2.2, rustc 1.97.0-nightly (c935696dd 2026-04-29).

Measure Mode Repair

  • 2026-05-02 20:23 UTC, commit 08c54075: make run-measure is green again. Two cumulative regressions: the thread-lifecycle measure-mode binary started requiring vm (VirtualMemory) and frames (FrameAllocator) caps when the park unmap/reuse smoke landed (a7af0e37, 765c6c26) but system-measure.cue was never updated; the run_park_process_exit_cleanup path called capos_rt::syscall::exit(0) to terminate the entire process, but that syscall became per-thread in 214c8e11, so the parent thread exited while the parked child kept the process alive. Repair: add the missing cap entries, bump the smoke assertion from 5 caps to 7 caps, retire the broken park-exit path, and route measure-mode exits through the same exit_last_thread ThreadControl flow as the spawn smoke. Closeout: make run-measure exits 0 in 32s with the full measure: ... segment/scheduler/timer/lock attribution intact; all other validation gates passed.

2026-05-01

In-Process Threading Scalability

  • 2026-05-01 14:58 UTC, commit 136b72de: the In-Process Threading Scalability milestone reached accepted controlled evidence only after the benchmark shape was repaired (the old 1 MiB / spinning-parent shape failed to scale even on Linux pthread at four workers). Harness defaults are now blocking parent join, 262,144 blocks (16 MiB), work_rounds=64. Controlled native-Linux evidence on a physical CPU set validated the repaired shape (1->2 1.991x work / 1.990x total; 1->4 3.958x / 3.834x). Controlled capOS evidence on the same CPU set passed both enforced 1->2 gates with 1.828x / 1.687x work/total. Unsuppressed 1->4 diagnostic recorded 3.029x / 2.386x; switch-log-suppressed 3.272x / 2.303x, showing serial scheduler switch logging materially distorts four-worker work timing. Four-core capOS scaling was not declared a closed claim – guest-measure evidence showed remaining global Scheduler lock contention plus exit/join/block/schedule overhead in total time. This was the diagnostic stepping-stone that motivated Phase D’s WFQ run queues and the Phase F.5 architecture work listed in the 2026-05-10 entry.
  • Same branch tightened caller-aware child publication for the repaired blocking-parent benchmark: publication avoids the caller only when another active ready scheduler CPU has a strictly lower non-idle dispatch load; equal-load ties keep an active-ready caller CPU instead of falling through to CPU0.

Diagnostics and Scheduler Support

  • 2026-05-01 07:28 UTC, commit d8d9dab1: benchmark attribution added guest-measure phase counters, host-summary work/total speedup gates, guest PC sampling, benchmark-only userspace symbol maps, resolved user-pc-symbols.log reports, the Linux pthread baseline, larger workload / Amdahl controls, logging-suppression A/B support, and first-slice shared-kernel lock counters for frame-allocator and ring-dispatch paths.
  • 2026-05-01 05:24 UTC, commit a88e7906: scheduler support landed as incremental slices, not milestone closeout – bounded per-scheduler-CPU runnable queues, queue reservation accounting, bounded idle-to-runnable wake targeting, wake/reschedule attribution, stale runnable / direct target cleanup proofs, a SchedulerDispatch substate separating dispatch ownership from shared thread metadata, and per-thread runtime/virtual-runtime accounting.

2026-04-30

Multi-Process SMP Concurrency

  • 2026-04-30 09:45 UTC, commit 3fb89923: Multi-Process SMP Concurrency closed. Worker elapsed reporting uses scaled user-mode cycle counts; prime-counting ranges remain contiguous while balancing upper-range cost. Accepted KVM-backed run in target/smp-process-scale/cycle-balanced-default/: medians smp1=1693, smp2=1053, smp4=2314, or 1.608x 1-to-2 speedup. Ordinary run-smoke and run-spawn under -smp 2 passed.

2026-04-28 / 2026-04-29

Session-Bound Invocation Context core gates

  • 2026-04-29 08:40 UTC: Session-Bound Invocation Context landed its core gates: process-session invariant, default endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, field-granular disclosure gating, session expiry for broker-issued shell bundle caps, guest bundle narrowing, chat membership keyed by opaque caller-session references, Aurelian player state keyed by live endpoint caller-session metadata, and terminal output liveness checks. Terminal/stdio bridge completion and final service-scoped reference derivation/rotation remained open.

2026-04-25

SMP Phase C – multi-CPU scheduling proof

  • 2026-04-25 11:47 UTC: SMP Phase C AP scheduler-owner proof closed. AP cpu=1 can run scheduler-owned user contexts while the BSP stays in kernel idle behind a one-way scheduler-owner latch (review-fix commit d88bca7). Per-CPU KernelGsBase + swapgs, PIT-calibrated xAPIC LAPIC timer/IPI, resident-mask TLB shootdown (vector 49), and split scheduler current-thread tracking landed across the day on the smp-phase-c-* branches (swapgs, lapic-ipi, tlb-shootdown, scheduler-ownership). Per-CPU run queues, reschedule IPIs, concurrent scheduler-owned work on more than one CPU, and per-thread rings (Ring v2) remained Phase C follow-ups.

Telnet Shell Demo

  • 2026-04-25 20:25 UTC, reviewed merge 2834bfc: Telnet Shell Demo closed. Adds telnet-gateway, system-telnet.cue, make run-telnet, make qemu-telnet-harness, proving QEMU host-local forwarding from 127.0.0.1:2323 to guest port 23, password login, caps, session, and clean exit through a socket-backed TerminalSession. The child shell transcript proves no raw NetworkManager, ProcessSpawner, TCP, or unknown capability interfaces. Scoped gateway authority follow-up remains open.

SMP Phase B – APs running

  • 2026-04-25 06:59 UTC: SMP Phase B closed. AP startup uses Limine MpRequest/MpInfo::bootstrap, stable AP records, AP-owned kernel/IST stacks, AP-local GDT/TSS state, capOS kernel PML4 handoff, AP-owned kernel RSP handoff, shared IDT, KernelGsBase, syscall MSRs, SMEP/SMAP state, and a parked interrupt-disabled hlt loop.

SMP Phase A and user-buffer protection

  • 2026-04-25 05:36 UTC: SMP Phase A closed. The BSP has a concrete PerCpu for syscall-stack state and current-thread mirroring; kernel-entry stack updates flow through one per-CPU hook.
  • 2026-04-25 04:00 UTC: workplan/user-buffer-validation-protection closed the private process-buffer validate_user_buffer TOCTOU finding. AddressSpace now owns validation plus HHDM-backed user copy/read helpers under the process address-space mutex.
  • 2026-04-25 03:36 UTC: final review of workplan/futexspace-private-wait-wake fixed a park-ownership bug where a sibling thread could drain a park SQE and park the wrong ThreadRef. Fix requires CAP_SQE_THREAD_OWNED plus the owning thread id for CAP_OP_PARK.

2026-04-24

  • 2026-04-24 22:41 UTC: in-process threading design freeze closed. Thread/process ownership and park authority contracts frozen; review findings fixed before merge.
  • 2026-04-24 20:53 UTC: runtime prerequisites for threading and Go closed, with follow-up rejecting writable-executable user mappings in anonymous VirtualMemory / MemoryObject paths and QEMU smoke coverage.
  • 2026-04-24 16:45 UTC: kernel networking smoke closed – QEMU virtio-net path proves modern transport discovery, virtqueue setup, descriptor completion, ARP, ICMP echo, smoltcp handoff, static IPv4, and host-backed TCP HTTP GET.
  • 2026-04-24 13:11 UTC: custom userspace target closed. Userspace artifacts build through targets/x86_64-unknown-capos.json; kernel stays on x86_64-unknown-none.
  • 2026-04-24 11:25 UTC: boot-manifest parser scope tightened by KernelBootstrapManifest, which decodes only kernel-owned fields and avoids materializing the init-owned service graph. Boot package boundary cleanup followed (10:53 UTC).
  • 2026-04-24 03:06 UTC: Ring-as-Black-Box closed. QEMU debug-tap builds export bounded metadata-only ring records; tools/ringtap-viewer/ renders correlated SQE/CQE evidence offline.
  • 2026-04-24 02:16 UTC: shared service harness extraction closed for non-speculative duplicated demo-service pieces.
  • 2026-04-24 00:34 UTC: dependency policy gate restored by allowing BSD-3-Clause for the current Argon2 closure with rationale in docs/trusted-build-inputs.md.

2026-04-23

  • 2026-04-23 22:05 UTC: Verified Core closed. make kani-lib runs the bounded local/GitHub Kani model-checking gate; make kani-lib-full adds the high-memory transfer model-checking gate. These are bounded model checks (small input sizes such as <=8 frames and 63 ELF bytes), proving the harnessed invariants within those bounds rather than for all inputs. The companion Loom check (cargo test-ring-loom) exercises a bounded concurrency model of the ring protocol, not the shipped kernel/src/cap/ring.rs.
  • 2026-04-23 21:30 EEST: boot-to-shell milestone closed. Default make run reaches setup/login, volatile credential creation, password-authenticated session minting, broker-issued shell bundles, redacted auth/session audit records, and an interactive native shell REPL over serial. capos-shell is init on shell-led manifests; anonymous shell starts on boot; login/setup mint authenticated operator bundles.
  • 2026-04-23 16:34 UTC: split UART shell session closed. make run presents login/native shell on terminal UART while kernel/debug output goes to target/qemu-console.log. The revocable read milestone closed in the same window – make run-revocable-read proves a parent can revoke a child-local BootPackage grant through CapabilityManager.

2026-04-20 To 2026-04-22

  • 2026-04-22 23:50 UTC: AP-independent review remediation closed issues around endpoint owner cleanup, ProcessSpawner badge attenuation, ProcessSpawner heap-OOM paths, queued release semantics, pinned CUE toolchain enforcement, stale authority docs, spawn hardening stability, NMI IST coverage, MemoryObject replacement for raw frame grants, generated capnp ownership, and manifest validation modularization.
  • 2026-04-21 22:21 EEST: VirtualMemory quota review finding resolved with per-address-space ownership tracking, holder quota, bounded auto-placement probes, owned-range checks, and QEMU coverage.
  • 2026-04-21 18:46 EEST: smoke demos moved into a nested demos/ userspace workspace; system.cue packages each demo as a distinct release-built binary/service.
  • 2026-04-21 16:56 UTC: cross-process CAP_OP_CALL/RECV/RETURN flow closed. Allocation-free synchronous ring dispatch (12:28 UTC), SMEP/SMAP, cap_enter blocking waits with timeout, Endpoint, and RECV/RETURN routing (01:15 UTC) closed in the same window.
  • 2026-04-20 23:05 EEST: Phase 0 and Phase 1 cleanup closed – dead-code cleanup, ELF validation hardening, deterministic error paths, corrupted ring recovery policy, capos-lib split, host tests.

2026-04-05

  • 2026-04-05 16:39 EEST: Stage 4 and Stage 5 direction took shape. Capability invocation moved from direct calls toward the shared-memory ring and cap_enter; preemptive scheduling was documented after the PIT/context-switch scheduler landed; stale cap_call proposal text was replaced with the ring-based model.
  • 2026-04-05 17:08 EEST: planning surface matured. The roadmap moved out of README.md, review findings split into their own log, and the project added userspace-binaries, SMP, Go runtime, cloud deployment, storage/naming, service architecture, GPU, error-handling, and persistence planning.
  • 2026-04-05 10:35 EEST: design grounding expanded through prior-art research on seL4, Zircon, Plan 9/Inferno, EROS/CapROS/Coyotos, Genode, and LLVM target customization. That research fed the interface-as-permission decision.
  • 2026-04-05 02:17 EEST: manifest/config and init-side planning advanced with a no-std manifest config loader, hardening tests, and early init-side manifest parsing demos.

2026-04-04

  • 2026-04-04 21:02 EEST: capOS bootstrapped as a Limine-loaded Rust kernel with serial output, then gained the first Cap’n Proto capability invocation path and a staged implementation roadmap.
  • 2026-04-04 23:12 EEST: Stage 1 through Stage 3 landed in rapid succession – virtual memory with kernel remapping and isolated process address spaces; Ring 3 user-space transition through GDT/TSS/syscall setup; and process abstraction with ELF loading, per-process address spaces/cap tables, static init, and QEMU auto-exit proof.
  • 2026-04-04 23:57 EEST: the first major design proposals appeared, including userspace TCP/IP networking and capability-based service architecture. Networking was split into its own proposal after review.

Build, Boot, and Test

The commands below are the current local workflow for the x86_64 QEMU target. The root Cargo configuration defaults to x86_64-unknown-none, so host tests must use the repo aliases instead of bare cargo test.

Prerequisites

Expected host tools:

  • Rust nightly from rust-toolchain.toml
  • make
  • qemu-system-x86_64
  • xorriso
  • curl, sha256sum, and standard build tools for pinned tool downloads
  • Go, used by the Makefile to install the pinned CUE compiler when needed
  • A Telnet client for the optional focused loopback shell demo
  • Chromium, Chromium Browser, or Google Chrome for the optional remote-session CapSet browser UI automation
  • Optional policy and proof tools for extended checks: cargo-deny, cargo-audit, cargo-fuzz, cargo-miri, and cargo-kani

The Makefile pins and verifies:

  • Limine at the commit recorded in Makefile
  • Cap’n Proto compiler version 1.2.0
  • CUE version 0.16.0

Pinned repo-selected tools are installed under CAPOS_TOOLS_ROOT, which defaults to the per-user $HOME/.capos-tools cache. Override CAPOS_TOOLS_ROOT=/path/to/cache when a host needs a different cache location.

Build the ISO

Use the default target when you need the current bootable capOS image.

make

This builds:

  • the kernel with the default bare-metal target;
  • the standalone init userspace binary used by focused spawn proofs;
  • release-built demo service binaries under demos/;
  • the capos-rt userspace binaries, including the shell proof;
  • manifest.bin from system.cue;
  • capos.iso with Limine boot files.

Relevant files: Makefile, limine.conf, system.cue, tools/mkmanifest/.

Compare Build Provenance

Use make build-provenance to write the local build record at target/build-provenance.txt. To compare two retained records locally:

make build-provenance-compare \
  BASE_PROVENANCE=path/to/base-build-provenance.txt \
  CANDIDATE_PROVENANCE=path/to/candidate-build-provenance.txt

The comparison ignores the generated timestamp and allowed local path-root movement under worktree target/ directories or .capos-tools/ caches. It fails for material provenance drift, including source commit changes, manifest or artifact hash changes, embedded binary hash changes, OVMF identity/hash changes, Rust compiler date/commit changes, host-tool version or package identity changes, and operating-system identity changes.

For PR base-vs-head CI comparison, use the environment policy:

make build-provenance-compare \
  BUILD_PROVENANCE_COMPARE_POLICY=ci-environment \
  BASE_PROVENANCE=path/to/base-build-provenance.txt \
  CANDIDATE_PROVENANCE=path/to/candidate-build-provenance.txt

That mode allows expected source and artifact hash changes while still failing for runner, GitHub-hosted image, Rust, selected-tool, package-identity, OVMF selection, and OVMF hash drift. It is a PR environment gate, not a production reproducibility claim.

Local synthetic comparison checks may create scratch records under target/provenance-fixtures/ or Python bytecode caches under tools/. Clean those scratch artifacts with:

make build-provenance-compare-clean

Boot QEMU

Use the default run targets to boot either the operator-facing system or the scripted login-path smoke.

make run
make run-smoke

make run is the operator-facing boot path. It builds the ISO with the qemu feature, boots QEMU with the interactive terminal UART on stdio, attaches virtio-net with host-local remote CapSet forwarding, and writes the separate kernel/debug UART log to target/qemu-console.log. The run output prints the actual forwarded port as remote CapSet: tcp 127.0.0.1 <port> -> guest :2327.

The plaintext loopback Telnet research demo was a Phase B fixture, not part of the default operator path. make run-telnet and make run-telnet-vm are now retired because they depended on the removed qemu-only kernel TCP listener. Use the SSH gateway smokes for current remote-shell coverage and rebuild any socket-backed terminal proof on the Phase C userspace network stack before using it as validation.

The same make run boot starts the remote-session CapSet gateway. To run the host CLI against it, use the printed port:

cargo run --manifest-path tools/remote-session-client/Cargo.toml \
  --target x86_64-unknown-linux-gnu \
  --bin remote-session-client -- --host 127.0.0.1 --port <port>

Add --launch-adventure to that command when you want the CLI to start the default-manifest Adventure service graph through serviceLaunch and require a running status.

To run the trusted local web bridge against the same QEMU instance:

CAPOS_REMOTE_SESSION_PORT=<port> make remote-session-ui

Then open http://127.0.0.1:3337/. The Rust bridge holds the TCP stream, remote session state, and backend-held remote CapSet; the browser receives only view models, launch/status descriptors, denial diagnostics, call results, and redacted transcript rows. The former automated focused proof, make run-remote-session-capset-ui, is retired because it depended on the removed qemu-only kernel TCP listener. The replacement browser proof belongs to the future Phase C Web UI L4 gate.

A Tauri desktop wrapper is available as a repo-local check/dev layer over the same Rust backend. The repo-local make remote-session-tauri target first runs a policy preflight over the reviewed scaffold, then checks for the Tauri CLI and Linux build prerequisites, reports dependency and scaffold status, and runs a deterministic wrapper check when those prerequisites are present. It follows the official Tauri v2 Linux prerequisite shape, including WebKitGTK 4.1, libxdo, OpenSSL, AppIndicator, and Rsvg development packages where applicable. Missing dependencies fail with explicit diagnostics and point back to the supported local web bridge. The operator command shape is:

CAPOS_REMOTE_SESSION_PORT=<port> make remote-session-tauri

Set CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev. CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh runs only the scaffold guardrail and does not require Tauri system packages or a desktop session. CAPOS_REMOTE_SESSION_TAURI_MODE=package and CAPOS_REMOTE_SESSION_TAURI_MODE=automation are intentionally blocked with diagnostics describing the remaining packaging and desktop-automation review work. This policy preflight proves only that the current wrapper remains a check/dev scaffold with packaging disabled, the loopback URL pinned, a single main window, default core:default permission scope, and no app-specific Tauri command/plugin authority. It is not a distributable packaging or desktop automation proof. make remote-session-ui remains the supported fallback host UI path and uses the same backend-held authority boundary.

Default make run starts chat, the remote-session gateway, and shell services. It embeds Adventure server/NPC/client binaries and the terminal Paperclips binary. The current remote-session Adventure slice makes serviceLaunch a real restricted backend launch in that default manifest: the trusted backend/gateway starts adventure-server plus simple NPC companions through an approved service-runner profile and attaches or retains backend-held descriptors/caps for the Adventure/chat-facing services. Run it by starting make run, noting the printed remote CapSet forwarding port, and then using either the host CLI or CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui. make run-paperclips remains the focused authoritative Paperclips server proof; default-manifest Paperclips launch is not implemented by this slice. Raw ProcessSpawner, process owner handles, endpoint owner caps, local cap ids, result-cap slots, and browser-held capOS caps are non-goals for this UI path. Process handles stay backend-local.

GCE Web UI Proof Target Map

Use the selected-milestone proof targets below to choose the narrowest evidence gate for the GCE self-hosted Web UI ladder. Local QEMU/cloudboot targets do not prove live provider reachability, and private GCE targets do not authorize public ingress or TLS exposure.

Proof classTarget or command shapeProvesClosest non-goal
Landed local Phase C L4 substratemake run-cloud-prod-userspace-network-stack-smoltcpA non-qemu cloudboot kernel under QEMU starts the userspace smoltcp network-stack process and completes one hostfwd TCP request/response through a userspace-served TcpListenAuthority. See cloud-prod-userspace-network-stack-smoltcp-local-proof.Does not prove DHCP/IPv4 configuration, remote-session-web-ui, live GCE reachability, or public ingress.
Landed local IPv4 configurationmake run-cloud-prod-network-stack-dhcp-ipv4-configThe userspace network-stack process acquires the QEMU SLIRP DHCPv4 lease, serves NetworkManager.getConfig, installs the default route, and resolves gateway plus same-subnet ARP neighbors. See cloud-prod-network-stack-dhcp-ipv4-config-local-proof.Does not prove a Web UI listener bound through that route, live GCE reachability, DNS, TLS, or public exposure.
Retired legacy local self-served Web UImake run-remote-session-self-served-web-uiPre-Phase-C proof that served the immutable full UI bundle from a focused QEMU manifest through the kernel tcp_listen_authority socket owner. The target is not current production L4 evidence after cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal retires that kernel owner.Does not prove the non-qemu cloudboot Phase C L4 path, and should not be used as a passing selected-milestone gate until rebuilt on the userspace network-stack substrate.
Landed cloudboot Web UI authority inventoryNo run target; docs-status contractThe Gate 1B inventory records the required and forbidden remote-session-web-ui grants, trusted listener/source metadata, browser-visible forbidden markers, and expected local L4 proof markers. See remote-session-webui-cloudboot-authority-inventory.Does not prove runtime listening, browser automation, GCE reachability, or public operator access.
Landed local cloudboot Web UI L4 proofmake run-cloud-prod-remote-session-web-ui-l4 owned by cloud-prod-remote-session-web-ui-l4-local-proofProves remote-session-web-ui listens on guest port 8080 through the Phase C L4 path on the non-qemu cloudboot kernel under QEMU: the userspace network-stack process serves the scoped TcpListenAuthority, the Web UI serves the full fixed-name bundle, login, one backend-held capability call, logout, stale-call failure, the manual viewer, and a cloudboot-evidence: remote-session-web-ui-l4 marker.Local cloudboot/QEMU evidence only; it does not prove live GCE NIC reachability, private provider probing, public ingress, TLS, or production release authority.
On-hold private GCE Web UI proofFuture tools/cloudboot/run-test.sh --require-web-ui-proof gate owned by cloud-gce-private-self-hosted-webui-proofMust launch the self-hosted Web UI cloudboot image in the no-public-IP GCE posture, use a reviewed private probe that crosses the live GCE virtual network boundary, record the private endpoint and Web UI/L4 markers, and tear down all created resources.On hold (2026-06-09): the cloudtest credential lacks the firewall IAM a private same-VPC probe needs against GCE default-deny ingress, and the live legacy virtio 0.9 GCE NIC has no reviewed userspace-stack serving story. It will not create public IPs, public firewall rules, DNS, TLS certificates, or browser-facing public operator ingress.
On-hold public ingress/TLS proofFuture tools/cloudboot/run-test.sh --require-public-web-ui-proof gate owned by cloud-gce-public-self-hosted-webui-ingress-tlsAfter explicit authorization, must prove the selected GCE external HTTPS load-balancer ingress posture, Google-managed certificate termination, browser-session hardening, and teardown evidence.On hold. No local target, private proof, or selected milestone status grants public exposure, broad firewall changes, certificate issuance, TLS key custody, or release authority.

make run-smoke preserves the focused legacy shell-led system-smoke.cue verification path. It drives the login and shell session through the terminal UART, captures the kernel log and terminal transcript separately, and checks that the kernel boot-launched only the first init service (capos-shell), granted only the shell bootstrap cap bundle, and then reached the expected audit, shell-bundle, child-isolation, stale-handle, and no-password-echo assertions before QEMU exits. This is distinct from the default system.cue path, where the kernel boot-launches standalone init and init starts operator-facing services.

Spawn Smoke

Use the spawn smoke when changes affect manifest-owned process creation, ProcessSpawner behavior, or bootstrap capability wiring.

make run-spawn

This boots with system-spawn.cue, the focused init-owned manifest retained for ProcessSpawner checks. Only init is boot-launched by the kernel; init uses ProcessSpawner to launch endpoint, IPC, VirtualMemory, Timer, ThreadControl, the single-thread runtime checkpoint, FrameAllocator cleanup, and hostile spawn demo children, wait for ProcessHandles, and exercise hostile spawn inputs. The target captures the kernel log separately and runs tools/qemu-spawn-smoke.sh to assert the single-init boot markers, BootPackage validation, child spawn/exit records, Timer now/sleep and per-process sleep quota proof lines, runtime FS-base proof lines, the single-thread runtime map/protect/unmap plus park-fallback checkpoint, manifest child waits, and clean halt.

Shell and Terminal Smokes

Use these focused QEMU smokes for shell, terminal, credential, and login paths.

make run-shell
make run-terminal
make run-credential
make run-login
make run-login-setup
  • make run-shell boots the focused system-shell.cue manifest (no pre-provisioned verifier) and exercises the shell entirely in its anonymous session: CapSet listing, typed capability inspection, typed application-error display, anonymous-session metadata, the anonymous launcher rejecting spawn-test because its allowlist is empty, and clean exit.
  • make run-terminal boots the focused system-terminal.cue manifest and exercises the TerminalSession substrate: visible and hidden echo input, bounded readLine, structured cancellation, and stale-input scrubbing between prompts.
  • make run-credential boots the focused CredentialStore proof manifest.
  • make run-login boots the focused password-login manifest and proves the shell’s login command prompting for username> before hidden password>, failing generically on a wrong password, succeeding for the demo account, swapping from the anonymous bundle to the operator bundle, and performing exact-grant child launch plus stale-handle release.
  • make run-login-setup boots the no-password first-boot setup manifest and proves that setup creates a volatile credential, discloses that volatility, chains into the login upgrade path, and reaches the same narrow operator shell bundle.

Durable account storage and multi-verifier local accounts are still future work; the current username-aware login path selects the manifest-seeded operator-kind account and any volatile first-boot credential record that setup creates.

Focused Service Smokes

Use these targets to prove resident services and demo clients still launch through the intended shell-granted authorities.

make run-chat
make run-adventure
make run-paperclips
make run-revocable-read
make run-memoryobject-shared
make run-ringtap-failing-call
  • make run-chat boots the focused First Chat manifest and proves a shell-spawned client can send a line through the resident singleton chat service using the broker-issued operator chat endpoint and observe the resident bot reply.
  • make run-adventure boots the focused adventure manifest and proves the shell-spawned client can drive the current scripted mission through explicit StdIO, adventure, and chat endpoint grants.
  • make run-paperclips boots the focused Paperclips terminal demo manifest, authenticates the shell, starts Paperclips server services, first launches the clean-room terminal client with explicit StdIO plus the normal PaperclipsGame endpoint, proves normal server authority cannot invoke run <ms>, rejects a forged proof_accelerator: @timer grant, then relaunches against the proof server endpoint with the explicit proof_accelerator proof authority for the accelerated transcript. The server owns generated content, game state, regular timer cadence, unlock checks, and game-rule mutation, and server-mode client help is rendered from structured server command specs. That transcript rejects an early locked autoclipper purchase, rejects an over-budget wire purchase, rejects bulk manual production, rejects a high-price sale with zero current demand, rejects manual production after automation drains wire, drives one-at-a-time manual production, explicit sales, repeatable marketing, autoclipper unlock, real-time automation, generated typed Cap’n Proto content loading, scaled business-phase production, precision-rollers, design-search, forecast-engine, the survey-drones transition to == autonomous phase ==, representative autonomous drone/factory scaling with local-matter conversion and additional clip production, the mesh-coordination and seed-probes cosmic transition, bounded probe replication and production, locked final-conversion, and clean client/shell exit.
  • make run-revocable-read exercises the revocation transcript for endpoint and boot-package authority loss.
  • make run-memoryobject-shared proves MemoryObject-backed parent/child sharing and cleanup.
  • make run-ringtap-failing-call enables debug_tap, drives a known typed launcher failure, and runs the ringtap viewer over the captured log.

Networking and Measurement Targets

Use these targets for the current network proof path and benchmark-only measurement image.

make run-net
make qemu-net-harness
make run-measure
  • make run-net attaches a QEMU virtio-net PCI device and exercises current PCI enumeration, virtio transport setup, and TX descriptor completion diagnostics, plus ARP resolution and ICMP echo validation against the QEMU user-mode gateway.
  • make qemu-net-harness runs the scripted net smoke path.
  • make run-measure enables the separate measure feature for benchmark-only counters and cycle measurements. It boots system-measure.cue, where init spawns ring-nop and grants the measurement-only NullCap and ParkBench caps through ProcessSpawner. The demo prints ring/NullCap baselines plus a park-shaped comparison between compact authority-checked SQEs and generic Cap’n Proto methods. The kernel summary includes per-segment dispatch counts, total cycles, and averages for SQE processing, validation, cap lookup, capnp decode, method body dispatch, CQE posting, and waiter wake/check. Do not treat it as the normal dispatch build.

Formatting and Generated Code

Use these local checks before claiming source formatting or generated artifacts are current.

make fmt
make fmt-check
make generated-code-check
  • make fmt formats the kernel workspace plus standalone init, demos, and capos-rt crates.
  • make fmt-check verifies formatting without modifying files.
  • make generated-code-check verifies checked-in Cap’n Proto generated code against the repo-pinned compiler path and checks generated adventure plus Paperclips content against their CUE sources.

Host Tests

Use these host-side checks for shared logic and userspace build surfaces that do not require a QEMU boot.

cargo test-config
cargo test-ring-loom
cargo test-lib
cargo test-mkmanifest
tools/check-userspace-runtime-surface.sh
make capos-rt-check
make init-capos-build
make demos-capos-build
make shell-capos-build
make capos-rt-capos-build
  • cargo test-config runs shared config, manifest, ring, and CapSet tests on the host target.
  • cargo test-ring-loom runs the bounded Loom model for SQ/CQ protocol invariants.
  • cargo test-lib runs host tests for pure shared logic such as ELF parsing, capability tables, frame allocation, and related property tests.
  • cargo test-mkmanifest runs host tests for manifest generation.
  • tools/check-userspace-runtime-surface.sh verifies capos-rt owns the userspace entry, panic, allocator, and raw syscall surface.
  • make capos-rt-check builds the standalone runtime smoke binary against targets/x86_64-unknown-capos.json, matching the userspace target used by the boot image.
  • make init-capos-build, make demos-capos-build, make shell-capos-build, and make capos-rt-capos-build expose focused custom-target build wrappers for the booted userspace crates and runtime smoke binary.

Extended Verification

Use the extended verification set for shared logic, dependency policy, fuzz targets, and bounded proof gates that are heavier than the normal host-test loop.

make dependency-policy-check
make fuzz-build
make fuzz-smoke
make kani-lib
cargo miri-lib

These require optional tools. Use them when changing dependency policy, manifest parsing, ELF parsing, capability-table/frame logic, or proof-covered shared code. make dependency-policy-check covers Rust deny/audit checks and the docs Node lockfile/audit gate with npm lifecycle scripts disabled. See the Security and Verification Proposal for the rationale behind the extended verification tiers. make kani-lib runs the bounded mandatory cap-table/frame gate.

Validation Rule

For behavior changes, a clean build is not enough. The relevant QEMU process must exercise the behavior and print observable output that proves the path works. make run-smoke is the default login-path gate; make run-spawn, make run-shell, make run-terminal, make run-credential, make run-login, make run-login-setup, make run-chat, make run-adventure, make run-paperclips, make run-revocable-read, make run-memoryobject-shared, make run-net, make qemu-net-harness, make run-ringtap-failing-call, or make run-measure are additional gates for their specific features.

Benchmarks

capOS benchmark rows are evidence records. Each row should say what workload ran, what was verified, how time was measured, what machine envelope was used, and where the raw artifacts were stored. A faster row whose verifier did not complete is not a performance result.

The broader benchmark model is in System Performance Benchmarks. Future parallel-pattern coverage is in HPC Parallel Processing Patterns.

Current CPU Workloads

capOS currently has two CPU-scaling workloads:

WorkloadTargetTimed regionVerifierPrimary use
run-smp-process-scaleIndependent worker processesworker compute only, after setup and before result reportingaggregate prime count and checksumExercises multiple process-owned rings running CPU work on more than one scheduler CPU.
run-thread-scaleSibling threads in one processchecksum work window, separate from spawn/join/shutdown totalsdeterministic root checksum and metadata checksMeasures same-process thread scheduling, per-thread rings, and scheduler overhead.

Both workloads keep serial and harness artifacts under target/. The capOS rows below were collected under QEMU/KVM. The matching Linux rows use the same workload shape where possible, but units differ by harness and should not be compared directly across systems. Compare speedup ratios within a row.

Process-Scale SMP

make run-smp-process-scale boots a focused manifest, runs independent worker processes, and times the CPU-bound worker window. Each worker owns its own process ring. The timed section avoids syscalls and serial output; the coordinator verifies the aggregate result after workers finish.

The current workload counts primes over 2..3_000_000 using balanced contiguous splits. capOS reports a worker-side user-mode cycle counter shifted right by 20 bits. Linux reports guest clock_gettime nanoseconds.

Controlled benchmark-VM reruns were recorded on GCE n2-highcpu-8 at capOS commit 0d89a91b (2026-04-30 11:09 UTC) with nested QEMU/KVM on Ubuntu 6.17.0-1012-gcp, QEMU 8.2.2, Rust nightly 1.97.0-nightly (c935696dd 2026-04-29), and host logical CPUs 0,1,2,3 mapped to distinct physical cores with SMT siblings 4,5,6,7.

Systemsmp1 mediansmp2 mediansmp4 median1-to-2 speedup1-to-4 speedup
capOS1,639 scaled cycles875 scaled cycles1,111 scaled cycles1.873x1.475x
Linux1,275,187,210 ns659,218,025 ns337,877,986 ns1.934x3.774x

The capOS 4-vCPU row improved over the 1-vCPU row but was slower than the 2-vCPU row. Linux continued improving through 4 vCPUs under the same pinning and workload. Raw capOS artifacts are under target/smp-process-scale/pinned-20260430T1113Z/; raw Linux artifacts are under target/linux-smp-process-scale/pinned-20260430T1118Z/.

SMT Run

The same harness can run an eight-logical-CPU case on the benchmark VM. That machine exposes four physical cores and eight SMT threads, so the smp8-smt row is an SMT measurement on a 4-core host.

The SMT run was recorded at commit 7c15dd47 (2026-04-30 11:45 UTC) with QEMU pinned to logical CPUs 0,1,2,3,4,5,6,7.

Systemsmp1 mediansmp2 mediansmp4 mediansmp8-smt median
capOS1,500 scaled cycles787 scaled cycles1,052 scaled cycles1,595 scaled cycles
Linux1,274,507,854 ns647,611,418 ns337,479,795 ns198,903,231 ns
System1-to-2 speedup1-to-4 speedup1-to-8 speedup
capOS1.906x1.426x0.940x
Linux1.968x3.777x6.408x

Raw capOS SMT artifacts are under target/smp-process-scale/smt8-20260430T1148Z/. Raw Linux SMT artifacts are under target/linux-smp-process-scale/smt8-20260430T1151Z/.

In-Process Thread Scaling

make run-thread-scale runs sibling threads inside one process. Child threads use per-thread rings. The workload computes fixed-size checksum blocks; the default shape is a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64.

The harness records both a work-window time and a total time. The work window brackets the checksum computation. Total time includes thread startup, synchronization, shutdown, and join overhead. For scheduler analysis, both numbers matter: work speedup shows CPU placement and dispatch during the syscall-free section, while total speedup shows the cost of the surrounding thread lifecycle.

The old 1 MiB workload with a spinning parent is historical only because the matching Linux pthread baseline also stayed flat at four workers. The current rows use the repaired 16 MiB blocking-parent shape unless noted.

Recorded evidence:

System / modePlacementRuns1-to-2 work1-to-2 total1-to-4 work1-to-4 totalNotes
Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC)physical-core logical CPUs 0,1,2,351.996x1.995x3.974x3.850xSame checksum workload and pin set as the 2026-05-10 capOS row.
capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC)physical-core logical CPUs 0,1,2,351.809x1.774x3.088x2.700xPer-thread weights/latency classes, per-CPU WFQ queues, bounded steal path.
Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC)physical-core logical CPUs 0,1,2,351.988x1.987x3.963x3.858xSame repaired workload before Phase D.
capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC)physical-core logical CPUs 0,1,2,351.883x1.787x1.566x1.538xShows the four-worker cost of the single global runnable queue.
Linux pthread baseline (2026-05-01 report)physical-core logical CPUs51.991x1.990x3.958x3.834xRepaired-shape baseline recorded in docs/changelog.md; target artifact directory is not named in the source record.
capOS (pre-collapse placement, 2026-05-01 report)physical-core logical CPUs51.828x1.687x3.029x2.386xCommit 136b72de; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record.
capOS, switch logs suppressed (pre-collapse, 2026-05-01 report)physical-core logical CPUs51.913x1.636x3.272x2.303xSame commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record.
capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC)physical-core logical CPUs 0,1,2,3 on the benchmark VM31.890x1.792x1.504x1.436xQueue-collapse row recorded in docs/backlog/scheduler-evolution.md; target artifact directory is not named in the source record.

The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02 pair: blocking parent join, 262,144 blocks, work_rounds=64, five runs, KVM-backed QEMU pinned to physical-core logical CPUs 0,1,2,3, and a matching Linux pthread baseline on the same pin set. Raw capOS artifacts are under target/thread-scale/20260510T193200Z/; raw Linux artifacts are under target/linux-thread-scale/20260510T194600Z/.

The 2026-05-02 capOS/Linux pair used main commit 374f8556; raw capOS artifacts are under target/thread-scale/20260502T213544Z/, and raw Linux artifacts are under target/linux-thread-scale/20260502T213445Z/.

The row improved the four-worker work window from 1.566x to 3.088x and the four-worker total window from 1.538x to 2.700x compared with the single-global-queue row. Linux on the same host and pin set recorded 3.974x work and 3.850x total at four workers. The remaining difference is the scheduler/runtime optimization target for later work.

Guest-side attribution is available with CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It emits aggregate and per-phase measurements for spawn_ready, work, shutdown, and final_total, including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock, network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU profiling is available with CAPOS_THREAD_SCALE_PROFILE=1.

Interpreting CPU Counts

CPU-count rows are meaningful only with a recorded topology:

  • Physical-core rows require enough physical cores for the vCPU count.
  • SMT rows should say they are SMT rows and list the logical CPU set.
  • Pinning QEMU with taskset is useful, but it is not CPU isolation by itself. Stronger runs should record isolcpus/nohz_full/rcu_nocbs, cpuset, or systemd affinity policy when used.
  • Pinning QEMU to fewer host logical CPUs than guest vCPUs measures oversubscription behavior, not core scaling.
  • Current QEMU/KVM results should stay separate from future direct cloud or bare-metal runs.

The current capOS benchmark table reaches four physical-core rows and an eight-logical-CPU SMT row on a 4-core/8-thread VM. It does not yet measure 16-core or 32-core systems.

Next CPU-Scaling Work

The next CPU-scaling milestone should be designed around direct hardware or a dedicated perf runner rather than nested QEMU as the primary evidence source. The benchmark suite needs:

  • hardware discovery records for socket/core/SMT topology, APIC mode, timer source, frequency policy, memory size, and firmware/device model;
  • workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough physical cores, plus separately labeled SMT rows;
  • at least one static map/reduce checksum workload, one uneven dynamic-task workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
  • work-window and total-time reporting for every workload;
  • matching Linux native baselines on the same hardware where a comparable workload exists;
  • scheduler/runtime counters for queue depth, migrations, steals, reschedule IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and runnable but not running time;
  • raw artifacts with source commit, toolchain, kernel config, host topology, run count, warmup policy, and verifier output.

QEMU should remain useful for boot and regression coverage, but it should not be the primary source for a 16/32-core SMP scalability milestone.

Commands

Run the capOS process-scale workload:

make run-smp-process-scale

Run the process-scale workload with QEMU pinned to selected host CPUs:

CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make run-smp-process-scale

Run the process-scale SMT row on a host with at least eight logical CPUs:

CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
  CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
  make run-smp-process-scale

Run the thread-scale workload:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run the larger-workload Amdahl row:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run a one-sample host-side QEMU profiling pass:

CAPOS_THREAD_SCALE_PROFILE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run a one-sample guest-side measurement pass:

CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run only the host summary parser against an existing results.csv without booting QEMU:

CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
  CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
  CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
  CAPOS_THREAD_SCALE_PARENT_WAIT=join \
  CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
  tools/qemu-thread-scale-harness.sh

Run the native Linux pthread baseline for the thread-scale checksum workload:

LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
  make run-linux-thread-scale-baseline

Run the Linux process-scale comparison:

LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
  tools/linux-smp-process-scale-baseline.sh

On hosts where /boot/vmlinuz is not readable by the current user, copy a kernel image into ignored target/ storage first through the host’s normal administrative path, then pass it as LINUX_SMP_SCALE_KERNEL. The script does not invoke sudo itself.

Configuration

The default capOS boot manifest (system.cue at the repo root) is layered on a shared scaffold in cue/defaults/defaults.cue. Operators can extend it without forking either file by dropping a system.local.cue overlay next to system.cue. The overlay is gitignored, so each developer/host can carry their own extensions without conflicting with git pull.

This document is the current operator-facing design for the configuration surface. The historical proposal and closeout rationale live in System Configuration and Operator Extensibility.

How the layering works

mkmanifest --package capos system.cue manifest.bin invokes cue export .:capos --out json against the repo root. CUE’s package mode unifies every non-hidden .cue file in that directory that declares package capos — currently system.cue (committed) and any system.local.cue (gitignored) the operator drops in. The shared scaffold is imported by system.cue:

import defaults "capos.local/cue/defaults"

#Manifest (the value system.cue exports) inherits all defaults from defaults.#DefaultSystem, then applies any operator overrides declared in system.local.cue. The kernel decoder reads concrete fields at the document root (schemaVersion, binaries, initConfig, kernelParams); #Manifest is documentation-only.

The decoder rejects any other top-level field name with a typed error. For an unknown field named kernelParameters the rendered message is:

unknown field `kernelParameters` at $; expected one of `schemaVersion`, `binaries`, `initConfig`, or `kernelParams`

CUE definitions (#Foo) and hidden fields (_foo) are stripped by cue export and never reach the decoder, so this only fires when the manifest projects an unintended visible name onto the document root — a typo such as kernelParameters: … instead of kernelParams: …, or a stale overlay field that was renamed in the defaults package. Fix it by renaming the projected field to one of the four accepted names, by moving the value under kernelParams.…, or by hiding the auxiliary value with a _/# prefix.

Quick start

Copy the committed example and edit:

cp system.local.cue.example system.local.cue
$EDITOR system.local.cue
make run

The Makefile picks up the new file automatically — no flag, no include line. make re-evaluates the manifest because system.local.cue is a prerequisite of the manifest rule.

Common Overlay Tasks

The examples below are complete system.local.cue fragments for common local configuration changes. Each fragment is intended to be copied as a starting point and adjusted before running make run.

Override the MOTD

package capos

#Manifest: kernelParams: motd: """
	hello, capOS dev box.
	type 'login' to authenticate.
	"""

The defaults package declares motd: string | *"...", so a concrete overlay value wins under CUE unification (a more concrete value is strictly more specific than a default).

The system hostname is set the same way via kernelParams.hostname (defaults to capos); it is served by SystemInfo.hostname and shown by the shell hostname command. Bootstrap validation rejects whitespace, control characters, and values longer than 255 bytes.

#Manifest: kernelParams: hostname: "web-01"

Add an authorized SSH key for the host operator

The default manifest declares a single host-operator seed account with the canonical 32-byte principal id local-operator-principal-default. Bind any number of authorized keys to that principal:

package capos

#Manifest: extraAuthorizedSshKeys: [{
	keyId:                "host-laptop-ed25519-2026-04"
	principalId:          "local-operator-principal-default"
	algorithm:            "ssh-ed25519"
	publicKey:            "<32-byte ed25519 public key as ASCII hex>"
	fingerprintSha256:    "<32-byte SHA-256 of the public key as ASCII hex>"
	allowedShellProfiles: ["operator"]
	source:               "manifest"
	comment:              "host laptop"
}]

Convert an existing ~/.ssh/id_ed25519.pub line to the manifest hex fields (Ed25519 example):

# extract the base64-encoded SSH wire format and decode the embedded key
ssh-keygen -e -m PKCS8 -f ~/.ssh/id_ed25519.pub | \
	openssl pkey -pubin -outform DER 2>/dev/null | \
	tail -c 32 | xxd -p -c 64
# fingerprintSha256 — SHA-256 over the same 32-byte raw public key:
ssh-keygen -e -m PKCS8 -f ~/.ssh/id_ed25519.pub | \
	openssl pkey -pubin -outform DER 2>/dev/null | \
	tail -c 32 | sha256sum | awk '{print $1}'

Use the printed hex as the publicKey and fingerprintSha256 strings.

The proposal explicitly avoids auto-ingesting ~/.ssh/*.pub from the Makefile. Manual conversion gives the operator control over which keys are trusted by the boot manifest.

Add a non-operator principal

The single-account-multi-auth invariant fixes the host operator at kind: "operator"; slice 2 rejects manifests with multiple operator seeds. Additional principals must use kind: "guest" or kind: "service":

package capos

#Manifest: extraSeedAccounts: [{
	name:            "kiosk-guest"
	displayName:     "Kiosk Guest"
	principalId:     "kiosk-guest-principal-32-bytes-x" // exactly 32 bytes
	kind:            "guest"
	credentialRefs:  []
	resourceProfile: "operator-default"
}]

Each seed account’s principalId must be unique and exactly 32 bytes; each must reference an existing resourceProfile (either operator-default from the defaults package or one declared in extraResourceProfiles).

Add a custom resource profile

package capos

#Manifest: extraResourceProfiles: [{
	name:                      "kiosk-guest-profile"
	homeQuotaBytes:            0
	tempQuotaBytes:            1048576
	processLimit:              2
	threadLimit:               4
	capLimit:                  24
	memoryCommitLimitBytes:    16777216
	frameGrantLimitPages:      64
	endpointQueueLimit:        8
	inFlightCallLimit:         4
	ringScratchLimitBytes:     16384
	logQuotaBytesPerWindow:    32768
	networkProfile:            "none"
	cpuBudgetUsPerWindow:      10000
	cpuWindowUs:               100000
	timerWaiterLimit:          2
	launcherProfile:           "bootstrap-guest"
}]

Reference the profile name from extraSeedAccounts[].resourceProfile.

Add a binary and an init-launched service

The defaults package exposes extraBinaries and extraServices hooks. The first embeds an additional binary into manifest.bin; the second appends an entry onto initConfig.services so init launches it after the base service graph. Build the binary as part of the operator workflow — the default Make targets only build the binaries already listed in the defaults package.

package capos

#Manifest: extraBinaries: [{
	name: "site-monitor"
	path: "demos/target/x86_64-unknown-capos/release/capos-demo-site-monitor"
}]

#Manifest: extraServices: [{
	name:    "site-monitor"
	binary:  "site-monitor"
	restart: "never"
	caps: [{
		name: "console"
		source: kernel: "console"
	}, {
		name: "timer"
		source: kernel: "timer"
	}],
}]

extraServices is concatenated onto _baseServices (base-first, then operator-extra), so the operator service starts after the defaults’ chat server, remote-session gateway, and shell are launched.

Override the console password verifier

The defaults package ships a development-only Argon2id PHC for the plaintext “capos”. Any non-research deployment should mint a fresh verifier and override it:

package capos

#Manifest: kernelParams: consolePasswordVerifierPhc:
	"$argon2id$v=19$m=19456,t=2,p=1$<salt-base64>$<hash-base64>"

Generate a verifier with the standalone argon2 tool (argon2 "<salt>" -id -t 2 -m 19 -p 1 -e) or from any Argon2id implementation that emits a PHC string with m=19456,t=2,p=1. The canonical 32-byte local-operator-principal-default operator principal id is unchanged; only the verifier rotates.

Host-user injection (@tag(user))

make run exports CAPOS_CUE_USER=$(USER), and mkmanifest forwards it as --inject user=.... When CAPOS_CUE_DISPLAY_NAME is unset, mkmanifest derives displayName from the same account’s first GECOS/comment field in /etc/passwd and forwards it as --inject displayName=.... If the passwd comment is unavailable or empty, displayName falls back to the account name. Other Make targets leave the structured tag variables unset, so untagged system.cue keeps the canonical operator account name. Focused demo and smoke manifests pin their own demo fixtures. The audit-correlatable principalId is fixed to the canonical 32-byte value regardless of host user, so audit history is stable across $USER changes.

mkmanifest also keeps the generic CAPOS_CUE_TAGS comma-separated escape hatch for additional key=value tags. The Makefile sets the structured variables target-scoped to make run only:

run: CAPOS_CUE_USER = $(USER)

Set additional tags via make USER=alice CAPOS_CUE_DISPLAY_NAME='Alice Smith' CAPOS_CUE_TAGS=region=eu-west run or by passing --tag key=value to mkmanifest directly. system.cue consumes user and displayName today; user must be a valid manifest seed account name. Future tags can carry hostname, locale, or other build-environment-derived values without adding new mechanisms.

Tools-root cache

CAPOS_TOOLS_ROOT defaults to $HOME/.capos-tools. The pinned toolchain (capnp, cue, mdbook, typst, limine) lives under that path so multiple capOS clones share a single download. Override with CAPOS_TOOLS_ROOT=/path/to/cache make ... for non-default placement. The Makefile and mkmanifest’s expected_cue_path follow the same default; mismatched CAPOS_CUE / CAPOS_CAPNP env values are still rejected by mkmanifest and make generated-code-check.

Schema-aware data conversion

mkmanifest cue-to-capnp converts CUE-authored data messages into arbitrary specified Cap’n Proto struct roots without routing them through the boot manifest ABI:

make cue-ensure capnp-ensure
CAPOS_CUE="$(make -s cue-path)" \
CAPOS_CAPNP="$(make -s capnp-path)" \
cargo run --manifest-path tools/mkmanifest/Cargo.toml --target "$(rustc -vV | awk '/^host:/ {print $2}')" -- \
	cue-to-capnp --import-path schema input.cue schema/example.capnp Example output.bin

The subcommand accepts the same CUE --package, --tag, and CAPOS_CUE_TAGS inputs as the manifest builder. It also accepts repeated --import-path <dir> or -I<dir> arguments plus --no-standard-import, which are passed to capnp convert as process arguments, not through a shell. The input CUE is first exported to JSON, then the pinned Cap’n Proto tool validates that JSON against the named schema and root struct.

This is the right path for configuration blobs, demo fixtures, or future schema-defined records that are not SystemManifest. It still cannot encode live capOS capability table entries or meaningful Cap’n Proto interface objects; authority transfer remains an IPC/runtime concern.

Limits and non-goals

  • A second kind: "operator" seed account is rejected by the kernel in slice 2; multi-operator support is tracked in User Identity and Policy.
  • The slice-2 overlay is not a replacement for cloud-instance configuration; cloud-metadata-driven manifest deltas are designed in Cloud Metadata.
  • The overlay does not auto-ingest ~/.ssh/*.pub; conversion is manual by design (security review on which keys count).
  • Focused-proof manifest migration onto the defaults package (slice 3, Task 2) is complete: every repo-root system-*.cue manifest declares its own CUE package and imports the defaults package, except system-paperclips.cue and system-adventure.cue (demo-owned, package-less but still importing defaults) and system-measure.cue (held by the measure-mode-repair plan). The Slice-3 inventory table in System Configuration and Operator Extensibility records the per-manifest status, package, and make run-* target.

Repository Map

This map names the main source locations for the current system. It is not an ownership file; use it to find the code behind architecture and validation claims.

Root Files

  • README.md gives the compact project overview.
  • docs/roadmap.md records long-range stages and broad feature direction.
  • docs/tasks/state.toml records the current selected milestone.
  • docs/tasks/README.md defines the task-ledger schema and dispatch semantics.
  • docs/tasks/*.md, docs/tasks/on-hold/, docs/tasks/active/, docs/tasks/review/, and docs/tasks/done/ carry task lifecycle records.
  • docs/tasks/** carries open review-finding remediation records; REVIEW_FINDINGS.md is a tombstone for pre-migration links.
  • REVIEW.md defines review expectations.
  • Makefile builds pinned tools, userspace binaries, manifests, ISO images, QEMU targets, formatting checks, generated-code checks, and policy checks.
  • rust-toolchain.toml declares the Rust nightly channel, required targets, and rust-src; it does not pin an exact nightly by date or commit.
  • .cargo/config.toml sets the default bare-metal target and useful cargo aliases.

Schema and Shared ABIs

  • docs/abi-evolution-policy.md defines compatibility classes, schema ordinal rules, ring-layout rules, version negotiation, and deprecation windows for externally visible ABI changes.
  • schema/capos.capnp defines capability interfaces, manifest structures, exceptions, ProcessSpawner, ProcessHandle, and transfer-related schema.
  • capos-abi/src/lib.rs defines small no_std ABI/policy constants shared by crates that should not depend on schema/config internals, including process quotas and credential policy limits.
  • capos-config/src/manifest.rs defines the host and no_std manifest model.
  • capos-config/src/ring.rs defines CapRingHeader, SQE/CQE structures, opcodes, flags, and transport error constants shared by kernel and userspace.
  • capos-config/src/capset.rs defines the read-only bootstrap CapSet ABI.
  • capos-config/src/cue.rs supports evaluated CUE-style manifest data.
  • capos-config/src/credential_policy.rs re-exports credential policy limits; full PHC parsing is enabled by the credential-validation feature for bootstrap validators that need credential checks.
  • capos-config/tests/ring_loom.rs models bounded ring protocol behavior with Loom.

Validation: cargo test-config, cargo test-ring-loom, make generated-code-check.

Shared Pure Logic

  • capos-lib/src/elf.rs parses ELF64 images for kernel loading and host tests.
  • capos-lib/src/cap_table.rs implements CapId, capability-table storage, stale-generation checks, grant preparation, transfer transaction helpers, commit, rollback, and the CapTable quota constants sourced from capos-abi.
  • capos-lib/src/frame_bitmap.rs implements the host-testable physical frame bitmap core.
  • capos-lib/src/frame_ledger.rs contains a bounded frame-grant helper kept for host-test coverage; current MemoryObject accounting charges CapTable::ResourceLedger.
  • capos-lib/src/lazy_buffer.rs provides bounded lazy buffers used by ring scratch paths.
  • capos-lib/src/iso9660.rs is the pure ISO 9660 primary-volume-descriptor and directory-record parser the kernel boot-ISO driver (kernel/src/iso/) delegates to; fuzz target iso9660_volume.
  • capos-lib/src/storage_format.rs holds the pure CAPOSRO1 (rofs), CAPOSST1 (disk_store), and CAPOSWF1 (writable_fs) mount parsers the kernel storage cap backers delegate to, including the shared record-layout constants the kernel writers reuse; fuzz targets storage_rofs_mount, storage_disk_store_mount, storage_writable_fs_mount.

Validation: cargo test-lib, cargo miri-lib, make kani-lib, fuzz targets under fuzz/fuzz_targets/.

Kernel

  • kernel/src/main.rs is the boot entry point, hardware setup sequence, manifest parsing path, and boot-launched service creation path. run_init resolves PID 1 from the kernel-embedded boot::INIT_ELF when initConfig.init.binary == capos_config::RESERVED_INIT_BINARY_NAME ("init") and otherwise from SystemManifest.binaries; for the embedded case it also injects the embedded image into the ProcessSpawner binary set under the reserved name so child spawns of init resolve.
  • kernel/src/boot.rs exposes boot::INIT_ELF: &[u8], the PID 1 init image packaged at build time. kernel/build.rs reads the prebuilt init/ artifact (CAPOS_INIT_ELF, with a conventional-path fallback) and generates the include_bytes! static; init/ stays a standalone crate (byte packaging, not linker merging).
  • kernel/src/spawn.rs loads user ELF images, creates process state, maps bootstrap pages, and enqueues spawned processes.
  • kernel/src/process.rs defines Process, Thread, ThreadState, per-thread kernel stacks, park waiter storage, and userspace CPU context.
  • kernel/src/sched.rs implements the single-CPU scheduler, timer-driven preemption, blocking cap_enter, direct IPC handoff, ParkSpace wait/wake, and deferred cancellation wakeups.
  • kernel/src/serial.rs implements COM1/COM2 UART setup, manifest-driven console-vs-terminal routing, and kernel print macros.
  • kernel/src/pci.rs implements early PCI config-space access through legacy I/O ports and ACPI MCFG/PCIe ECAM, with QEMU diagnostics for the current virtio-net and Q35 discovery paths, plus reusable memory-BAR subregion validation, kernel MMIO mapping helpers for in-kernel drivers, and MSI/MSI-X capability metadata discovery plus typed MSI-X table programming.
  • kernel/src/device_interrupt.rs records the current kernel-owned virtio-net MSI-X config/RX/TX sources, their generation ids, route state, in-kernel driver owner, lock-free bounded device MSI vector-pool dispatch slots, and claimed-route reassignment/release without exposing userspace interrupt authority.
  • kernel/src/device_dma.rs holds the kernel-owned, fixed-size DMA pool accounting ledgers. The net-keyed VIRTIO_NET_DMA_POOL backs virtio-net’s DmaPage path; a focused single-queue VIRTIO_BLK_DMA_POOL (reusing the shared ActivePage/QueueAccount types, same generation-checked handle and scrub-before-free invariants) backs the virtio-blk request buffer. Each device’s VirtqueueDma seam impl delegates to its own pool’s keyed API.
  • kernel/src/dma_backend.rs (always compiled) records the boot-time IOMMU probe verdict and resolves the fail-closed DMA backend selection (direct IOMMU remapping only with a verified probe, else kernel-owned bounce buffers) per the “Cloud DMA Backend” contract in docs/dma-isolation-design.md, emitting the boot proof line.
  • kernel/src/device_manager/ holds bounded in-kernel PCI device ownership records. The full DDF surface (device records, DMA pools/buffers, MSI-X interrupts, NVMe brokered controller registers, IOMMU domain ledgers, virtio ring publication, proofs) compiles only under cfg(feature = "qemu") in qemu_full.rs; the MMIO-only surface used by cap::device_mmio exists in both builds, dispatching to stub.rs (one-slot parked-region DeviceMmio record) in the production non-qemu build.
  • kernel/src/nvme_storage_backend.rs (cfg(not(feature = "qemu"))) is the fail-closed activation gate for the always-built NVMe BlockDevice read arm: modeled on dma_backend, it resolves a production handle only when a brokered controller was discovered and a live device_mmio grant is staged, otherwise the block_device grant fails closed with a typed error.
  • kernel/src/virtio_transport.rs (always compiled) is the device-agnostic virtio modern-PCI transport host surface: capability/region discovery constants and bounded volatile MMIO accessors usable outside the qemu-gated legacy virtio path.
  • kernel/src/virtio.rs (cfg(qemu)) holds the legacy in-kernel virtio transport, now a qemu-only fixture: the non-qemu production build compiles kernel/src/virtio_stub.rs instead, whose typed negative results keep stale or fixture-only kernel networking call sites failing closed. It includes the virtqueue drivers used by the IOMMU remapping proof. Its pub(crate) mod transport is the device-generic layer: split-ring/common-config constants, the MmioRegion accessor, the VirtqueueDescriptorTracker, the VirtqueueDma DMA/notify seam, the seam-driven Virtqueue/DmaPage with their poll/submit/complete loop and the multi-descriptor submit_request_chain, and the device-id-parameterized discover_modern_transport. virtio-net is one seam caller (VirtioNetDma); virtio-blk is a second (VirtioBlkDma + VirtioBlkDriver, diagnose_virtio_blk_transport, the block_device_* request API behind the BlockDevice cap). Net-specific provider/proof methods stay in the parent module as impl Virtqueue<VirtioNetDma>.
  • kernel/src/iommu.rs (cfg(qemu)) programs the Intel VT-d legacy-mode remapping tables, drives the hardware-DMA translation/fault proof, and runs the register-based invalidation revocation cycle.
  • kernel/src/iso/ (cfg(boot_iso_read) / cfg(boot_iso) / cfg(qemu)) is the boot-time ISO reader for the Boot Binary ISO Layout track. AtapiDevice (gate 1) locates the legacy IDE ATAPI device and exposes a bounded read_sectors(lba, count, buf) over polled-PIO READ(12) packet commands with range/length validation. IsoFs (gate 2) is a read-only ISO 9660 driver layered on it: it parses the primary volume descriptor, walks directory records, and serves open_file(name) -> (lba, size) under /boot/bins/, validating every directory record and derived extent against the volume size before use (fail-closed BadVolume/NotFound/NotDirectory). boot_read_proof() reads the PVD (CD001) and boot_fs_proof() walks to /boot/bins/PAYLOAD.BIN and verifies its content, both behind boot_iso_read as the make run-boot-iso-read proof. The boot_source submodule (gate 4, cfg(boot_iso)) builds a validated (name, lba, size) registry from every declared manifest binary name (mapping each name to the ISO 9660 d-character form, e.g. capos-shell -> /boot/bins/CAPOS_SHELL) and reads ELF bytes on demand behind a device mutex; run_init and ProcessSpawnerCap consume it so the boot_iso kernel loads binaries from the ISO instead of embedded NamedBlob.data. Proofs: make run-boot-iso and the default make run-smoke. Under cfg(qemu) the always-on AtapiDevice/IsoFs surface (plus a qemu-gated block_size()/list_boot_bins() enumeration helper) also backs the read-only install-source fixture cap (kernel/src/cap/installable_image.rs).

Validation: cargo build --features qemu, make run-smoke, make run-spawn, make run-net, make run-iommu-remapping.

Kernel Architecture

  • kernel/src/arch/x86_64/gdt.rs sets up kernel/user segments and TSS state.
  • kernel/src/arch/x86_64/idt.rs handles exceptions and timer interrupts; CPL3 #PF/#GP/#UD/#DB/#BP faults terminate the whole owning process through sched::exit_current_thread_terminating_process (deferred whole-process termination when sibling threads are live; proof make run-user-fault), while CPL0 faults still halt the machine.
  • kernel/src/arch/x86_64/syscall.rs implements syscall MSR setup and entry.
  • kernel/src/arch/x86_64/context.rs defines timer context-switch state.
  • kernel/src/arch/x86_64/pic.rs and pit.rs configure legacy interrupt hardware.
  • kernel/src/arch/x86_64/ioapic.rs maps MADT I/O APICs and programs masked legacy IRQ routes from interrupt-source overrides.
  • kernel/src/arch/x86_64/lapic.rs programs the xAPIC LAPIC timer and IPIs.
  • kernel/src/arch/x86_64/smap.rs enables SMEP/SMAP and brackets user memory access.
  • kernel/src/arch/x86_64/tls.rs handles FS-base/TLS support.
  • kernel/src/arch/x86_64/pci_config.rs provides legacy PCI config I/O used by the higher-level PCI module alongside its ECAM backend.
  • kernel/src/arch/x86_64/percpu.rs, smp.rs, and tlb.rs provide per-CPU data, AP startup, and TLB shootdown for the SMP scheduler.

Kernel Memory

  • kernel/src/mem/frame.rs wraps the shared frame bitmap with Limine memory map initialization and global kernel access.
  • kernel/src/mem/paging.rs manages page tables, address spaces, permissions, user mappings, W^X enforcement, and address-space teardown.
  • kernel/src/mem/heap.rs initializes the kernel heap.
  • kernel/src/mem/validate.rs validates user buffers before kernel access.

Related docs: DMA Isolation, Trusted Build Inputs.

Kernel Capabilities

  • kernel/src/cap/mod.rs initializes kernel capabilities and builds the first service’s kernel-sourced bootstrap capability table.
  • kernel/src/cap/table.rs re-exports shared capability-table logic and owns the kernel-global table.
  • kernel/src/cap/ring.rs validates and dispatches ring SQEs.
  • kernel/src/cap/transfer.rs validates transfer descriptors and prepares transfer transactions.
  • kernel/src/cap/endpoint.rs implements Endpoint CALL, RECV, RETURN, queued state, cleanup, and cancellation behavior.
  • kernel/src/cap/console.rs implements serial Console.
  • kernel/src/cap/terminal_session.rs implements the session-scoped TerminalSession line-oriented terminal with bounded readLine, echo modes, and cancellation.
  • kernel/src/cap/boot_package.rs implements the read-only BootPackage manifest-size/chunked-read capability.
  • kernel/src/cap/manual.rs implements the read-only Manual capability: it parses the boot-packaged ManualCorpus blob (embedded as the manual-corpus named binary) and answers page/apropos/topics/section/describe/ buildInfo.
  • kernel/src/cap/log.rs implements the Phase 1 monitoring log surface: LogSink (write) and LogReader (read) over a shared bounded, drop-oldest kernel recent-record ring. The sink drops records below the boot-seeded SystemConfig.logLevel threshold and forwards accepted records to serial; the reader returns records at/after a cursor with LogFilter (minLevel/componentPrefix), nextCursor, and dropped (docs/proposals/system-monitoring-proposal.md).
  • kernel/src/cap/block_device.rs implements the BlockDevice CapObject (readBlocks/writeBlocks/info/flush). In the non-qemu production build the block_device source resolves to the userspace-brokered NVMe arm (BlockDeviceBackend::NvmeBrokered, gated by kernel/src/nvme_storage_backend.rs); the qemu build routes bounded inline-Data sector I/O to the kernel-owned virtio-blk driver in kernel/src/virtio.rs as a named fixture, not production storage (proof make run-virtio-blk). The cap is scoped to one device_index: the block_device source reaches the resolved non-target boot/storage disk, and block_device_target (KernelCapSource.blockDeviceTarget @44) reaches the manifest-selected PCI identity when it names a bound non-boot virtio-blk disk. A cap for one disk grants no authority over another. The kernel binds up to device_dma::MAX_VIRTIO_BLK_DEVICES (currently 2) virtio-blk devices, each with an independent driver/DMA-pool/interrupt-route instance (VirtioBlkDriver<const DEV> / VirtioBlkDma<const DEV> over VIRTIO_BLK_DMA_POOLS[DEV]); kernel/src/pci.rs enumerates each device with a device index (proof make run-multi-virtio-blk). Target grants fail closed when the selector is absent, mismatched, or names the resolved boot disk. Counts are bounded to one bounce-buffer page.
  • kernel/src/cap/readonly_fs.rs implements the read-only filesystem service: ReadOnlyFsDirectoryCap / ReadOnlyFsFileCap parse a fixed CAPOSRO1 on-disk layout read through the kernel-owned virtio-blk driver and serve Directory.list/open + File.read/stat; every mutating method fails closed. Granted via the read_only_fs_root KernelCapSource (returns a root Directory cap; qemu-gated, mounts at grant resolution and fails closed on a malformed/absent image). Host image builder tools/mkstore-image --readonly-fs; proof make run-storage-fs.
  • kernel/src/cap/persistent_store.rs implements the disk-backed persistent Store: DiskStoreCap serves the Store interface (put/get/has/ delete) over a fixed CAPOSST1 on-disk layout read and written through a read+write BlockSource seam. put bump-allocates a data extent, writes the blob and entry record, then rewrites the superblock last as the durability commit point; delete tombstones the entry slot, and a later space-exhausting put compacts live entries through a shadow generation before recommitting the canonical front generation; the mount validates the superblock and every entry extent in-bounds and fails closed on a malformed image. The Virtio BlockSource (qemu kernel) routes to the kernel-owned virtio-blk driver byte-identically (folding in the data_region_base_lba() offset) and mounts eagerly at grant resolution; the Nvme BlockSource (built under cloud_persistent_store_over_nvme_proof) reads/writes through a granted NVMe BlockDevice window op and defers its mount-parse to the first Store call. Granted via the persistent_store KernelCapSource (virtio arm qemu-gated; the third NVMe-proof arm resolves the live device_mmio handle). Host image builder tools/mkstore-image; reboot proof make run-storage-persist (two QEMU passes on one disk image); NVMe put-then-get proof make run-cloud-provider-persistent-store-over-nvme via kernel/src/cap/persistent_store_over_nvme_proof.rs.
  • kernel/src/cap/writable_fs.rs implements the disk-backed writable filesystem service: WritableDirectoryCap serves list/open/mkdir/remove/rename/ create and WritableFileCap serves read/write/stat/truncate/sync/ close over a fixed CAPOSWF1 on-disk layout (a flat node-record array with parent pointers + a bump-allocated data region) written through a BlockSource seam. The RAM tree is the working copy; each mutation write-through-commits in the order data sector → node-record sector → superblock. A filesystem-wide fail-closed single-writer policy admits one writer at a time. The Virtio BlockSource (qemu/installable kernels) routes to the kernel-owned virtio-blk driver byte-identically (folding the data_region_base_lba() offset) and mounts the singleton eagerly; the Nvme BlockSource (built under cloud_writable_fs_over_nvme_proof) reads/writes through a granted NVMe BlockDevice window op and defers the singleton mount-parse to the first Directory/File call. Granted via the writable_fs_root KernelCapSource (virtio arm qemu-gated; the third NVMe-proof arm resolves the live device_mmio handle), which mounts the process-wide singleton volume once and hands each grant a distinct writer id; fails closed on a malformed image. The NVMe write-then-read durability proof (make run-cloud-provider-writable-fs-over-nvme via kernel/src/cap/writable_fs_over_nvme_proof.rs, which supersedes and drops the persistent-store-over-NVMe proof) exercises both BlockDevice arms with the single-writer policy intact. The combined image builder tools/mkstore-image --writable co-locates the CAPOSST1 Store sub-volume (LBA 0) and the CAPOSWF1 filesystem sub-volume on one disk; reboot proof make run-storage-writable (two QEMU passes: mutate then verify both the filesystem and the store survive). A slot becomes live on the next mount only once the superblock’s bumped node_count is observed, so a poweroff in the record-written / superblock-pending window leaves an orphan slot the mount ignores. The proof-only storage_writable_recovery feature arms an induced forced poweroff in exactly that window (recovery_crash_after_record); bounded recovery proof make run-storage-writable-recovery (pass 1 commits then is kill -9d mid-allocation, pass 2 verifies recovery to a consistent tree with the interrupted allocation atomically absent). The same crash window is proven over the NVMe BlockDevice arm by make run-cloud-provider-writable-fs-over-nvme-recovery via kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs (a recovery cap-waiter clone that implies and supersedes the happy-path proof module/route/init); the cloud_writable_fs_over_nvme_recovery_proof feature widens the storage_writable_recovery crash-window cfg gate, and the host-built NVMe image (tools/mkstore-image --writable-nvme, empty superblock + root-only node table) is booted twice with -device nvme (no @20 seed). writable_fs::mount_config_root (qemu-gated) scopes a writable Directory to the system/config subtree for the boot-time data-region grant below.
  • kernel/src/cap/installable_image.rs implements the read-only install-source fixture (Installable System track item 5b): InstallableImageDirectoryCap serves list/open and InstallableImageFileCap serves read/stat/close over the booted CD-ROM ISO 9660 /boot/bins/ tree, reading through the kernel/src/iso/ boot_iso ATAPI/ISO 9660 driver behind a single shared-device mutex (so PIO does not interleave across CPUs). Every mutating method fails closed; a past-EOF read clamps to empty and an absent name is rejected, reusing the driver’s validate_extent/read_sectors range checks. Granted via the qemu-gated installable_image_source KernelCapSource (mounts the ATAPI volume and validates /boot/bins/ at grant resolution, failing the spawn closed on an absent/malformed medium). Physically scoped to the ATAPI CD-ROM, so it cannot reach the writable virtio-blk target disk (block_device_target/writable_fs_root). Consumer demo demos/installable-image-source/; manifest system-installable-image-source.cue; proof make run-installable-image-source.
  • demos/installable-system-install/ implements capos-system-install, the Installable System install flow (track item 6): under the read-only installable_image_source Directory and the target-scoped block_device_target BlockDevice selected by manifest PCI identity, it copies the packaged bootable boot-region head (BOOTHEAD.BIN) to LBA 0, writes the backup GPT (BOOTGPT.BIN) at the LBA read from the primary GPT header, and initializes an empty data region (DATAIMG.BIN, tools/mkstore-image --writable --empty-config) at the fixed cap::data_region_base_lba, validating ranges and verifying the read-back. It reads packaged files in 32 KiB windows (under the read-path reply scratch bound; see docs/tasks/done/2026-06-05/storage-file-read-reply-scratch-clamp.md) and zero-skips the FAT free space. tools/split-boot-region.py splits the mkdiskimage boot image into the head + backup GPT so only the populated prefix is packaged. Pass-1 installer manifest system-installable-install.cue; pass-2 installed manifest (baked into the boot region) system-installable-install-target.cue; harness tools/qemu-installable-install-smoke.sh; proof make run-installable-install (pass 1 installs into a second virtio-blk disk, pass 2 boots it standalone).
  • kernel/src/cap/mod.rs grant_data_region (proof-only installable_data_region feature) is the Installable System boot-time data-region mount: run_init best-effort grants init a system/config Directory (data-config) plus the persistent Store (data-store) over the auto-attached data disk, failing closed wholesale to the base manifest (caps unchanged, “no data region; base floor” diagnostic) when the disk is absent, malformed, or missing system/config. No new cap type or schema change. Proof make run-installable-data-region (seeded disk prints resolved contents; no disk and zeroed-superblock disk hit the base floor).
  • Installable System config-overlay compose/merge (track item 3): the SystemConfigOverlay capnp object + SystemManifest.extensionPoints (ManifestExtensionPoints) live in schema/capos.capnp; the typed decode, content-hash check, and compose_onto precedence (base-pins-win / overlay-adds-within-declared-extension-points / no-new-authority) live in capos-config/src/manifest.rs. init/src/main.rs apply_config_overlay reads system/config/overlay.bin from the granted data-config Directory, composes the overlay over the base plan, and falls closed to the base floor with [init] overlay rejected: <reason>. The tools/mkmanifest mkoverlay bin encodes overlays (filling the canonical hash) and tools/mkstore-image --writable --seed-overlay seeds them. Proof make run-installable-overlay.
  • Installable System generations + rollback + failed-boot auto-fallback (track item 4): userspace-only over the already-granted Store + writable system/config Directory, no schema or kernel change. init/src/main.rs run_generation_rollback_checks (gated by a base service named generation-proof) represents system-config generations as content-addressed Store objects keyed by SHA-256, tracks the known-good active pointer and a staged/attempting candidate pointer as monotonic-epoch marker files (gen-active/gen-candidate) in the writable config region, records a boot attempt durably before applying a candidate, auto-falls-back to the known-good generation when a candidate is left unconfirmed (the brick-proofing guarantee), promotes a confirmed candidate, rolls config back to a retained prior generation, and rejects a stale/replayed (lower-or-equal-epoch) pointer. A present-but-undecodable gen-candidate marker (the torn size-0 file a poweroff inside the CREATE|TRUNCATE rewrite window leaves, or garbage bytes) is discarded with a loud diagnostic and boot falls back to the known-good generation, while a corrupt gen-active marker takes a distinct loud FATAL refuse-to-boot path (the known-good generation is genuinely unknown). Manifest system-installable-generation.cue; proof make run-installable-generation boots a --seed-config disk three times (boot 1 exercises the mechanism and leaves an unconfirmed candidate; boot 2 proves across-reboot auto-fallback to the known-good generation, then leaves a torn size-0 candidate marker; boot 3 proves torn-marker recovery).
  • Installable System integrated bootable disk (track item 5, proof-only installable_disk feature, implies installable_data_region): one disk carries the boot ESP (GPT partition 1) and the co-located CAPOSST1 Store + CAPOSWF1 writable data region (GPT partition 2). kernel/src/cap/mod.rs data_region_base_lba returns the fixed partition-2 base LBA (264192) under the feature (0 otherwise), applied at the single persistent_store/ writable_fs read_range/write_range choke points so the kernel reads the region at that fixed tool/kernel-contract LBA without parsing the GPT. tools/mkdiskimage.sh --data-image/--data-offset-bytes fold the tools/mkstore-image --writable image into partition 2 and derive the ESP size from --esp-sectors (integrated disk uses the same 128 MiB ESP as the raw disk-image targets so a debug kernel fits). Manifest system-installable-disk.cue; proof make run-installable-disk boots one virtio-blk disk and asserts the data region mounts from the boot disk and a data-region-only overlay service runs.
  • kernel/src/cap/frame_alloc.rs implements FrameAllocator and MemoryObject.
  • kernel/src/cap/virtual_memory.rs implements per-process anonymous memory operations.
  • kernel/src/cap/timer.rs implements monotonic now and bounded sleep.
  • kernel/src/cap/wall_clock.rs implements the read-only WallClock.wallTime cap: UTC over a fixed boot base layered on the monotonic timebase, reporting the fail-closed untrusted ClockProvenance (Phase 1 fixed-boot-base variant; docs/proposals/time-and-clock-proposal.md).
  • kernel/src/cap/park_space.rs implements the process-local ParkSpace marker capability used by compact park (CAP_OP_PARK/CAP_OP_UNPARK) opcodes.
  • kernel/src/cap/network.rs implements the qemu-only NetworkManager, TcpListener, TcpSocket, and UdpSocket fixture caps. The kernel no longer depends on smoltcp; non-qemu manifests reject the kernel network_manager / tcp_listen_authority grant sources (fail closed), and the production socket path is the Phase C userspace network-stack process. The socket-backed SocketTerminalSession shim is retired: TcpSocket.intoTerminalSession fails closed in every dispatch path.
  • kernel/src/cap/process_spawner.rs implements ProcessSpawner and ProcessHandle.
  • kernel/src/cap/provider_cap_waiter_proof.rs (non-qemu, cloud_provider_cap_waiter_proof Cargo feature) stages a fully-programmed-route bootstrap Interrupt grant source and the InterruptCapWaiterProof cap whose Interrupt.wait injects one device_interrupt::handle_lapic_delivery dispatch and whose Interrupt.acknowledge retires the deferred LAPIC EOI; the cap’s on_release runs the masked-no-wake + reassign + stale-handle assertion chain before emitting cloudboot-evidence: provider-cap-waiter <token>. Mutually exclusive with cap::interrupt_grant_source_prod (default cloudboot path) and skips cap::provider_nic_bind_proof / cap::storage_bind_proof to keep the bound route live for the userspace cap-waiter handoff. Proof: make run-cloud-provider-cap-waiter.
  • kernel/src/cap/virtio_net_device_bringup_proof.rs (non-qemu, cloud_virtio_net_device_bringup_proof Cargo feature; mutually exclusive with cloud_provider_cap_waiter_proof and the userspace selected-write handshake proof) drives the bounded virtio status sequence kernel-side over the picked virtio-net PCI function (vendor 0x1af4, device 0x1000 / 0x1041): resolves the modern virtio PCI transport regions through virtio_transport::parse_modern_pci_transport_capabilities, maps the common configuration window through pci::map_bar_region, and drives reset → ACKNOWLEDGE → DRIVER → feature discovery + driver-feature selection (VIRTIO_F_VERSION_1 only) → FEATURES_OK → DRIVER_OK with a trailing reset on every exit path. Inline assertions gate the headline cloudboot-evidence: virtio-net-device-bringup <token> on the negotiated feature set, COMMON_NUM_QUEUES >= 2, DRIVER_OK observation, and the final reset returning device_status to 0. Marker carries queue_setup=not-attempted, tx_descriptor=not-published, userspace_cap=not-issued, msix_function_enable=not-toggled, device_autonomous_raise=not-attempted, live_cloud=not-attempted. Proof: make run-cloud-provider-virtio-net-bringup.
  • cloud_virtio_net_userspace_features_ok_proof (non-qemu; proof make run-cloud-prod-nic-driver-userspace-features-ok) is Phase C slice 1 of the userspace NIC relocation track. It makes cap::devicemmio_grant_source_prod stage the picked virtio-net modern common-config window as a selected-write DeviceMmio cap with registerWrite=selected-write-common-config-handshake; the userspace smoke drives reset -> ACKNOWLEDGE -> DRIVER -> FEATURES_OK over DeviceMmio.write32 and proves queue-address writes remain fail-closed. It is mutually exclusive with the kernel-owned virtio-net bringup, bundle, and queue-materialization proof chain over the same BDF/grant path, and with the cloud_nvme_readonly_bind_proof descendant chain because both stage a proof-specific production DeviceMmio grant source.
  • kernel/src/cap/virtio_net_tx_authority_bundle_proof.rs, kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs, and kernel/src/cap/virtio_net_msix_function_enable_proof.rs are the decomposed userspace-TX track. Each is non-qemu and gated by its own focused-proof Cargo feature (cloud_virtio_net_tx_authority_bundle_proof, cloud_virtio_net_tx_queue_materialization_proof, and cloud_virtio_net_msix_function_enable_proof respectively; the last implies the second so the bundle observer + production grant-source pickers + userspace bundle smoke stay compiled in across the chain). The bundle proof observes the three production grant sources (devicemmio_grant_source_prod, dmapool_grant_source_prod, interrupt_grant_source_prod) issuing one cap each into the spawned userspace bundle smoke and asserts same-BDF; the queue-materialization proof drives the kernel-side modern-virtio status sequence through DRIVER_OK and materializes one manager-owned TX virtqueue from three zeroed brokered frames, asserting register read-backs and post-reset clearance; the MSI-X function-enable proof extends that sequence with one canonical mask-first PCI MSI-X function-level enable (set FUNCTION_MASK, then ENABLE, then clear both) plus best-effort cleanup on every exit path. Each child emits its own headline marker (cloudboot-evidence: virtio-net-tx-authority-bundle <token>, cloudboot-evidence: virtio-net-tx-queue-materialization <token>, and cloudboot-evidence: virtio-net-msix-function-enable <token>); when the later feature is active the earlier markers are intentionally suppressed because their discipline labels would be inaccurate. Proofs: make run-cloud-provider-virtio-net-tx-authority-bundle, make run-cloud-provider-virtio-net-tx-queue-materialization, make run-cloud-provider-virtio-net-msix-function-enable.
  • kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs (Phase C slice 4a-ii, gated cloud_virtio_net_userspace_rx_bringup_proof) drives the first real RX DMA from the shim-owned vring: post_rx_descriptor writes the RX descriptor
    • avail over the shim’s retained RX vring physes at DMABuffer.submitDescriptor time, and drive_rx_dma (reached from the now-live provider_notify_doorbell_write_for_cap) rings the RX doorbell, submits a kernel-half SLIRP TX ARP stimulus over the retained TX physes, polls one real device->host completion, and resets the device (clearing the retained enabled flags to release the ring-buffer pins). Self-contained byte-level vring helpers are duplicated from virtio_net_polled_provider to protect run-net. The notify region is mapped kernel-side + the per-queue notify slot offsets captured by cap::devicemmio_grant_source_prod (rx_dma_notify_state). Proof make run-cloud-prod-nic-driver-userspace-rx-bringup (extended).
  • kernel/src/cap/null.rs implements the measurement-only NullCap.
  • kernel/src/cap/park_bench.rs implements the measurement-only ParkBench authority used by make run-measure.

Related docs: Capability Model, Authority Accounting.

Userspace

  • init/ is the standalone init process. In the spawn smoke, it uses ProcessSpawner, grants initial child capabilities, waits on ProcessHandles, and checks hostile spawn inputs.
  • capos-rt/src/entry.rs owns the runtime entry path and bootstrap validation.
  • capos-rt/src/alloc.rs initializes the userspace heap.
  • capos-rt/src/syscall.rs provides raw syscall wrappers.
  • capos-rt/src/capset.rs provides typed CapSet lookup helpers.
  • capos-rt/src/ring.rs implements the safe single-owner ring client, out-of-order completion handling, transfer descriptor packing, and result-cap parsing.
  • capos-rt/src/client.rs implements typed clients for Console, TerminalSession, BootPackage, ProcessSpawner, ProcessHandle, and Timer. The client-side methods are generic over Transport; result-cap-adopting methods stay on the concrete RuntimeRingClient.
  • capos-rt/src/transport.rs defines the Transport seam (the client-side CALL/completion/RELEASE ring operations) and the in-system RingTransport (RingClient viewed through the seam). A host remote transport is a later slice; see docs/backlog/capos-sdk-dual-transport.md.
  • capos/ is the front-door SDK facade crate: for the default ring feature it re-exports the capos-rt runtime, typed clients, the entry_point! macro, and a prelude. The remote feature is reserved. Standalone, like capos-rt.
  • capos-rt/src/pollselect.rs is the pure POSIX poll/select bridge: SocketReadiness -> poll revents (POLLIN/POLLOUT/POLLHUP/POLLERR/ POLLNVAL) and select set membership, plus unsupported_request_bits for fail-closed flag handling. Shared by the libcapos-posix C surface and the posix-socket-poll-select-smoke proof. Proof: make run-posix-socket-poll-select.
  • capos-rt/src/panic.rs provides the emergency Console panic output path.
  • capos-rt/src/bin/smoke.rs is the runtime smoke binary used by focused runtime proofs rather than the default boot manifest.
  • capos-service/src/lib.rs is the standalone no_std service lifecycle layer above capos-rt; slice 1 exposes ServiceMain, ServiceRuntime, and ordered initialize/dependency-wait/ready/run/drain/shutdown/cleanup phases.
  • shell/src/main.rs is the native capability shell, built as the standalone capos-shell crate and packaged by system.cue, system-shell.cue, and the focused login manifests.

Validation: make capos-rt-check, make run-smoke, make run-spawn, make run-shell, make run-terminal. The former Telnet fixture is retired with the qemu-only kernel TCP listener.

Standalone C and WASI Substrates

These are standalone crates (not workspace members) built by the Makefile.

  • libcapos/ builds libcapos.a, a no_std Rust staticlib exposing the capos-rt syscall/ring/CapSet path and typed Console/Timer/ EntropySource/VirtualMemory wrappers plus C heap shims to C consumers. Public header at libcapos/include/capos/capos.h. No POSIX surface.
  • libcapos-posix/ builds libcapos_posix.a, a no_std Rust staticlib layering a POSIX adapter over libcapos: per-process fd table, errno cell, historical UDP socket wrappers over the retired qemu-only kernel UdpSocket cap, clock over Timer, pipe/dup over Pipe, poll/select (poll.rs, <poll.h>/<sys/select.h>) over the capos-rt::pollselect readiness bridge with fail-closed unsupported-flag / EBADF / EINVAL handling, and fork/execve/waitpid via the recording-shim ProcessSpawner Move-grant path, plus the libc surface the dash port needs: stdio/string/stdlib/ctype helpers, strerror/qsort/umask/abort/ strtoll/strpbrk/lstat/getgroups/wait3/vfork, byte-order helpers (inet.rs), getrlimit/setrlimit (resource.rs), setlocale (locale.rs), times + tcgetattr, C-locale wchar/wctype multibyte (wchar.rs), the environ pointer, and the sys_siglist array. C headers (the namespaced source of truth) under libcapos-posix/include/capos/posix/ – including the dash-needed sys/types.h, termios.h, sys/resource.h, sys/times.h, wchar.h, wctype.h, locale.h, inttypes.h, and the decl-only sys/ioctl.h/sys/mman.h/arpa/inet.h/getopt.h/paths.h/ sys/param.h. libcapos-posix/sysroot/include/ is the -nostdinc bare-header sysroot (<stdio.h>, <unistd.h>, <sys/stat.h>, …) whose wrappers forward into that namespace; mirrored C ports (dash) build against it via the Makefile’s CAPOS_C_SYSROOT_INCLUDE flags on the capos-c-multitu-elf rule. Focused sysroot proof make run-c-libc-surface.
  • capos-wasm/ is the no_std WASI host adapter: a wasmi-backed Runtime, the wasm-host userspace binary, the Preview 1 import resolver, and the manifest-supplied wasm payload reader.
  • vendor/wasmi-no_std/ and vendor/dns-c-wahern/ are static-pinned, no-patches upstream snapshots consumed by capos-wasm/ and the POSIX DNS smoke; do not patch them in place (refresh procedure in each VENDORED_FROM.md).
  • vendor/dash/ is the mirror-as-is dash 0.5.13.4 snapshot (src/ stays byte-identical; capOS deviations live under patches/). Its capOS build pipeline lives outside the mirror under vendor/dash/capos/: the pinned config.h and gen-tables.sh (stages a patched source copy + runs the six host table generators). The Makefile dash target builds target/dash/dash.elf through capos-c-multitu-elf against libcapos.a + libcapos_posix.a.

Validation: make run-c-hello, make run-posix-pipe-smoke, make run-posix-printf, make run-wasm-host, make run-wasi-hello-rust, make run-wasi-random. The former POSIX DNS smoke is retired with the qemu-only kernel UdpSocket owner.

Demo Services

demos/ is a nested userspace smoke-test workspace. Each demo is a release-built service binary packaged into the boot manifest:

  • adventure-client, adventure-server, adventure-npc-shopkeeper, adventure-npc-wanderer
  • capos-chat, chat-bot, chat-client, chat-server
  • capset-bootstrap, console-paths, credential-store
  • endpoint-queue-limit-smoke, endpoint-roundtrip, ipc-server, ipc-client, in-flight-call-limit-smoke
  • frame-allocator-cleanup, memoryobject-shared-child, memoryobject-shared-parent
  • paperclips, paperclips-content
  • revocable-read, revocation-observer
  • ring-corruption, ring-reserved-opcodes, ring-nop, ring-fairness
  • service-common, shell-spawn-test, shell-typed-call
  • terminal-session, terminal-stranger
  • timer-smoke, timer-flood
  • tls-smoke, unprivileged-stranger, virtual-memory
  • user-fault-parent, user-fault-victim (user fault containment proof, make run-user-fault)

Shared demo support lives in demos/capos-demo-support/src/lib.rs and uses capos-rt for entry, allocator, syscall, CapSet, and panic support while keeping raw ring helpers for low-level transport smokes.

Validation: make run-spawn.

Manifest and Tooling

  • system.cue is the default init-owned boot manifest source. It imports the shared defaults package, boot-launches standalone init, and lets init start the shell, remote-session CapSet gateway, and resident services.
  • system-spawn.cue is the ProcessSpawner smoke manifest source.
  • system-smoke.cue is the scripted focused shell-led login/shell smoke manifest source.
  • system-chat.cue, system-adventure.cue, and system-paperclips.cue are focused resident-service and terminal-demo manifest sources.
  • system-memoryobject-shared.cue, system-revocable-read.cue, and system-measure.cue are focused regression/measurement manifest sources.
  • system-shell.cue is the focused anonymous-shell manifest source (no verifier, shell stays anonymous).
  • system-terminal.cue is the focused TerminalSession proof manifest source.
  • system-credential.cue is the focused CredentialStore proof manifest source.
  • system-login.cue is the focused password-login proof manifest source.
  • system-login-setup.cue is the focused first-boot setup proof manifest source.
  • tools/mkmanifest/ evaluates manifest input, embeds binaries, validates manifest shape, writes boot-manifest Cap’n Proto bytes, and provides cue-to-capnp for schema-aware CUE-authored data-message conversion. Its sibling mkoverlay bin encodes a SystemConfigOverlay from CUE into the system/config/overlay.bin bytes (filling the canonical content hash) for the installable-system config-overlay proof.
  • tools/manualc/ is the System Manual corpus compiler: it parses schema/capos.capnp for section-2 interface pages, reads the authored man corpus under docs/manual/man<section>/*.man, and emits the boot-packaged ManualCorpus blob. It fails the build if any in-tree capability interface lacks a section-2 page (i.e. a schema doc comment).
  • docs/manual/ holds the authored man-shaped corpus consumed by manualc (section 1 shell-command pages and section 7 concept pages); section-2 capability pages are generated from the schema, not stored here.
  • system-manual-smoke.cue is the focused Manual proof manifest source.
  • tools/agent-session-recaps/ contains private-session recap and raw-archive tooling for the agentic development experiment. The tools are tracked here; raw transcripts and generated recap stores stay outside the repo unless explicitly redacted and reviewed.
  • tools/check-generated-capnp.sh verifies checked-in generated schema output.
  • scripts/record_worklog.py emits per-task commit spans (from each task’s commits: list, falling back to task-file history) for the development timeline/Gantt; scripts/validate_backfill_tasks.py validates backfilled task-file frontmatter against the chunk’s real SHAs; scripts/check-md-links.py is the pre-commit broken-relative-link gate over all .md.
  • tools/githooks/ is the repo core.hooksPath (enabled with make hooks): prepare-commit-msg stamps provenance trailers (Plan-Item/Run-Id/ Agent-Kind) onto run-driven commits, alongside the git-lfs hooks.
  • tools/qemu-net-harness.sh runs the current QEMU net harness, with tools/qemu-net-smoke.sh asserting virtio-net transport, MSI-X metadata selection, kernel-owned MSI-X vector-pool allocation/programming, masked route-lifecycle proof, queue vector assignment, descriptor guards, ARP, and ICMP fixture lines.
  • fuzz/ contains fuzz targets for manifest Cap’n Proto decoding (with the production reader-options envelope), mkmanifest JSON conversion/validation, ELF parsing, Telnet IAC filtering, terminal line discipline, ring SQE wire validation, ISO 9660 PVD/directory-record parsing, the CAPOSRO1/ CAPOSST1/CAPOSWF1 storage mount parsers, and the capos-tls X.509 validity walk.

Validation: cargo test-mkmanifest, make generated-code-check, make fuzz-build, make fuzz-smoke.

Documentation

  • docs/capability-model.md is the current capability architecture reference.
  • docs/architecture/threading.md and docs/architecture/park.md record the accepted contracts and first implementation for in-process thread ownership and private ParkSpace authority.
  • docs/*-design.md files record targeted implemented or accepted designs.
  • docs/proposals/ contains accepted, future, exploratory, and rejected designs.
  • docs/research/ summarizes prior art (the capability-systems-survey.md synthesis plus per-system deep-dive reports).
  • docs/proposals/mdbook-docs-site-proposal.md defines the documentation site structure and status vocabulary used by the orientation pages.

First Chat Demo

The First Chat demo is the smallest runnable multi-process service demo in capOS. It boots a resident chat-server, a bounded chat-bot actor, and a native shell that can launch chat-client with explicit StdIO plus the broker-issued operator Chat endpoint grant.

The chat service is not a shell builtin. The shell only launches a client process and services that client’s StdIO endpoint while the client talks to the resident Chat endpoint. The focused manifest routes the kernel singleton chat_endpoint through init to chat-server, which is the same endpoint the broker facets into operator shell bundles.

Run It

Use the focused QEMU proof:

make run-chat

The scripted proof creates a volatile shell credential, rejects an attempted client endpoint relabel, launches chat-client under the authenticated shell session, sends one lobby message, checks membership with /who, observes the resident bot reply, quits the client, and exits the shell. The terminal transcript should include:

[chat] /join <channel>, /leave, /who, /exit, or plain text
[chat:#lobby]> hello from shell
[chat] #lobby <member-2> hello from shell
[chat] #lobby <member-1> [chat-bot] echo-bot heard you.

For default manual use, boot the ordinary playground:

make run

After login:

run "chat-client" with { stdio: client @stdio, chat: client @chat }

The default playground starts the resident chat-server and includes chat-client, but it does not start the bounded chat-bot proof actor. Use make run-chat when you need the one-shot echo-bot transcript.

For lower-level manual proof work, let make run-chat build the focused ISO, then boot capos-chat.iso yourself with the terminal UART attached to stdio and the console UART written to a log.

Useful client commands:

/join #other
/who
/leave
/exit
plain chat text

The resident bot is a bounded proof actor. If the operator waits too long before joining and sending the first lobby message, the bot can time out and exit; the chat client and server remain usable, but the bot reply will no longer appear.

What It Demonstrates

make run-chat and the manual terminal path described above currently show:

  • chat-server runs as a resident service exporting only the Chat endpoint;
  • chat-server keys membership by the opaque caller-session reference in the endpoint metadata, not by a caller-selected endpoint badge;
  • chat-bot is a separate participant with a delegated chat client endpoint and its own session-bound membership record;
  • capos-shell launches chat-client as an ordinary userspace process;
  • the foreground client receives only explicit StdIO and Chat grants;
  • caller-selected endpoint relabeling is rejected for delegated chat clients;
  • the handle supplied to join is request data only; the service assigns visible member-N labels and the handle does not select membership authority or sender identity;
  • lobby messages and bot replies are visible through the terminal transcript;
  • /who lists current channel members from the resident service;
  • client exit returns to the shell prompt, and the manifest child wait path observes clean shell and bot exits during normal completion.

Current Limits

This is not yet a distinct-local-user chat surface over Telnet or multiple terminals.

system.cue and system-chat.cue each boot one terminal-backed shell on the QEMU terminal UART, and the shell’s run command waits on the foreground client’s StdIO endpoint. Multiple chat-client runs can reuse the resident service, but the current manual flow is one foreground client at a time. The demo client still sends the hard-coded join handle shell for compatibility; the server ignores it for visible sender labels and does not request disclosed display/profile metadata from the session broker yet.

The default make run foreground shell now receives its shell bundle from AuthorityBroker, including a profile-scoped chat endpoint for operator shells. Guest and anonymous shells do not receive chat by default. An operator shell can therefore run the same chat-client command after login. This is still not a distinct durable user chat surface: the demo client joins with the hard-coded handle shell, the server assigns its own visible member label, and multiple terminal sessions still need a multi-session terminal host or network gateway before they are a real multi-user chat model.

To make distinct local users chat through Telnet or terminals, capOS still needs a multi-session terminal host or Telnet gateway that can keep multiple shell sessions alive, grant each session a broker-authorized chat root/facet, and disclose only the bounded display/profile metadata the user or broker explicitly permits.

Aurelian Frontier — Proof Slice

This page describes the current runnable proof slice of the Aurelian Frontier game. It is the end-to-end example of a capOS-native interactive application: a Roman-frontier text adventure with magic wards, warrior skills, wizard spells, NPC chat history, per-player state, and explicit capability grants. The wider game design lives in Aurelian Frontier; this page covers what runs today and how the QEMU smoke proves it.

Unlike a shell builtin, the game runs as ordinary userspace processes:

  • capos-shell launches adventure-client with only StdIO, Adventure, and Chat client capabilities.
  • adventure-server owns room, inventory, writ, combat, evidence, and effect state keyed by the endpoint caller-session scoped reference and epoch, while consuming validated read-only prototype mission content generated from adventure-content CUE source.
  • chat-server carries room messages and labels replayed room history so NPC actors do not treat old messages as fresh input.
  • adventure-npc-wanderer and adventure-npc-shopkeeper prove that separate actors can join the shared ashen-road channel without receiving ambient game authority.
  • adventure-scenario-test is a noninteractive capOS userspace test process with only Console and Adventure caps. It drives the custody scenario through AdventureClient RPCs and prints a console success marker.

Run It

Use the focused QEMU proof:

make run-adventure

The scripted run creates a volatile shell credential, launches the interactive adventure client for representative rendering and command coverage, and also asserts the resident adventure-scenario-test success marker and exit status for the complex custody path.

run "adventure-client" starts from a fresh expedition view by default. Use the client’s resume command to return to that session’s active expedition state instead of silently continuing it on launch.

For the default init-owned boot, start make run, log in or run setup, then use the MOTD compatibility commands:

spawn "chat-server" with { console: @console, chat: @chat } -> $chat
spawn "adventure-server" with { console: @console, adventure: @adventure, chat: client @chat } -> $adventure
spawn "adventure-npc-wanderer" with { console: @console, chat: client @chat } -> $wanderer
spawn "adventure-npc-shopkeeper" with { console: @console, chat: client @chat } -> $shopkeeper
run "adventure-client" with { stdio: client @stdio, adventure: client @adventure, chat: client @chat }

Normal launch commands omit legacy receiver selectors; delegated client endpoint identity is preserved by default. The adventure server derives player state from live session-bound endpoint caller metadata. The focused make run-adventure proof is the authoritative regression path. Its manifest uses selector-free Adventure and chat endpoint grants, while hostile and lower-level smokes retain explicit legacy selector fixtures for rejection coverage.

Current Mission

The implemented mission starts in fort_aurelian, crosses gate_yard and ashen_road, and reaches signal_tower, with under_vault present as a bounded site in the generated graph. The player can request and delegate a ward-writ, ask actors about the mission, quote and buy Maro’s route support, fight a ward-wraith, order Livia to expose the tower sigil, recover eagle-standard, record a wounded-legionary evacuation, seal the gate-yard breach, and get Iunia’s witness-certified temple-seal custody. Room views show canonical room, exit, actor, mob, and writ ids alongside the current mission and lead. Status and inventory separate survival, location, mission, physical items, writs, relic custody, marks, evidence, effects, and the next lead; status also prints the fixed smoke seed calendar (ashfall day 9, ash-wind, ward-static), a bounded seasonal resource count/cap summary, and a carried seasonal-resource forecast that names the next season’s degraded and expired counts. The current gameplay slice also lets active collectible seasonal resources be taken at their site; carried crops, fish, and forage participate in the next-season aging rule, while active repair-material resources can be harvested without being treated as fragile seasonal carry items. ask quartermaster about season-transition applies that aging rule: expired crops are removed, fish/forage degrade to explicit -degraded inventory tokens, and unknown or non-seasonal items stay unchanged. After the audited debrief grants Aurelian standing, the quartermaster can sell one bounded field-ration from the fixed-smoke per-expedition seasonal stock, spending that standing and adding the ration to inventory. Ordinary inventory is currently bounded to six slots. This is not a full seasonal economy or persistent calendar advance. Status also prints the active generated calendar event metadata for the fixed seed: the lantern-vigil festival’s actor-location, shop, witness, route, and rumor overlays. These event fields are metadata/status only; actor movement, event-driven shop mutation, witness blocking, route safety mutation, debrief branching, quests, gifts, and affection are not implemented. Status also prints active generated actor routine metadata for named actors, selected from the fixed calendar plus the current mission and emergency state: actor id, room id, routine kind/trigger, schedule/effect text, authority stance, and metadata-only gameplay stance. These routine records do not move actors or grant/revoke authority. Status also prints a concise regional frontier summary for the generated settlement, outpost, and route metadata, plus a concise regional market order-book summary for generated market books, buy/sell orders, crossed pure matches, and receipt-ledger ids. Market-eligible items are limited to ordinary seasonal resources, construction materials, and explicit outpost produced/consumed supplies. Writs, relics, actors, mobs, spells, skills, order tasks, and artifact/authority-gated blueprint outputs are excluded. The first live regional market transaction proof is bounded to one generated order-book match at a time: Adventure owns reserve, commit, cancel/release, stale-version rejection, idempotent replay, and ordered receipt facts behind existing quote, buy, and sell calls for explicit regional-market proof actions. Fresh committed field-ration matches now debit the player-local Aurelian chit balance once, decrement the seller ash_farm field-ration stock once, accrue two service-owned regional market fee chits once, credit two service-owned ash_farm seller-proceeds chits once, and deliver the committed quantity into the player expedition inventory only when ordinary inventory capacity can accept the full delivery; if capacity is full, replaying buy commit-field-ration from regional-market can apply the held delivery after ordinary items are dropped without spending, decrementing stock, accruing fees, or crediting proceeds again. Commit replay does not duplicate delivery, debit, outpost stock movement, fee accrual, or seller proceeds. Commit 29c065a9 at 2026-04-30 17:41 UTC added bounded order expiry to live matching and reserve: fixed-smoke day 65 keeps the field-ration proof active, while the scenario process proves a day-73 expired field-ration reserve releases without status, inventory, currency, outpost stock, fee, seller-proceeds, or delivery mutation. Commit 205fd6a0 at 2026-04-30 18:40 UTC added a bounded service-owned fee withdrawal proof: sell withdraw-fees to regional-market moves the two accrued regional-market fee chits into a service-owned treasury record exactly once, status exposes the treasury balance, replay is stable, and inventory, currency, outpost stock, seller proceeds, and delivery state do not mutate. Commit a547db3d at 2026-04-30 19:43 UTC adds a bounded receipt snapshot/restore proof: buy receipt-snapshot from regional-market clones the live regional market receipt facts, reconstructs a separate transaction state, replays the old field-ration commit against that reconstructed state, and returns proof success without mutating live status or inventory. Commit 4b44b32 at 2026-04-30 20:07 UTC adds a bounded settlement-side snapshot-view proof: buy settlement-snapshot from regional-market checks applied delivery, debit, stock, fee, proceeds, and withdrawal ids plus the current settlement balances, replays the committed field-ration fact and fee withdrawal as already applied, and returns proof success without mutating live status or inventory. The construction-job receipt snapshot work is scoped to pure Rust construction receipt snapshot semantics plus a size-constrained QEMU no-mutation probe. Pure adventure-content tests restore a separate ConstructionJobState from ordered field-repair job facts and validate malformed, over-capacity, and non-closed snapshot shapes. The focused QEMU path drives repair receipt-snapshot with field-engineer after the old completed repair only to check status/inventory stability and confirm live construction state and material stock are unchanged. The runtime command does not replay receipts into the live construction service and is not durable restart loading or a general construction persistence layer. It does not yet move NPC stores, broader outpost inventories, durable currency ledgers, durable seller-proceeds ledgers, profile ledger balances, fee ledgers, durable calendar advancement, durable crash-recovery state, or general economy behavior. Status also prints a construction foundation summary for generated blueprint, artifact, enchantment slot, and gate metadata; the first live construction-job proof is bounded to the field-engineer gate repair path: Adventure owns reserve/start, completion, cancel/release, stale-version rejection, idempotent replay, service-owned material holds and restores, and ordered job facts behind existing repair calls. It does not yet persist durable stock ledgers, replenish stock from outposts, update player output/currency inventories, advance job time, persist crash-recovery state, or provide general crafting/artifact gameplay. Status now also prints disabled-by-default optional fake-agent NPC metadata: budget count, supported fake-agent purpose count, aggregate session token budget, tool-call budget, and audit visibility. That is deterministic metadata for future optional chatter, hints, outpost summaries, personal routines, nonbinding shop flavor, and festival reactions, not live LLM gameplay or autonomous NPC authority. Status also prints the first local party foundation: a service-created local player label, the current party leader/members/pending invites, scoped ward-writ delegations, and recorded assists. Party labels are derived from live Adventure caller-session keys and do not disclose global session or principal data. The same service-local labels are used by the first physical-item transfer foundation, transfer <item> to <player>, which mutates both player inventories atomically inside Adventure, requires shared party membership, and refuses relic custody such as eagle-standard. Currency escrow and two-client transfer proof remain future work. Valid near-miss ids such as ward and wraith return explicit suggestions. The site graph, regional metadata, visible items, actors, mobs, aliases, objectives, mission text, leads, scripted proof-path metadata, named-item inspection text, and prepared-spell inspection text are authored in demos/adventure-content/content/prototype.cue, generated into demos/adventure-content/src/generated.rs, and validated by host tests before the server consumes them.

Useful commands in the current game:

look
resume
status
request ward
request ward-writ
accept ward-writ
delegate ward-writ to livia
order Livia to guard
go east
go east
say hello road
take scout-marker
quote route from maro
buy route from maro
quote regional-field-ration from regional-market
buy reserve-field-ration from regional-market
buy commit-field-ration from regional-market
buy reserve-incense from regional-market
sell cancel-incense to regional-market
sell withdraw-fees to regional-market
transfer scout-marker to player-1
repair gate with field-engineer
repair retry-field-repair with field-engineer
repair complete-field-repair with field-engineer
repair stale-field-repair with field-engineer
repair reserve-cancel-field-repair with field-engineer
repair cancel-field-repair with field-engineer
go north
order livia to dispel-sigil
inspect ward-wraith
cast ember-dart ward-wraith
skill strike ward-wraith
recover eagle-standard
ask wounded-legionary about evacuation
guard
cast shield-bind self
go south
go west
seal gate
go west
ask iunia about custody
inventory
go down

What It Proves

make run-adventure currently asserts:

  • shell-spawned game clients run with explicit StdIO, Adventure, and Chat grants;
  • ordinary adventure-client launch and look start fresh, while the explicit resume command reloads active expedition state through an Adventure cap call;
  • room joins, movement, physical item pickup, typed relic recovery, inventory, status, and representative failure messages are visible in the terminal transcript;
  • give, ask, request, accept, delegate, order, seal, recover, revoke, quote, buy, sell, trade, transfer, and repair are wired as typed adventure calls, not shell-special strings;
  • adventure-client exposes party create, party invite, party accept, party leave, party delegate, assist, and transfer <item> to <player> command paths backed by typed Adventure methods;
  • the party proof covers one-client party creation, missing local-player refusal paths for invite and assist, party status output, and help/client command availability; two-client successful accept, leave, delegate, and assist calls remain future work;
  • the transfer proof covers one-client unknown target, self-transfer, and missing-item refusals, with status or inventory unchanged as appropriate; successful two-player transfer remains covered by pure Rust state tests until the launcher/session harness can run two real Adventure clients;
  • canonical room, exit, actor, mob, and writ ids, room-view leads, common actor casing aliases, near-miss suggestions, and improved actor-task hints are visible in the terminal transcript;
  • combat status exposes hp, guard, fatigue, warrior stars, wizard circles, prepared spells, active mobs, mission state, physical items, writ authority, relic custody, marks, evidence, effects, fixed smoke seed calendar state, and objective state;
  • generated actor routine metadata is visible through status as structured status-only records filtered by the fixed calendar and current mission/emergency state;
  • generated regional market order-book metadata is visible through status as aggregate metadata and pure non-mutating crossed-match counts only;
  • market and construction coverage proves a Maro route quote, a successful route exchange, an Iunia clean-custody trade refusal that names the temple-seal gate and price, a bounded regional-market reserve/commit/retry/stale/release/cancel proof where the server owns the transaction state and receipt facts, and a bounded field-repair construction job proof where the server owns job state, service-owned material hold/release facts, held-stock mutation, and terminal facts; shell-smoke coverage also keeps the full market command-help surface, including sell, visible;
  • delegated authority can expose the ward, repeated spell actions are idempotent, and eagle-standard recovery records bounded evidence in the interactive transcript;
  • the adventure-scenario-test process covers physical-item-only take and drop, carried seasonal resource pickup, quartermaster-triggered seasonal inventory aging, post-debrief seasonal ration purchase, Iunia custody denials, witness refusal, survivor evacuation, gate sealing, temple-seal custody, categorized evidence tokens, and under_vault access through real Adventure cap calls, and asserts the fixed calendar, seasonal carry forecast, regional market delivery/replay, construction foundation, construction-job denial/reserve/replay/open-conflict/complete/stale/release/ reserve-after-release paths, agent NPC budget, and one-client party status lines through real Adventure cap calls;
  • the two-client local co-op proof remains open because the current focused manifest/session launcher path does not yet provide two distinct live Adventure caller-session keys without faking them inside one process;
  • replayed room messages are labeled as history, and the named NPC actor proof accepts visible replies whether the player observes them live or through room-history replay after movement;
  • the read-only prototype content model rejects malformed room graphs, bad aliases, overlong text, empty proof paths, malformed construction metadata, and invalid agent NPC budget metadata in host tests.
  • make generated-code-check fails if the checked-in generated adventure content drifts from the CUE source or generator.

Design Context

The gameplay and future setting plan live in the Aurelian Frontier proposal. The proposal covers the Aurelian frontier setting with magic-warrior and wizard ranks, future mobs, portals, golems, logistics, campaigns, persistent shared world state, multiplayer, and how those mechanics map onto capability-native authority.

Paperclips Terminal Demo

The Paperclips terminal demo is a small clean-room incremental game inspired by the paperclip maximizer thought experiment and by Frank Lantz’s browser implementation of that premise. In the focused manifest it now runs as a Paperclips server plus terminal client launched through the native shell. The server is authoritative for generated content, resources, GameState, proof-command gating, unlock checks, and game-rule mutation. The terminal client owns StdIO, handles the transcript, renders help from server-provided command specs, plain status from server-provided PaperclipsStatusSnapshot data, and plain projects from server-provided project entries when connected to a server, and sends gameplay requests to the server through an explicit PaperclipsGame endpoint capability.

This is still an early client/server protocol. The server owns regular timer cadence and the current command list, while command execution still uses raw text and mostly returns transcript text rather than typed command invocations or structured UI events. In server mode, PaperclipsGame.status returns a PaperclipsStatusSnapshot for plain status, and the terminal client renders the familiar status text locally. PaperclipsGame.projects likewise returns the unlocked project list for local terminal rendering of plain projects, while project <id> still executes through the raw text/server-mutating command path. The backlog tracks broader structured state/events and moving unlocked command facets behind server-issued capabilities so a future web client or web-shell gateway can use the same game authority instead of reimplementing Paperclips logic.

No source code, CSS, images, generated tables, or copied resource files from the original browser game are checked into this repository. The implementation uses original Rust code and local CUE content in demos/paperclips-content/content/paperclips.cue. During development, the original site and a public mirror were inspected for license/provenance only; neither exposed a permissive license that would allow copying assets into capOS.

Reference sources:

Run It

Use the focused QEMU proof:

make run-paperclips

The scripted proof logs into the shell, launches the child process, drives the opening refusals and business loop, scales production through repeatable marketing and explicit sales, completes a business-phase project chain, asserts the transition to autonomous phase, completes a representative autonomous drone/factory scaling step, transitions into the cosmic phase, proves a bounded probe interval with replication and production, verifies that final conversion remains locked, and then checks clean child and shell exit. The accelerated proof transcript starts with an explicit proof-capability launch, then uses ordinary player commands plus proof-only acceleration and machine-status commands:

run "paperclips" with { stdio: client @stdio, game: client @paperclips_proof_game, proof_accelerator: @proof_accelerator }
status
buy autoclipper
buy wire 1000
buy marketing
make
run 10000
price 99
sell 1
price 25
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
...
project autoclipper-license
project background-jobs
status
run 5000
buy wire 2
run 600000
make
projects
project survey-drones
sell 60
buy marketing
...
project precision-rollers
project design-search
run 600000
project forecast-engine
project survey-drones
project material-harvesters
run 100000
project foundry-lines
run 1000
project mesh-coordination
run 600000
project seed-probes
run 600000
status --json
status
projects
exit

For default manual use, boot the ordinary playground:

make run

After login:

run "paperclips" with { stdio: client @stdio, timer: @timer }

The ordinary make run playground command uses the standalone fallback because the default manifest does not start the Paperclips server. The focused make run-paperclips manifest uses run "paperclips" with { stdio: client @stdio, game: client @paperclips_game, timer: @timer }, where the server owns game state and the client timer drains server-generated status messages while the player is idle at the prompt. The structured command-list, status-snapshot, and project-list methods do not change the default manifest or MOTD launch command.

Useful commands inside the demo:

status
projects
make
sell <n>
price <cents>
buy wire [bundles]
buy autoclipper [count]
buy marketing [count]
buy processor [count]
buy memory [count]
buy drone [count]
buy factory [count]
buy probe [count]
project <id>
help
exit

make starts exactly one manual paperclip. Manual work takes 500 configured milliseconds before the clip becomes available. Repeating make while work is pending is refused until the player completes the Background Jobs project; after that, repeated make commands reserve wire and queue manual jobs behind the active one. Time advancement reports completed manual cycles before the status update. Purchase counts are optional and default to one; explicit zero counts are rejected. Automation advances on configured millisecond intervals while the process is running. Normal player launches do not expose run <ms> or status --json; the focused QEMU proof passes an explicit proof_accelerator capability and uses those commands only as proof instrumentation. The shell rejects renaming an ordinary @timer grant into that proof slot. Blank input repeats the last non-empty command. The first autoclipper is granted by the Autoclipper License project, which costs cash and trust; repeatable buy autoclipper [count] purchases appear only after the license grants the starter autoclipper. Later-stage purchase commands such as buy drone, buy factory, and buy probe appear in help only after the corresponding automation path is unlocked. The projects list shows only unlocked technologies; in server mode that plain listing is rendered from structured server-provided project entries. Complete the listed projects and pay their shown costs to reveal later project chains.

The proof-only status --json command prints a single compact JSON object for scripted assertions when the process receives the proof_accelerator capability. Normal player launches do not advertise or accept it, and it stays separate from the structured plain-status snapshot used for terminal presentation. All fields are numeric and emitted in stable order. stage uses 0=business, 1=autonomous, 2=cosmic, and 3=complete; design and strategy are the two planning resources, and cosmic_matter maps to the universe-matter state.

Funds change only when clips are sold explicitly by default. Demand follows a bounded random walk during the business phase, then price modifies the current market size for sell <n>. A successful business-phase sale starts a short CUE-configured market-settlement cooldown, so repeated immediate sales are refused until timer/proof time advances. Wire is bought in CUE-configured bundles at a market price that drifts on a slower interval; repeated purchases add temporary price pressure that decays over later market updates. Repeatable marketing buys still spend funds, but each new level contributes more demand than the previous level. The CUE content owns the base marketing gain, walk thresholds, wire market thresholds, step sizes, sale cooldown, and deterministic generator parameters. It also has an autoSellEnabled rule for experiments that should sell during ms, but the checked-in demo keeps it disabled so market movement is visible.

Content Pipeline

Paperclips uses the same generated-content discipline expected for larger demos, but with a stricter runtime data path:

demos/paperclips-content/content/paperclips.cue
  -> cue export --out json
  -> tools/paperclips-content-gen
  -> schema-validated PaperclipsContent Cap'n Proto bytes in src/generated.rs
  -> paperclips-content deserializes the typed Paperclips schema at startup

The CUE file owns the game balance: initial state, purchase costs, millisecond intervals, explicit/automatic selling policy, demand rules, trust milestones, project costs, project labels/descriptions, production cadence, later-stage matter conversion and replication caps, manual-work pacing, unlock thresholds, and project effects. Rust owns mechanics, validation, command parsing, and the terminal adapter. make generated-code-check fails if the checked-in generated Cap’n Proto bytes drift from the CUE source.

Unlock Flow

The tech progression is data-driven by the project list:

  • retail phase starts with 10 wire, no cash, manual single-clip production, sales, and early wire purchases;
  • Autoclipper License grants the first autoclipper for cash plus trust, then Background Jobs enables queued manual make commands;
  • repeatable marketing investment raises dynamic demand, while later business projects improve autoclippers, unlock design search, and unlock Strategy generation;
  • Survey Drones moves the game into autonomous matter conversion;
  • harvester/foundry/mesh projects scale harvesting, production, and compute;
  • Seed Probes move the game into cosmic replication;
  • Final Conversion completes the run once reachable matter is exhausted.

What It Demonstrates

make run-paperclips currently shows:

  • capos-shell launches paperclips as a normal child process;
  • init launches Paperclips server services before the shell starts;
  • the terminal client receives only explicit StdIO and PaperclipsGame endpoint grants;
  • server-mode help is rendered from the Paperclips server’s structured command specs, so the visible command list follows server-side unlock/proof authority;
  • server-mode plain status is rendered by the terminal client from the Paperclips server’s PaperclipsStatusSnapshot, while proof-only status --json stays a separate server-gated command;
  • server-mode plain projects is rendered by the terminal client from the Paperclips server’s structured project list, while project <id> execution stays on the server-mutating text command path;
  • the normal and proof launches use separate server endpoints, so proof-only commands are decided by server-side authority rather than by client text;
  • the foreground shell services the child’s stdio bridge while the game runs, so the demo exercises real endpoint IPC between shell and child process;
  • the server’s timer capability drives regular automation without ambient clock access by the terminal client;
  • the Paperclips server owns generated content, GameState, unlock checks, proof-command gating, and game-rule mutation for the focused manifest;
  • a repeatable economic choice (buy marketing) changes the early business loop before automation is purchased;
  • representative Stage 1 refusal output remains legible: early locked buy autoclipper, insufficient-funds buy wire 1000, pending manual work, bulk manual rejection, a high-price sell 1 demand refusal, a no-wire manual production refusal, and a locked project survey-drones attempt are asserted in the focused transcript;
  • manual production and explicit sales fund Autoclipper License, which grants the first autoclipper and unlocks repeatable buy autoclipper;
  • repeatable demand investment remains a purchase path rather than a one-shot project, and the smoke asserts at least five marketing purchases before the phase transition path completes;
  • business-phase sales are paced by a timer-backed market-settlement cooldown, and the smoke asserts an immediate repeat sale is refused;
  • scaled business-phase production reaches the 10,000-clip trust threshold, then completes autoclipper-license, precision-rollers, design-search, forecast-engine, and survey-drones;
  • completing the chain is asserted by [done] project entries, the visible == autonomous phase == status line, Automation: 14 autoclippers, 1 drones, 0 factories, 0 probes, and the local matter grant;
  • the autonomous follow-up completes material-harvesters and foundry-lines, runs milliseconds, then asserts Automation: 14 autoclippers, 5 drones, 2 factories, 0 probes, lower local matter, and additional clip production;
  • the late-game follow-up completes mesh-coordination, then seed-probes, asserts == cosmic phase ==, visible probe replication, lower cosmic matter, and additional clip production, then asserts final-conversion remains locked;
  • the late-game proof also asserts a proof-only status --json line with compact, machine-readable numeric state while preserving the human transcript checks;
  • the Paperclips server maintains game state without ambient authority;
  • the pure rules layer in paperclips-content is host-testable separately from the terminal adapter and reads generated Cap’n Proto content data;
  • exiting the game closes stdio, returns to the shell, and lets the focused manifest halt through the normal debug-exit path.

This is now a coarse client/server game-state demo. It is not yet the final capability-management showcase: help, plain status, and plain projects are structured, command execution is still raw text, broader state/events remain future work, and unlocks are reflected in server-owned command specs/project lists/output rather than transferred command facets. That split is the intended path for a later web client or gateway that uses the same game capabilities.

Current Limits

The demo intentionally implements a compact terminal adaptation, not a browser-accurate port. It has no original artwork, CSS, JavaScript, exact project list, exact balancing, save file, market UI, tournament model, or complete original event text. The host tests cover early mechanics, project locking, the deterministic business-to-autonomous project chain, autonomous resource conversion caps, factory/drone scaling, cosmic probe replication, and completion gating, including a one-real-time-hour upper bound for normal creativity generation. The focused QEMU proof covers launch, the first production loop, one early automation purchase, representative Stage 1 refusal output, business-phase project chaining, the autonomous transition, one timer-driven autonomous scaling action, and a bounded cosmic probe interval. It is representative transcript coverage rather than an exhaustive full playthrough.

Future rule/content expansion is tracked in Paperclips Terminal Demo. New data-heavy content should migrate through mkmanifest cue-to-capnp: author bounded CUE, convert it to a specified Cap’n Proto root with pinned host tools, validate the result on the host, and keep runtime CUE parsing out of the demo.

Current Design Authority

The current capOS design lives in reader-facing architecture, capability, security, device, configuration, and status pages. Proposal documents remain important design history, but they stop being the primary place to patch a design after that design is implemented or accepted as the working baseline.

Stable Homes

Use these homes for current behavior and accepted contracts:

AreaCurrent-design home
Boot, manifest, init, processes, rings, IPC, session context, memory, schedulingdocs/architecture/
Capability model, authority accounting, ABI policydocs/capability-model.md, docs/authority-accounting-transfer-design.md, docs/abi-evolution-policy.md
Operator configuration and CUE overlaysdocs/configuration.md
DMA isolation, device authority, trusted inputs, panic surfacesdocs/dma-isolation-design.md, docs/devices/, docs/trusted-build-inputs.md, docs/panic-surface-inventory.md
Current status, roadmap, backlog, task lifecycledocs/status.md, docs/roadmap.md, docs/backlog/, docs/tasks/
Proposal status and archival decision recordsdocs/proposals/index.md and individual files under docs/proposals/

When a current-design home already exists, future implementation slices update that page. When none exists and the proposal has become the working design, create or extend a stable page in the nearest existing area instead of leaving the proposal as the only current reference.

Proposal Lifecycle

The proposal index classifies proposals with these roles:

  • Active design: near-term design work still being changed before or during implementation. It may remain the primary working document while the design is not stable.
  • Accepted design: selected direction. It can guide implementation, but any implemented subset needs a stable current-design page or an explicit pointer to the page that already owns the current contract.
  • Partially implemented: some behavior is in tree. The proposal must distinguish present behavior from planned behavior, and current pages should describe the implemented subset.
  • Implemented: the proposal is an archival decision record. Future changes update the stable current-design docs and code references first; the proposal changes only for archival status, links, or corrected history.
  • Superseded or Rejected: historical records. They should point at the replacement or rejection rationale and must not be cited as current behavior.

Initial Promotions

This repository already had stable homes for several implemented or accepted designs. The initial promotion set makes the weakest current-authority links explicit:

Proposal or decisionCurrent-design authorityDisposition
Session-Bound Invocation Contextdocs/architecture/session-context.md, with endpoint transport details in docs/architecture/ipc-endpoints.mdImplemented proposal becomes archival history.
Error Handlingdocs/architecture/error-handling.md, with ring transport details in docs/architecture/capability-ring.mdImplemented proposal becomes archival history.
System Configurationdocs/configuration.md and docs/architecture/manifest-startup.mdImplemented proposal stays as rationale and closeout history.
DMA Assurance Modeldocs/dma-isolation-design.mdAccepted design remains grounded in the stable DMA design page.
SMP and Scheduler Evolutiondocs/architecture/threading.md and docs/architecture/scheduling.mdAccepted design feeds current scheduler and threading contracts.

Follow-up promotions should focus on proposals whose implemented slices are large enough that readers still have to mine proposal text for current behavior. Good candidates include storage/naming, installable system, SystemInfo/System Manual, and userspace driver relocation once their current contracts settle further.

Boot Flow

Boot flow defines the trusted path from firmware-owned machine state to the first user processes. It establishes memory management, interrupt/syscall entry, capability tables, process rings, and the boot manifest authority graph.

Current Behavior

Firmware loads Limine, Limine loads the kernel and exactly one module, and the kernel treats that module as a Cap’n Proto SystemManifest. The kernel rejects boots with any module count other than one.

kmain initializes serial output, x86_64 descriptor tables, memory, paging, SMEP/SMAP, the kernel capability table, the idle process, PIC, and PIT. It then parses the manifest, validates the kernel-owned boot boundary, loads only initConfig.init.binary into a fresh AddressSpace, builds init’s bootstrap capability table and read-only CapSet page from initConfig.init.caps, enqueues init, and starts the scheduler.

Default boot uses the standalone init ELF as that init process. It receives the bootstrap authority needed to read BootPackage, validate the service graph, spawn child services, and supervise them. The foreground capos-shell is now an init-started service with the terminal, credential, session, audit, and broker capabilities needed for the local shell flow; it does not receive BootPackage or broad ProcessSpawner authority. Focused shell-led manifests such as system-smoke.cue and system-shell.cue still boot capos-shell directly as initConfig.init for narrow login/shell proofs until the run-target/init policy cleanup migrates them.

flowchart TD
    Firmware[UEFI or QEMU firmware] --> Limine[Limine bootloader]
    Limine --> Kernel[kmain]
    Limine --> Module[manifest.bin boot module]
    Kernel --> Arch[serial, GDT, IDT, syscall MSRs]
    Kernel --> Memory[frame allocator, heap, paging, SMEP/SMAP]
    Kernel --> Manifest[validate kernel manifest boundary]
    Manifest --> InitImage[parse and map init ELF]
    Manifest --> InitCaps[build init CapTable and CapSet page]
    InitImage --> InitProcess[create init Process and ring]
    InitCaps --> InitProcess
    InitProcess --> Scheduler[start round-robin scheduler]
    Scheduler --> Init[enter init]
    Init --> DefaultPath[default init-owned service graph]
    DefaultPath --> Shell[spawn capos-shell service]
    DefaultPath --> Gateway[spawn remote-session gateway and resident services]
    Init --> SpawnPath[focused system-spawn executor path]
    SpawnPath --> BootPackage[read BootPackage manifest]
    SpawnPath --> Spawner[spawn child services]
    Spawner --> Children[focused demo processes]

The invariant is that the kernel starts only initConfig.init after validating the kernel-owned manifest boundary, and no child service starts until mkmanifest/init validation has accepted service binary references, authority graph structure, and bootstrap capability source/interface checks.

Design

The boot path is deliberately single-shot. The kernel receives a single packed manifest and validates only the kernel-owned boot contract before creating init. Init then performs the userspace execution step: it reads manifest chunks from BootPackage, validates a metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources, and asks ProcessSpawner to load each child ELF into its own address space with its own user stack, TLS mapping if present, ring page, and CapSet mapping.

The default manifest (system.cue) now boots an init-owned local path: the kernel launches the standalone init binary described by initConfig.init, and init spawns the shell, remote-session CapSet gateway, and resident services from initConfig.services. The shell mints an anonymous UserSession on startup through SessionManager.anonymous(), receives an empty-allowlist anonymous launcher from the broker, and waits at its own interactive prompt. The user types login (or setup on a fresh image) to upgrade in place. The smoke and shell manifests still provide focused shell-led proofs, while system-spawn.cue remains the focused init-owned graph retained for ProcessSpawner validation.

Invariants

  • Limine must provide exactly one boot module, and that module is the manifest.
  • Kernel manifest validation must complete before init is enqueued, and init BootPackage validation must complete before any child service is spawned.
  • Service ELF load failures roll back frame allocations before boot continues or fails.
  • Kernel page tables are active and HHDM user access is stripped before SMEP/SMAP are enabled.
  • The kernel passes _start(ring_addr, pid, capset_addr) in RDI, RSI, and RDX.
  • CapSet metadata is read-only user memory; the ring page is writable user memory.
  • QEMU-feature boots halt through isa-debug-exit when no runnable processes remain.

Code Map

  • kernel/src/main.rs - kmain, manifest module handling, validation, boot-only-init loading, process enqueue, halt path.
  • kernel/src/spawn.rs - ELF-to-address-space loading, fixed user stack, TLS mapping, Process construction helpers.
  • kernel/src/process.rs - process bootstrap context, ring page mapping, CapSet page mapping.
  • kernel/src/cap/mod.rs - bootstrap capability resolution and CapSet entry construction for init.
  • capos-config/src/manifest.rs - manifest decode and schema-version storage.
  • capos-config/src/validation.rs - graph/source/binary validation policy.
  • tools/mkmanifest/src/lib.rs - host-side manifest validation and binary embedding.
  • system.cue and system-spawn.cue - default and spawn-focused boot graphs.
  • limine.conf and Makefile - bootloader config, ISO construction, QEMU targets.

Validation

  • make run-smoke validates the scripted focused shell-led login path: single capos-shell init boot from system-smoke.cue, password prompt, failed-auth redaction, successful shell launch, narrow shell bundle, and clean QEMU halt.
  • make run is the operator-facing interactive boot path with the terminal UART on stdio and console/debug output logged separately.
  • make run-spawn validates that the kernel boot-launches only the standalone init with Console, BootPackage, and ProcessSpawner, and that init validates BootPackage metadata before running the focused ProcessSpawner, Timer, IPC, and memory smokes.
  • cargo test-config covers manifest decode, roundtrip, and validation logic.
  • cargo test-mkmanifest covers host-side manifest conversion and embedding checks.
  • make generated-code-check verifies checked-in Cap’n Proto generated output.

Open Work

  • The run-target/init-policy backlog still needs to migrate remaining focused shell-led manifests onto standalone init or explicitly preserve them as compatibility smokes.
  • A future manifest-loader or mkmanifest gate should reject accidental non-init default boot graphs once all focused exceptions are reconciled.

Manifest and Service Startup

The manifest is the boot package and init configuration. It names embedded binaries, the single kernel-launched init process, kernel boot parameters, and the init-owned service graph used by focused executor manifests.

Current Behavior

tools/mkmanifest requires the repo-pinned CUE compiler, evaluates system.cue, embeds declared binaries, validates binary references and the init-owned authority graph under initConfig, serializes SystemManifest, and places manifest.bin into the ISO. The kernel receives that file as the single Limine module. The diagram below is intentionally large: it separates the default init-owned boot path from the focused spawn-proof path.

flowchart TD
    Cue[system.cue or system-spawn.cue] --> Mkmanifest[tools/mkmanifest]
    Binaries[release userspace binaries] --> Mkmanifest
    Mkmanifest --> Manifest[manifest.bin SystemManifest]
    Manifest --> Limine[Limine boot module]
    Limine --> Kernel[kernel parse and validate]
    Kernel --> InitCaps[init CapTable and CapSet page]
    InitCaps --> Init[enter initConfig.init process]
    Init --> ShellPath[default system.cue: spawn shell/remote CapSet gateway/services]
    Init --> SpawnPath[focused system-spawn.cue: standalone init executor]
    SpawnPath --> BootPackage[BootPackage.readManifest chunks]
    BootPackage --> Plan[capos-config ManifestBootstrapPlan validation]
    SpawnPath --> Spawner[ProcessSpawner.spawn]
    Spawner --> Children[init-spawned child processes]

The default manifest starts only initConfig.init from the kernel, and that process is now the standalone init ELF. Init receives the bootstrap authority needed to read BootPackage, validate initConfig.services, spawn the foreground shell, remote-session CapSet gateway, resident chat service, and other default services, then wait according to the manifest policy. The shell is an init-started service; it receives terminal, credential-store, session-manager, audit-log, and authority-broker caps, mints its own anonymous UserSession, and waits for an explicit login or setup command before upgrading. It never holds BootPackage or broad ProcessSpawner authority.

Focused shell-led manifests such as system-smoke.cue and system-shell.cue still put capos-shell directly in initConfig.init for narrow login/shell proofs. That compatibility path is tracked by the run-target/init-policy backlog and should not be confused with the default system.cue boot path.

The focused system-spawn.cue manifest still puts the standalone init ELF in initConfig.init. There, init receives ProcessSpawner, a read-only BootPackage cap, and Console. It reads bounded manifest chunks into a metadata-only capos-config::ManifestBootstrapPlan, validates binary references, authority graph structure, exports, cap sources, and interface IDs, then spawns the focused smoke services. Low-level spawn grants still model receiver selectors for hostile and compatibility proofs, but normal shell client @... grants omit selector syntax and preserve delegated client endpoint identity. Raw parent-capability grants must preserve the source hold metadata, endpoint-client grants may mint selectors only from an endpoint owner or a ProcessSpawner-returned parent endpoint facet without widening it to server authority, and kernel-source Endpoint, FrameAllocator, VirtualMemory, Timer, ThreadControl, ThreadSpawner, and EntropySource grants mint fresh child-local caps without receiver selectors. QEMU-only PersistentStore grants mount the root store through the same child-local kernel-source path when a focused proof manifest names that source. Endpoint kernel grants also return parent-side client facets as ProcessSpawner result caps so init can wire later service-sourced imports without ever holding child endpoint owner caps.

mkmanifest cue-to-capnp is the adjacent general conversion path for CUE-authored data that should not become part of SystemManifest. It evaluates the input with the same pinned CUE compiler, package mode, tag injection, and CAPOS_CUE_TAGS handling as the manifest path, then passes the exported JSON to the pinned Cap’n Proto compiler through capnp convert json:binary. The caller supplies the .capnp schema file, root struct type, output path, and optional Cap’n Proto import paths. This is schema-aware serialization for data messages rooted at arbitrary specified structs; it is not a live capability or interface-object serialization path.

Design

Manifest validation has three layers:

  • Kernel bootstrap references: binary names are unique, initConfig.init.binary resolves, referenced payloads are non-empty, and init kernel cap sources match their expected interface IDs.
  • Init-owned binary references: initConfig.services[*].binary references resolve before the executor spawns children.
  • Init-owned authority graph: service names, cap names, export names, and service-sourced references are unique and resolvable; re-exporting service-sourced caps is rejected.
  • Init-owned cap sources: expected interface IDs match kernel sources or declared service exports.

Kernel startup now resolves only initConfig.init.caps. Init performs service execution in two userspace passes. The preflight pass walks initConfig.services in manifest order, resolves kernel and service-sourced caps against init grants and prior exports, and rejects an unstartable graph before spawning children. The spawn pass grants caps in declaration order, records declared exports, keeps owned parent client facets for exported child endpoints, and attenuates endpoint exports to client-only facets for importers. After every child is spawned, init drops and flushes those parent facets before waiting on children; a dropped init facet therefore cannot owner-cancel queued, pending, or in-flight child endpoint state.

Invariants

  • The manifest is schema data plus an init config tree, not shell script or ambient namespace.
  • Omitted cap sources fail closed.
  • Cap names within one service are unique and are the names userspace sees in CapSet.
  • Service exports must name caps declared by the same service.
  • Service-sourced imports must reference a declared service export.
  • Endpoint exports to importers must be attenuated to client-only facets.
  • Init must not hold endpoint owner caps for child-local manifest endpoints.
  • expectedInterfaceId checks compatibility; it is not the authority selector.
  • Legacy receiver metadata travels with cap-table hold edges and endpoint invocation metadata. Spawn-time client endpoint minting may carry the requested child selector only from owner or trusted parent endpoint result sources instead of copying the parent’s hold selector. Client facets received through ordinary spawn grants are not selector-minting authority for later spawns. Caller-selected endpoint badges are transitional compatibility state; session-bound invocation context plus broker-granted service roots/facets is the target shared-service authority model.

Code Map

  • schema/capos.capnp - SystemManifest, NamedBlob, SystemConfig, KernelCapSource, and generic CueValue storage for initConfig.
  • capos-config/src/manifest.rs - manifest structs, initConfig CUE parsing, capnp encode/decode, metadata-only ManifestBootstrapPlan, and schema-version storage.
  • capos-config/src/validation.rs - kernel bootstrap, init-owned graph, binary-reference, and capability-source validation policy.
  • tools/mkmanifest/src/lib.rs and tools/mkmanifest/src/main.rs - host-side manifest build pipeline, binary embedding, and general CUE-to-Cap’n Proto data-message conversion.
  • kernel/src/main.rs - kernel manifest module parse and validation.
  • kernel/src/cap/mod.rs - bootstrap cap creation and CapSet entry construction for init.
  • kernel/src/cap/boot_package.rs - read-only manifest-size and chunked manifest-read capability.
  • kernel/src/cap/process_spawner.rs - init-callable spawn path for packaged boot binaries.
  • capos-rt/src/client.rs - typed BootPackage and ProcessSpawner clients.
  • init/src/main.rs - BootPackage manifest reader, graph preflight, generic spawn loop, hostile spawn checks, and child waits.
  • system.cue and system-spawn.cue - default init-owned login/service graph and focused init-owned spawn manifests using initConfig.

Validation

  • cargo test-config validates manifest decode, CUE conversion, graph checks, source checks, and binary reference checks.
  • cargo test-mkmanifest validates host-side manifest conversion, embedded binary handling, pinned CUE path/version checks, pinned Cap’n Proto path/version checks, and schema-aware JSON-to-binary conversion through capnp convert when CAPOS_CAPNP is available.
  • make run-smoke validates the focused shell-led scripted login manifest: single capos-shell init boot from system-smoke.cue, failed-auth redaction, successful password auth, broker-issued shell launch, terminal isolation, and clean halt.
  • make run is the operator-facing interactive boot path with the terminal UART on stdio and console/debug output logged separately.
  • make run-spawn validates the narrower system-spawn.cue graph: the kernel boot-launches only standalone init, init validates BootPackage metadata, ProcessSpawner launches each focused child service, grants Timer to the timer smokes, and init waits for them.
  • make generated-code-check validates schema-generated Rust stays in sync.

Open Work

  • The run-target/init-policy backlog still needs to migrate remaining focused shell-led manifests or preserve them as explicit exceptions, then add a manifest-loader or mkmanifest guard against accidental non-init default boot graphs.
  • Service object identity migration still needs to retire caller-selected endpoint badge syntax from normal manifest paths. Normal shell paths already reject explicit client-grant selector syntax; low-level hostile fixtures and manifest-scoped non-identity encodings such as TCP listen ports remain separate cases.

Process Model

The process model defines how capOS represents isolated user programs, how they receive authority, how they enter and leave the scheduler, and how a parent can observe a child.

Current Behavior

A Process currently owns a user address space, a per-process capability table, a ring scratch area, a mapped capability ring, an optional read-only CapSet page, private thread/kernel-stack ledgers, and one or more Thread records. Process IDs are assigned by an atomic counter. The scheduler names current execution, run queues, direct IPC handoff, and blocking waiters with generation-checked ThreadRef values. Each thread owns its kernel stack, saved CPU context, FS base, and cap_enter blocking state, while address space, capability table, ring, CapSet, and resource accounting stay process-owned.

ELF images are loaded into fresh user address spaces. PT_LOAD segments are mapped with page permissions derived from ELF flags, the user stack is fixed at USER_STACK_BASE (0x100_0000 as of WASI Phase W.2 sub-slice 1; see capos-config/src/process_layout.rs for the canonical layout) with a linker-enforced image limit below it, and PT_TLS data is mapped into a per-process TLS area below the ring page. The process starts from a synthetic CpuContext that returns to Ring 3 with iretq.

ProcessSpawner lets a holder spawn packaged boot binaries, grant selected caps to the child, and receive result caps. Every successful spawn returns a non-transferable ProcessHandle; child-local endpoint kernel grants also return parent-side client facets so a supervisor can wire imports without sharing endpoint owner authority. ProcessHandle.wait either completes immediately for an already-exited child or registers one waiter. Child-local ThreadControl grants give runtimes ownership of their current FS base and current-thread exit. Child-local ThreadSpawner grants let a process create additional in-process threads and receive process-local ThreadHandle result caps for join, detach-on-release, and exit-code observation.

Design

Process construction separates image loading from capability-table assembly. Default boot maps only init in the kernel and gives it a bootstrap CapSet. Spawned children use the same image loading and Process creation helpers, but their grants are supplied by the calling process through ProcessSpawner. Init resolves service-sourced manifest imports against previously recorded exports before asking ProcessSpawner to create each child.

Each process starts with three machine arguments:

  • RDI - fixed ring virtual address (RING_VADDR).
  • RSI - process ID.
  • RDX - fixed CapSet virtual address, or zero if no CapSet is mapped.

Exit releases authority before the Process storage is dropped. The scheduler switches to the kernel page table before address-space teardown, cancels endpoint state for the exiting pid, completes any pending process waiter, and defers the final process drop until execution is on another kernel stack.

Future process lifecycle work should keep authority transfer explicit: parents should not gain ambient access to child internals, and child grants should come from named caps plus interface checks.

The 7.1.0 in-process threading contract is documented in In-Process Threading. It defines ThreadSpawner and ThreadHandle as process-local authorities, preserves ProcessHandle as the parent-facing whole-process lifecycle handle, and keeps process exit as the operation that releases shared capability authority.

Invariants

  • A process cannot access a resource unless its local CapTable holds a cap.
  • Bootstrap CapSet metadata is immutable from userspace.
  • A stale CapId generation must not name a reused cap-table slot.
  • ProcessSpawner raw grants require a copy-transferable cap or an endpoint owner cap; client-endpoint grants require an endpoint owner or ProcessSpawner endpoint result source and never add receive or return authority.
  • ProcessSpawner kernel-source Endpoint, FrameAllocator, VirtualMemory, ThreadControl, ThreadSpawner, and EntropySource grants are fresh child-local caps and cannot be badged. QEMU-only PersistentStore grants mount a Store cap through the child-local kernel-source path for focused persistence proofs. Endpoint kernel grants are exportable only through returned parent client facets, not through a shared owner cap in init.
  • ProcessHandle caps are non-transferable.
  • ThreadHandle caps are process-local, non-transferable, and observe only one thread in the same process.
  • At most one waiter may be registered on a ProcessHandle.
  • Process exit releases cap-table authority before the kernel stack frame is freed.

Code Map

  • kernel/src/process.rs - Process, bootstrap CPU context, ring/CapSet mapping, exit capability cleanup.
  • kernel/src/spawn.rs - ELF mapping, stack mapping, TLS mapping, process construction helpers.
  • kernel/src/sched.rs - process table, process handles, wait completion, exit path.
  • docs/architecture/threading.md - frozen 7.1.0 contract for process-owned versus thread-owned state, creation, FS-base, and join/exit behavior.
  • kernel/src/cap/process_spawner.rs - ProcessSpawnerCap, ProcessHandleCap, spawn grant validation, child-local kernel grants, child CapSet construction.
  • capos-lib/src/cap_table.rs - CapId generation and cap-table operations.
  • capos-config/src/capset.rs - fixed CapSet page ABI.
  • schema/capos.capnp - ProcessSpawner, ProcessHandle, and CapGrant.
  • init/src/main.rs - BootPackage manifest validation, generic spawn loop, child waits, and hostile spawn checks.

Validation

  • make run-smoke validates init-owned default service startup, ProcessSpawner, ProcessHandle.wait, child grants, exit cleanup, and clean halt.
  • make run-spawn validates the narrower ProcessSpawner graph for endpoint, IPC, VirtualMemory, FrameAllocator cleanup, and hostile spawn failures.
  • cargo test-lib covers CapTable generation, stale-slot, and transfer primitives.
  • cargo test-config covers CapSet and manifest metadata used to build process grants.
  • cargo build --features qemu verifies the kernel and QEMU-only paths compile.

Open Work

  • Add lifecycle operations such as kill and post-spawn grants only after their authority semantics are explicit.
  • Implement restart policy outside the kernel-side static boot graph.

Session Context

Session-bound invocation context is the current shared-service identity model. Capabilities decide what a process may invoke. The process session supplies the privacy-preserving subject context for the invocation. Request payloads, manifest strings, and legacy endpoint receiver metadata do not identify the caller and must not authorize service behavior by themselves.

Current Behavior

Every normal workload process has one immutable SessionContext installed through trusted spawn, session-manager, or broker paths. Endpoint CALL delivery includes a scoped caller-session reference plus freshness metadata by default. The server does not receive a global principal, account, profile, display name, auth source, or tenant field unless the call explicitly requests disclosure and the invoked service/facet has a matching disclosure scope.

The current endpoint ABI carries:

  • scoped_ref and scoped_ref_hi: a 128-bit opaque caller-session reference derived from a boot secret, endpoint service scope, and kernel session id;
  • epoch: a domain-separated freshness/audit value for the same service scope;
  • liveness/freshness state used to fail closed for stale ordinary sessions.

The reference is service-scoped and non-portable. A value observed by one service is not authority and is not a stable global identity in another service.

Authority Split

Capability possession answers whether a process may invoke a target. Session context answers who the invocation is attributable to, whether the session is fresh enough, which resource/accounting bucket should be used, and which subject facts may be disclosed.

The service decision is therefore layered:

  1. capability authority;
  2. invocation subject context;
  3. service-local policy and state.

For example, holding ChatRoot lets a process ask chat to join. The caller’s live session supplies the subject context. The chat service may key its per-session state by the opaque reference and may request bounded disclosure when its method contract and broker policy allow it.

Disclosure

Disclosure is opt-in and field-granular. A service receives broader subject facts only when both conditions hold:

  • the method or call shape explicitly requests the fields;
  • the invoked capability or broker-granted facet carries a service-scoped disclosure scope allowing those fields.

Without both, endpoint metadata stays opaque. Services that need display names, profile classes, or audit labels should request only those fields and treat them as service-local policy input, not as independent authority.

Transfer And Liveness

Cross-session capability transfer is allowed only when the transferred cap’s transfer scope permits it. A transferred cap carries invoke authority; the receiver’s session remains the invocation subject. service_regrant_only caps cannot cross sessions through raw copy, move, endpoint IPC, or spawn grants; a trusted service or broker regrant path must mint target-session authority.

Ordinary endpoint calls from logged-out or expired sessions fail closed. The current liveness implementation is a Live/LoggedOut state cell plus expiry; administrator revocation, recovery-only session modes, and renewal/recovery caps remain future lifecycle work. Fixed wall-clock expiry remains a bounded guardrail, not complete production interactive-session lifecycle UX.

Code Map

  • kernel/src/session_context.rs - kernel session records, liveness state, and scoped reference derivation.
  • kernel/src/cap/endpoint.rs - endpoint caller-session delivery and stale endpoint/session checks.
  • kernel/src/cap/transfer.rs and capos-lib/src/cap_table.rs - transfer-scope validation and rollback.
  • kernel/src/cap/session_manager.rs - session creation and UserSession result-cap minting.
  • kernel/src/cap/user_session.rs - UserSession capability behavior.
  • kernel/src/cap/restricted_launcher.rs and kernel/src/cap/authority_broker.rs - broker and launch surfaces that mint session-scoped bundles.
  • capos-rt/src/client.rs - runtime clients that observe session, endpoint, and logout behavior.
  • docs/architecture/ipc-endpoints.md - endpoint transport and transfer rules.
  • docs/architecture/process-model.md - spawn and process ownership model.

Validation

  • make run-session-context covers process-session invariants, default endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, and disclosure gating.
  • make run-capnp-chat-interop and the chat/adventure smokes cover ordinary service state keyed by live caller-session metadata instead of caller-chosen selectors.
  • make run-remote-session-capset-interop and focused remote-session UI smokes cover DTO gateway logout/close propagation.
  • make run-ssh-public-key-session covers UserSession.auditContext, explicit logout idempotence, and post-logout fail-closed reads.

Open Work

  • Administrator revocation, renewal/recovery, live proxy cleanup, and audit reason separation remain future lifecycle work.
  • Stable service-audit identity across endpoint replacement or service upgrade needs a future service-audit scope.
  • Delegated act-on-behalf-of subject context is a separate future design, not part of the completed session-bound invocation context milestone.
  • A dedicated result-cap move-source rollback proof is still needed before fixed expiry is treated as the whole production session lifecycle.

Design Grounding

The archival decision record is Session-Bound Invocation Context. The superseded direction is Superseded: Service Object Capabilities. Capability-system precedent is summarized in Capability-Based and Microkernel Operating Systems Survey.

In-Process Threading Contract

This page records the implemented contract for kernel-managed threads inside one process. The park authority contract is frozen separately in Park Authority. These pages are the handoff from the initial single-thread runtime checkpoint to same-process SMP work. The current slice has per-thread completion rings for spawned child threads, per-CPU WFQ run queues with bounded stealing, a caller-thread-bound SchedulingPolicyCap, and a SchedulingContext cap that records identity, bind/revoke, dispatcher budget charging/replenishment, bounded endpoint donation/return, and fixed depletion/deadline notification cells. Same-process sibling scheduling has formal accepted 1-to-2 evidence on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556 (capOS work 1.883x / total 1.787x, both clearing the configured 1.6x gates; matching Linux pthread baseline 1.988x/1.987x on the same physical-core pin set). The 2026-05-02 1-to-4 row was the diagnostic that justified Phase D’s fair-share enqueue policy: capOS sat at 1.566x/1.538x while Linux scaled to 3.963x/3.858x. Phase D now runs per-CPU WFQ queues with bounded stealing and manually accepted the 2026-05-10 1-to-4 diagnostic row (3.088x/2.700x) while the harness-enforced gate remains 1-to-2 work/total speedup; see docs/benchmarks.md for the full evidence table including historical pre-collapse rows. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the clockevent/deadline substrate, and bounded SQPOLL ring mode including the non-periodic SQPOLL producer-wake progress path; the first automatic nohz activation increment is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and SQPOLL-driven auto-nohz activation is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; generic full-nohz for ordinary budgeted compute leases and timeout-based auto-revoke are landed; policy-service AutoNoHz issuance remains future work.

Scope

The threading milestone changes the scheduler’s unit of execution from process to thread while keeping the process as the authority, address-space, and resource-accounting boundary. Same-process sibling scheduling on multiple CPUs is functional for per-thread-ring processes. The accepted 1-to-2 performance claim is now the formal capos-bench 5-run pair recorded on 2026-05-02 21:38 UTC against main commit 374f8556: capOS work 1.883x and total 1.787x clear the configured 1.6x gates; the matching Linux pthread baseline on the same physical-core pin set (0,1,2,3) records 1.988x/1.987x, validating the workload shape. The 2026-05-02 1-to-4 row was the diagnostic that justified Phase D: capOS sat at 1.566x/1.538x while Linux scaled to 3.963x/3.858x. Phase D now runs per-CPU WFQ queues with bounded stealing and its 2026-05-10 1-to-4 row (3.088x/2.700x) was manually accepted from recorded diagnostics; the harness-enforced gate remains 1-to-2 work/total speedup. Historical pre-collapse rows and the post-collapse 3-run diagnostic remain in docs/benchmarks.md for reference. Phase E adds the SchedulingContext cap (identity, caller-thread bind, revoke, budget charging/replenishment, bounded synchronous endpoint donation/return, and fixed depletion/deadline notification cells with drain observer results), and Phase F has landed the bounded SQPOLL ring mode plus the clockevent/deadline substrate. Automatic nohz activation, realtime admission, and privileged userspace scheduler-policy services remain later work.

This contract covers:

  • process-owned versus thread-owned state;
  • the initial thread creation ABI;
  • per-thread FS-base/TLS rules;
  • thread exit and join semantics;
  • the per-thread ring blocking and completion-routing contract;
  • the caller-thread-bound SchedulingPolicyCap and SchedulingContext surfaces that mutate per-thread WFQ weight/latency-class and per-thread scheduling-context binding;
  • the handoff to the 7.1.1 park authority design.

Ownership Split

The process remains the security boundary. All threads in one process share the same address space and capability table, so a thread has the same authority as its sibling threads.

Process-owned stateThread-owned state
Process id and process generationThread id and thread generation
User address space and CR3Saved CPU context and user register state
Capability table and resource ledgerKernel stack and syscall stack top
Initial compatibility ring and ring arena ownershipPer-thread ring endpoint, scratch, and FS base
Read-only CapSet pageScheduling/blocking state
ProcessHandle exit stateThreadHandle join/exit state
Endpoint owner state and process-wide cleanup hooksWFQ weight, latency class, virtual runtime, and virtual_finish_ns enqueue tag
Process-wide resource ledgers (thread records, kernel stacks, cap-table slots)SchedulingContext binding (identity/generation, remaining budget, replenish/deadline timestamps, donation/return slot, notification recorder)

The implementation migrated incrementally. The 7.2.0 slice made each process contain a single initial Thread, with saved context, kernel stack, FS base, and blocking state stored on that thread. Later slices changed scheduler-owned queues, current execution, direct IPC handoff, and wake records to generation-checked ThreadRef values, added creation and lifecycle caps, and then assigned per-thread rings to spawned children.

Scheduler Contract

Scheduler stores runnable execution contexts as thread references, not process ids. A thread reference is (pid, process_generation, tid, thread_generation). The process generation keeps handles from naming a reused process; the thread generation keeps handles from naming a reused thread slot inside a live process.

This identity applies to Scheduler.current, run queues, direct IPC targets, Timer sleep waiters, process/terminal waiters, endpoint caller/receiver wake records, and deferred cancellation state.

Runnable ownership is split across per-CPU run queues (SCHEDULER_CPUS = 4). Each queue is ordered ascending by virtual_finish_ns, which is recomputed per enqueue from virtual_runtime_ns, the thread’s WFQ weight (clamped to [MIN_WEIGHT, MAX_WEIGHT] in capos-abi::scheduler), and a per-class slice scaled by LatencyClass (Interactive divides the slice, Batch multiplies it, Normal/IpcServer pass it through). Default placement targets the current CPU; a bounded steal path balances when a CPU’s local queue is empty, recomputes the WFQ tag at the destination, and records placement-spread / steal migrations under the measure feature. Each per-CPU queue is reserved at thread-create time to the live runnable-capable thread count so timer-tick, unblock, direct-IPC fallback, and steal-requeue paths never allocate.

The run queue, current, direct IPC target, and blocked waiter scans are thread-oriented. Address-space switches happen only when the next runnable thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and FS base are updated on every thread switch because those are thread-local machine resources. Per-thread runtime_ns advances 1:1 with elapsed CPU time; virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight so weight changes the cumulative WFQ share rather than just an enqueue tie-breaker.

SchedulingContext bindings layer dispatcher budget on top of WFQ. A thread may carry at most one SchedulingContextThreadBinding. While bound, the dispatcher charges elapsed time against the binding’s remaining_budget_ns, replenishes from period_ns at the next replenish boundary, records deadline_or_timeout and budget_depleted notifications in the per-context fixed cells, and routes synchronous endpoint donation/return for passive receiver threads (donated_holder in the notification snapshot tracks whether the holder is the donor or the receiver). Stale-generation or revoked caps fail closed before mutating scheduler state. Realtime-island admission, CPU placement enforcement, and overrun-fault policy remain deferred.

The idle path is a per-CPU CPL0 (kernel-mode) idle thread; the former special user-mode idle process has been removed. Each CPU’s idle thread is a kernel-owned execution context — it runs on the kernel PML4 with a dedicated idle kernel stack and cannot block, exit, or hold ordinary caps. A lightweight synthetic idle Process record is retained per CPU only so the idle ThreadRef resolves through scheduler bookkeeping; it maps no user code, stack, or cap ring. See the “Idle paths” section of docs/architecture/scheduling.md.

Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the clockevent/deadline substrate, and a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16, request_sqpoll_start_for_thread / finalize_pending_sqpoll_start_for_thread with stale-owner rollback). Tick suppression now exists behind explicit CpuIsolationLease admission, including ordinary budgeted compute leases that target a live SchedulingContext; policy-service AutoNoHz issuance and generic SQPOLL nohz for arbitrary rings remain future work.

Thread Creation ABI

Thread creation is exposed through a process-local ThreadSpawner capability. It creates threads only in the caller’s current process. It does not grant authority to another process and is non-transferable across IPC in the initial implementation.

The initial control-plane shape is:

interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        fsBase :UInt64,
        flags :UInt64
    ) -> (handleIndex :UInt16);
}

interface ThreadHandle {
    join @0 () -> (exitCode :Int64);
    exitCode @1 () -> (exited :Bool, exitCode :Int64);
}

interface ThreadControl {
    getFsBase @0 () -> (fsBase :UInt64);
    setFsBase @1 (fsBase :UInt64) -> ();
    exitThread @2 (code :Int64) -> ();
}

Any 7.2 schema adjustment must update this page in the same branch before implementation review. The stable semantics are that creation is in-process, the returned handle is an observed result cap, ThreadHandle observes one thread rather than the whole process, and current-thread exit is available through both ThreadControl.exitThread and the raw exit(code) syscall.

The new thread starts in Ring 3 at entry with:

  • RDI = arg;
  • RSI = tid;
  • RDX = pid;
  • RCX = the current thread's ring address;
  • R8 = CAPSET_VADDR, or zero if the process has no CapSet.

The runtime supplies the user stack and TLS block. The kernel validates that entry, stackTop, and fsBase are user-canonical, that stackTop is 16-byte aligned at entry, and that reserved flags bits are zero. Page presence and stack-growth policy remain process address-space questions; before a page-fault subsystem exists, an invalid thread stack can fault the process.

Resource Accounting

Thread creation allocates kernel memory and is quota-backed by process-owned ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges the initial thread during process creation; ThreadSpawner.create extends the same ledgers to additional threads. The ledger of record is:

  • PROCESS_THREAD_LIMIT, the maximum live or retained thread records in one process, initially 16;
  • PROCESS_THREAD_KERNEL_STACK_PAGES, initially matching the current per-thread kernel stack allocation size of 32 pages;
  • thread_records_used / thread_records_max;
  • thread_kernel_stack_pages_used / thread_kernel_stack_pages_max.

The initial process thread charges one thread record and one kernel-stack allocation during process creation. ThreadSpawner.create reserves a thread record and kernel-stack page budget before allocating the stack or publishing a ThreadHandle; every later failure rolls both reservations back before returning. Cap-slot reservation for the result handle remains charged to the existing process cap-table ledger.

Creation failures are controlled application exceptions. Thread count, kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation failure return Overloaded with a specific message and no partially runnable thread. Invalid entry, stack, FS base, or flags return Failed.

Thread exit releases the kernel stack only after the scheduler is running on a different kernel stack. The thread record remains charged while a live ThreadHandle, pending join waiter, or unjoined exit status can still observe it. Once the handle is released without a pending join, or once a one-shot join has consumed the status and no wait record pins it, the retained record charge is released. Process exit releases all thread records and stack charges once.

The off-stack property is enforced by an OffStackToken witness on every stack frame release path: the deferred per-thread drain calls Process::release_thread_kernel_stack, whole-process teardown calls Process::release_all_thread_kernel_stacks, and pre-publication rollback calls Process::rollback_created_thread. The token constructor is private to the scheduler module. Implicit Thread::Drop is deliberately not a release path; if a Thread value reaches its destructor with a nonzero stack, it fails closed by leaving the frames allocated instead of freeing a stack without an off-stack witness.

FS Base And TLS

FS base is thread-owned. The existing ThreadControl.getFsBase and ThreadControl.setFsBase operations keep their names, but after threading they refer to the current thread, not the whole process. setFsBase continues to reject non-user-canonical values and writes the CPU FS-base MSR immediately when called by the running thread. Both methods route through context-aware dispatch (CapCallContext::caller_thread) so the operation always targets the caller, never a different thread; calling ThreadControl from a non-live caller returns ProcessFsBaseError::CallerNotLive.

The initial process thread uses the PT_TLS block installed by ELF loading. Additional threads receive an FS base from ThreadSpawner.create; the runtime is responsible for allocating and initializing each thread’s TLS/TCB data. There is no process-global FS base. Current-thread FS-base operations are useful for the single-thread runtime checkpoint, but they must not be treated as the final threading ABI for language runtimes. True multi-threaded Go or C/POSIX-like runtime support requires each ThreadRef to own a distinct TLS block and FS base.

Context switching must save the outgoing thread’s FS base and restore the next thread’s FS base even when both threads belong to the same process and no CR3 switch is needed.

Thread Identity In Waiters And Dispatch

The concrete identity type for in-process scheduling is:

#![allow(unused)]
fn main() {
ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}
}

Process identity still governs authority and accounting, but wakeup and blocking state must name a thread. 7.2 changes context-aware capability dispatch so CapCallContext carries both the caller process id for authority checks and the caller ThreadRef for wake/cancel decisions. Existing pid-only records that can resume execution or write a caller CQE must be widened before multiple threads can run in one process.

The migration target is:

  • TimerSleepWaiter stores the sleeping ThreadRef and validates the generation before waking it;
  • endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and direct IPC handoff records store the blocked or target ThreadRef;
  • terminal line input and any other ProcessWaiter consumer store the waiting ThreadRef and validate the generation before writing a CQE;
  • ProcessHandle.wait records the waiting ThreadRef while the handle still names the child process;
  • ThreadHandle.join records the waiting ThreadRef and the target ThreadRef;
  • cap_enter blocks the current ThreadRef on that thread’s ring endpoint;
  • process-exit cleanup cancels every waiter whose pid and process_generation match the exiting process, regardless of thread id.

A generation mismatch on wake or completion is a stale waiter and must be drained without writing to userspace. This mirrors current process-generation behavior and prevents one thread slot reuse from receiving another thread’s Timer, endpoint, join, or ring completion.

Exit And Join

The current exit(code) syscall terminates the current thread. This preserves single-thread process exit because the process exits when its last non-idle thread exits, and it avoids tearing down a shared address space while sibling threads are still current on other CPUs.

Thread exit does not add a new syscall. The initial implementation added ThreadControl.exitThread(code) as a terminal capability-ring operation on the current thread, with the same current-thread termination semantics as the raw syscall. A successful invocation does not post a CQE back to the exiting thread, because cap_enter will not return to that execution context. It records the exit code, wakes or completes any valid join waiter, and removes only the current thread from scheduling. If the last non-idle thread in a process exits through exit(code) or exitThread, the process exits with that thread’s code and completes the parent-facing ProcessHandle.

Whole-process termination remains a ProcessHandle operation. It releases the shared capability table, cancels process-owned endpoint state, removes timer/park/ring waiters for every thread in the process, and completes the parent-facing ProcessHandle after the process is no longer current on any CPU.

ThreadHandle.join is process-local and one-shot. If the target thread already exited and its status is retained, join returns its code immediately and marks the status joined. If it is still live, join blocks the caller’s thread until the target exits. Self-join returns Failed. A second waiter, join after a successful join, or join after detach returns Failed; it must not park an ambiguous waiter. ThreadHandle.exitCode is nonblocking and may observe the retained status while the handle is live, but it does not consume the one-shot join right.

Releasing the last ThreadHandle before the target exits detaches the target: the thread continues to run, but no exit status is retained after it exits unless a join waiter already pins the state. Releasing the handle after exit but before join drops the retained status and releases the thread-record charge. A pending join waiter pins the handle state until completion or process exit, so cap release cannot create a use-after-free. The exiting thread’s kernel stack must not be freed while it is still executing on that stack; final process teardown performs an explicit token-gated stack release after another kernel stack is active, before the deferred Process value is dropped.

Fatal user faults remain process-fatal in the first implementation. Per-thread fault isolation can be designed later, after the basic scheduler and futex paths are stable.

Capability Ring And Blocking

The first Ring v2 implementation keeps the initial thread’s compatibility ring at RING_VADDR and gives each spawned child thread a kernel-chosen ring mapping inside the reserved process ring arena. Runtime-selected ring address ranges remain a later VirtualMemory reservation extension.

ThreadSpawner.create allocates a ring record and user mapping for the new thread, stores that mapping on the child ThreadRef, and passes the ring address in the child start registers. cap_enter blocks the current thread against that thread’s own CQ, so same-process sibling threads may block in cap_enter independently. Timer, endpoint, join, park, and cancellation paths must route completions by generation-checked ThreadRef to the target thread’s ring endpoint.

The runtime’s single-owner ring-client invariant remains local to each ring client. Well-formed userspace serializes submission and completion matching per thread ring through capos-rt; it must not have two consumers racing on the same SQ/CQ. The scheduler still refuses to run the exact same ThreadRef on two CPUs at once, but it no longer treats every multithreaded pid as tied to one scheduler CPU.

This is sufficient for functional same-process sibling scheduling. The formal accepted 1-to-2 make run-thread-scale capOS evidence is the capos-bench 2026-05-02 21:38 UTC pair (work 1.883x, total 1.787x, both clearing the configured 1.6x gates). The guest result row’s accepted field remains diagnostic; the host summary enforces the work-window and total-time gates, and refuses speedup enforcement unless CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS records the QEMU CPU pin set. Linux validates the repaired benchmark shape through four workers on physical cores (3.963x/3.858x). That capOS 4-worker row was diagnostic (1.566x/1.538x) and justified Phase D’s per-CPU WFQ queues plus bounded stealing. The 2026-05-10 Phase D rerun recorded 1-to-4 work/total diagnostics 3.088x/2.700x, manually accepted for closeout; remaining risks are the shared scheduler lock, temporary CPU pinning, CQ/join/exit/block/schedule overhead, broader workload classes, and higher-thread-count evidence.

Scheduling Policy And Context Authority

SchedulingPolicyCap is the caller-thread-bound surface for WFQ knobs. Every method routes through CapCallContext::caller_thread; there is no per-cap-object ThreadHandle, no badge-encoded thread id, and no cross-thread mutation in this slice. Cross-thread authority is deferred to the privileged scheduler-policy service plan. The schema shape is:

interface SchedulingPolicyCap {
    setWeight @0 (weight :UInt16) -> ();
    setLatencyClass @1 (class :LatencyClass) -> ();
    snapshot @2 () -> (
        weight :UInt16,
        class :LatencyClass,
        runtimeNs :UInt64,
        virtualRuntimeNs :UInt64,
    );
}

setWeight validates against [MIN_WEIGHT, MAX_WEIGHT] at the cap boundary and updates the caller thread’s WFQ weight; the new weight applies to the next enqueue’s virtual_finish_ns tag and to subsequent virtual_runtime_ns accounting. setLatencyClass swaps the per-thread LatencyClass (Normal, Interactive, IpcServer, Batch) used to scale the dispatcher slice. snapshot is a read-only observer over the core WFQ state and does not expose the measure-only counters.

SchedulingContext is the schema-typed cap for dispatcher budget authority:

interface SchedulingContext {
    info @0 () -> (info :SchedulingContextInfo);
    create @1 (spec :SchedulingContextSpec) -> (
        contextIndex :UInt16,
        identity :SchedulingContextIdentity,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    bindCallerThread @2 () -> (
        identity :SchedulingContextIdentity,
        binding :SchedulingContextBinding,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    revoke @3 () -> (
        identity :SchedulingContextIdentity,
        previousGeneration :UInt64,
        result :SchedulingContextOperationResult,
        dispatchEffect :SchedulingContextDispatchEffect,
    );
    drainNotifications @4 () -> (
        notifications :SchedulingContextNotificationSnapshot,
    );
}

create returns a same-interface child context as transferred result cap 0 and becomes chargeable only after bindCallerThread. revoke bumps the generation and clears any matching thread binding; later calls through the stale cap generation report staleGeneration or fail closed before mutating scheduler state. drainNotifications reads the fixed per-context budget-depleted and deadline-or-timeout slots; the scheduler updates these in place from hard paths without allocation, including the holder identity and a donatedHolder bit for endpoint donation/return. The bootstrap manifest grants SchedulingPolicyCap and SchedulingContext only to focused-proof manifests; the default boot manifest does not grant them.

Userspace API Surface

The capos-rt runtime exposes the threading caps as typed clients on top of the per-thread ring:

  • ThreadControlClientget_fs_base/set_fs_base/exit_thread, including *_wait blocking variants over RuntimeRingClient.
  • ThreadSpawnerClient::create – submits the entry/stackTop/arg/ fsBase/flags ABI and returns an OwnedCapability<ThreadHandle> delivered as transferred result cap 0 in the CQE.
  • ThreadHandleClientjoin, exit_code (nonblocking observer), and their finish_* helpers; finish_join decodes the one-shot exit code.
  • SchedulingPolicyClientset_weight, set_latency_class, and snapshot, all caller-thread-bound.
  • SchedulingContextClientinfo, create, bind_caller_thread, revoke, and drain_notifications.

A typical spawn/join pseudocode against these clients is:

#![allow(unused)]
fn main() {
let handle = thread_spawner.create_wait(
    &mut ring,
    entry_addr,
    user_stack_top,
    arg,
    fs_base,
    /* flags */ 0,
    timeout_ns,
)?;
// ... runtime work on the parent thread ...
let exit_code = thread_handle
    .join_wait(&mut ring, timeout_ns)?;
}

The userspace runtime is responsible for the user stack, TLS/TCB, and any free-list bookkeeping for retired handles; the kernel only validates the ABI fields and charges the per-process ledgers.

Park Handoff

Park authority is defined in Park Authority. The scheduler changes above must leave room for a thread block reason that is not tied to the process ring CQ. The frozen handoff is:

  • park wait blocks the current thread, not the whole process;
  • park wake makes selected generation-checked ThreadRef values runnable;
  • timeouts use the same monotonic time base as Timer;
  • private park keys are based on address-space identity plus user virtual address;
  • shared-memory park keys are MemoryObject-derived identity plus offset;
  • the first implementation starts with compact CAP_OP_PARK and CAP_OP_UNPARK operations rather than generic Cap’n Proto methods;
  • park wait SQEs are thread-owned so ring dispatch cannot park a sibling thread under the waiter’s user_data;
  • blocking park wait is a syscall-context operation that releases runtime ring-client ownership before the thread parks, while capos-rt demultiplexes reserved park CQEs back to the waiting thread.

Pre-thread 4.5.4 measurement chose the compact capability-authorized shape for failed wait and empty wake. 4.5.5 measured the real blocked/resume path through thread-lifecycle under make run-measure, so the compact ParkSpace opcodes remain the runtime ABI target for this slice.

Security Invariants

  • A thread never owns a separate capability table in the initial model.
  • A thread cannot escape the authority of its containing process.
  • A ThreadHandle names only a thread in the same process and is non-transferable in the first implementation.
  • Thread creation is charged to one process-owned thread/kernel-stack ledger of record before the thread can become runnable.
  • Process exit releases shared authority once, after all live threads are removed from scheduling.
  • Per-process resource quotas are shared by all threads.
  • ThreadControl changes only the current thread’s FS base.
  • ThreadControl.exitThread terminates only the current thread and is a capability-ring operation, not a syscall.
  • Every waiter or direct handoff that can resume execution stores a generation checked ThreadRef.
  • Process-owned user-buffer validation/copy/read paths hold the process AddressSpace lock; future shared-memory thread primitives still need mapping provenance or object pins when they derive keys from shared backing.

Implementation Order

  1. Add internal Thread state, make each process own one initial thread, move saved context / kernel stack / FS base / block state onto that thread, and charge the initial thread against private process ledgers. Done 2026-04-24 23:09 UTC.
  2. Change scheduler queues, blocking, exit cleanup, and direct IPC targets from pid-oriented state to thread references while preserving one thread per process. Done 2026-04-24 23:33 UTC.
  3. Add ThreadSpawner, ThreadHandle, and ThreadControl.exitThread with a QEMU smoke for create, join, detach, self-join rejection, second join rejection, and last-thread process exit. Done 2026-04-25.
  4. Implement the ParkSpace private wait/wake path from Park Authority after the scheduler can block and wake individual threads, then run 4.5.5 blocked/resume measurements before declaring the park ABI stable. Done 2026-04-25.

Validation

The thread-lifecycle proof creates multiple threads in one process, proves they share the address space and CapSet, proves each has an independent FS base, rejects invalid join cases, joins one thread from another, and lets the last thread exit the process. The existing make run-spawn path keeps covering runtime-fs-base and single-thread-runtime so regressions in the pre-thread runtime contract stay visible. make run-measure additionally records the private ParkSpace blocked/resume timings and proves process exit with a parked park waiter. Phase D fairness/Interactive/weight-change smokes (make run-thread-fairness, make run-thread-fairness-interactive, make run-thread-fairness-weight-change) exercise the SchedulingPolicyCap caller-thread-bound surface; the thread-scale proof carries the recorded WFQ scaling evidence. The recorded 1-to-2 work/total speedup gate is the host-enforced Phase D acceptance criterion; the 1-to-4 row remains a manually accepted diagnostic. Safe runtime park wrappers and a focused SchedulingContext budget/donation/notification smoke remain future capos-rt and harness work.

Park Authority Contract

This page freezes the 7.1.1 design contract for thread-park (park/unpark) authority. It is the handoff from the in-process threading contract to the 7.2 implementation work and records the first 7.2.3 implementation status.

Linux prior art. Park solves the same problem as Linux futex(2): userspace owns the uncontended fast path through atomic operations on a 32-bit word, and the kernel parks/wakes threads only on contention. capOS uses the distinct name Park because the contract differs in important ways from Linux’s: it is capability-gated (no ambient authority), there is no priority inheritance, no requeue, no robust lists, and the shared variant is keyed by MemoryObject identity rather than (inode, pgoff). References to “Linux futex” in this page point to that prior art, not to the capOS API surface.

Scope

The first park milestone stays single-CPU and in-process. It gives a multi-threaded runtime one kernel primitive: park the current thread when a userspace word still has an expected value, and wake parked threads associated with that word. Userspace owns the uncontended path through ordinary atomic operations; the kernel owns only the contended sleep/wake path and timeout integration.

This contract covers:

  • production park authority objects;
  • private and shared park key identity;
  • the provisional compact wait/wake transport ABI;
  • scheduler, timeout, and process-exit interactions;
  • resource-accounting and security invariants;
  • the 4.5.5 measurement loop after real thread blocking exists.

This is not a Linux futex(2) compatibility surface. Priority inheritance, requeue, robust lists, shared-memory park-words before MemoryObject mapping identity is exposed, and SMP-safe user-buffer pinning remain later work.

Implementation Status

The 2026-04-25 7.2.3 slice implements:

  • schema marker interfaces for ParkSpace and SharedParkSpace;
  • compact CAP_OP_PARK and CAP_OP_UNPARK opcodes;
  • process-local, non-transferable ParkSpace grants through boot/spawn manifests;
  • private wait/wake keyed by the caller process address space and user virtual address;
  • per-thread Park block state with finite timeout integration;
  • one reserved CQE credit per parked waiter so wake/timeout delivery cannot be crowded out by ordinary completions;
  • QEMU correctness coverage in thread-lifecycle for mismatch, immediate timeout, wake-one, wake-many, anonymous VirtualMemory multi-waiter unmap range cleanup with stale wake-after-reuse checks, anonymous VirtualMemory.decommit reuse stale waiter cleanup, and MemoryObject.unmap borrowed-mapping reuse stale waiter cleanup;
  • 4.5.5 QEMU timing coverage in run-measure.

SharedParkSpace is a marker only. capos-rt has the marker type but no safe park client wrapper yet; the current correctness and measurement demos use raw compact SQEs so the ABI can settle before runtime synchronization wrappers claim the user_data namespace.

Design Grounding

The reviewed project documents for this contract are:

  • docs/tasks/README.md;
  • docs/roadmap.md;
  • REVIEW.md;
  • docs/architecture/threading.md;
  • docs/architecture/scheduling.md;
  • docs/architecture/userspace-runtime.md;
  • docs/proposals/go-runtime-proposal.md.

The relevant research grounding is:

  • docs/research/out-of-kernel-scheduling.md for the kernel-assisted wait/wake split used by language runtimes;
  • docs/research/llvm-target.md for the Go/runtime syscall surface that needs thread creation, per-thread TLS, and futexes;
  • docs/research/genode.md for typed capability precedent and resource-accounted session state.

Authority Objects

ParkBench remains measurement-only. It is not a production authority and must not be granted by normal boot manifests.

The first production model has two authority objects:

interface ParkSpace {}
interface SharedParkSpace {}

These schema interfaces are marker interfaces for typed CapSet/result-cap identity. The wait and wake operations use compact ring opcodes rather than Cap’n Proto methods, because the pre-thread 4.5.4 measurement showed the generic Cap’n Proto path is not the right default for the park hot path.

ParkSpace is minted for a process by the same bootstrap/spawn path that grants ThreadControl and ThreadSpawner. It is process-local and non-transferable in the initial implementation. Holding it authorizes private park wait/wake only in the caller’s own address space; it does not grant memory access, cross-process wake authority, or the right to name arbitrary kernel wait queues.

SharedParkSpace is the shared-park object for a later MemoryObject-derived slice. A MemoryObject holder can derive a SharedParkSpace scoped to that MemoryObject’s backing identity. Shared park operations through that SharedParkSpace are keyed by object offset, not by one process’s virtual address. The first 7.2 implementation may leave SharedParkSpace unimplemented, but it must not choose a private-key ABI that prevents this shared-key model.

Park Keys

Private park keys are address-space scoped:

#![allow(unused)]
fn main() {
ParkKey::Private {
    address_space_id,
    address_space_generation,
    uaddr,
}
}

The first implementation can derive address_space_id and generation from the process id/generation while each process owns exactly one address space. The contract names address-space identity deliberately so a later fork/shared-AS model does not inherit a pid-shaped key.

Private parks are synchronization inside one address space. wake for a private key may wake only waiters in the same address space generation; a raw virtual address alone is never cross-process synchronization authority.

Shared park keys are MemoryObject scoped:

#![allow(unused)]
fn main() {
ParkKey::Shared {
    memory_object_id,
    memory_object_generation,
    offset,
}
}

Shared keys are disabled until the kernel can prove, while handling a park operation, that the submitted user address maps the MemoryObject backing the SharedParkSpace and can compute the byte offset in that backing object. Virtual aliases of the same shared page must converge on the same shared key. Private aliases within one address space do not converge unless they use the same user virtual address.

Shared parks require explicit shared-memory authority through the MemoryObject-derived SharedParkSpace. Never use raw virtual address alone for cross-process park/futex keys.

All park words are 32-bit and must be 4-byte aligned. wait validates the word as a readable user mapping before reading it. wake validates that the address is user-canonical and aligned; shared wake additionally validates the MemoryObject mapping identity so a caller cannot wake an unrelated object by guessing an offset.

Private-key cleanup is part of the ParkSpace contract, not an implementation detail of the Go runtime. Unmap, revoke, address-space generation change, and address-space teardown must drain or fail waiters for the old private key before the same virtual address can be reused as unrelated state. A stale private waiter may complete only against the address-space generation it was registered under; it must not observe or wake a later mapping with the same numeric uaddr.

Current implementation status: process/thread-exit cleanup exists. Anonymous VirtualMemory.unmap, VirtualMemory.decommit, and MemoryObject.unmap for borrowed mappings drain private waiters whose uaddr lies in the affected range by posting PARK_INTERRUPTED through the waiter’s reserved completion credit before making the blocked thread runnable. Cleanup removes the waiter from the address-keyed private wait table before attempting the completion. If that completion cannot be posted immediately, the thread remains blocked in a pending park-completion state with the exact completion status and reserved completion credit still charged, and scheduler wake processing retries the stored status; the waiter is not restored to the uaddr table while the virtual address can be reused. Shared park-word cleanup and explicit address-space generation teardown remain open. Until those land, the implemented private path is suitable for process-lifetime park words, anonymous VirtualMemory regions that use these unmap/decommit paths, and borrowed MemoryObject mappings that are explicitly unmapped with MemoryObject.unmap.

The ordinary QEMU proof covers wake-one, wake-many, handoff wake retry, multi-waiter private range cleanup, and stale wake-after-reuse. It does not deterministically force the transient unmap interruption ring-scratch contention race that can make the first interruption completion post fail: from userspace, waiter submission is observable before the kernel registers the waiter or after ring dispatch has released the scratch buffer. The production cleanup path therefore treats that race as a retry state outside the address-keyed waiter table rather than restoring the waiter.

Provisional Ring ABI

The 7.2 implementation starts with compact capability-authorized operations:

  • CAP_OP_PARK;
  • CAP_OP_UNPARK.

The numeric opcode values are assigned when the implementation edits capos-config/src/ring.rs. CAP_OP_PARK_BENCH remains reserved for measurement-only kernels and must not be repurposed.

CAP_OP_PARK uses the existing 64-byte SQE fields as:

SQE fieldMeaning
cap_idParkSpace for private wait, or SharedParkSpace for shared wait
user_datareturned in the wait completion CQE
addruser virtual address of the 32-bit park word
lenexpected 32-bit value
pipeline_deprelative timeout in monotonic nanoseconds; u64::MAX means no timeout
flagsmust be CAP_SQE_THREAD_OWNED
call_idowning thread id; a different thread leaves the SQE at the ring head

CAP_OP_UNPARK uses:

SQE fieldMeaning
cap_idParkSpace for private wake, or SharedParkSpace for shared wake
user_datareturned in the wake caller’s completion CQE
addruser virtual address of the 32-bit park word
lenmaximum number of waiters to wake; zero is malformed

Both operations require method_id, result_addr, result_len, pipeline_field, xfer_cap_count, and _reserved0 to be zero. CAP_OP_UNPARK also requires flags == 0, pipeline_dep == 0, and call_id == 0. Park operations are not promise-pipelineable in this slice. pipeline_dep is used as the wait timeout storage only for CAP_OP_PARK; future promise pipelining must keep rejecting CAP_SQE_PIPELINE on park opcodes or replace the park ABI in a reviewed branch.

Wait completions use non-negative CQE.result statuses:

ResultMeaning
PARK_WOKEN = 0a wake operation made the thread runnable
PARK_VALUE_MISMATCH = 1the loaded word did not equal expected
PARK_TIMED_OUT = 2the timeout expired before a wake
PARK_INTERRUPTED = 3a future cancellation/interrupt path aborted the wait

Wake completions return the non-negative number of threads woken. Malformed SQEs, invalid caps, unreadable wait words, unsupported cap object types, and stale authority use the existing negative transport errors until a later ABI adds a more specific compact-error namespace.

Ring Ownership And Dispatch Context

Park operations use the process capability ring for submission and CQE delivery, but blocking wait is not an ordinary long-lived runtime call. A runtime must not hold RuntimeRingClient while the thread is parked in CAP_OP_PARK; otherwise no sibling thread in the same process can borrow the same ring client to submit CAP_OP_UNPARK.

The runtime contract for park operations is:

  • capos-rt owns a process-wide park submission/completion path separate from the generic request-buffer RuntimeRingClient pending-call list;
  • park wait reserves a unique user_data value, writes the SQE while holding the runtime’s ring-submission lock, records a park-wait completion slot in runtime-owned memory, and releases the ring-submission lock before entering cap_enter;
  • park wait sets CAP_SQE_THREAD_OWNED and call_id to the current thread id so a sibling thread cannot drain the wait and park the wrong ThreadRef;
  • the park user_data namespace is reserved by the runtime so ordinary generic clients cannot accidentally claim a park completion;
  • all runtime CQ draining must route reserved park user_data completions to the park-wait slot instead of treating them as generic client completions;
  • if another thread drains the waiter CQE before the waiting thread returns from cap_enter, the waiting thread reads the already-recorded status from that park-wait slot;
  • park wake may use the ordinary serialized ring submission path because it completes without parking the caller’s thread.

CAP_OP_PARK is syscall-context only. Timer ring polling and any future interrupt-context ring drain must leave it unconsumed because consuming it can block the current thread and mutate scheduler state. CAP_OP_UNPARK also starts as syscall-context only; widening wake to timer polling would need a separate review of scheduler locking and completion delivery.

This design preserves one process ring and the single blocked cap_enter waiter rule. A thread blocked in Park is not the process ring’s CapEnter waiter, so a sibling can still enter the kernel to submit wake, Timer, IPC, or ordinary capability work through the same process ring.

Wait And Wake Semantics

wait is atomic with respect to wake for the same key:

  1. validate the SQE shape, including thread ownership, and authority cap;
  2. verify call_id names the current thread so a sibling cannot park on behalf of the waiter;
  3. validate the user address shape and derive the private or shared park key;
  4. lock the current process AddressSpace across validation and the user-word read for private keys; future shared keys must additionally prove mapping identity or pin the backing object;
  5. take the park bucket lock;
  6. read the 32-bit user word while the bucket lock is held;
  7. compare the loaded value with expected;
  8. if the value differs, post PARK_VALUE_MISMATCH without blocking;
  9. if the value matches and the timeout is zero, post PARK_TIMED_OUT without blocking;
  10. otherwise, record the current ThreadRef, key, timeout deadline, and user_data, then block only the current thread.

The user-word read, comparison, and enqueue are serialized with wake by the park scheduler path, and the read itself occurs while the process AddressSpace mutex is held. This prevents a page-table validation/use race and the classic lost wake where a waiter reads the old value, a sibling stores the new value and wakes no one, and the waiter then parks based on the stale read. Shared park-words still need mapping provenance or object pinning so a MemoryObject-derived key cannot be swapped out from under key derivation. The user word is not a kernel-owned mutex. Runtime code must use normal atomic load/store and memory-ordering rules around the park word.

wake derives the same key, removes up to maxWake valid waiters from that key’s FIFO list, posts PARK_WOKEN completions to the waiting process ring using the completion credits reserved when those waiters parked, and marks those ThreadRef values runnable after generation checks. A wake SQE is consumed only when the kernel can also post the wake caller’s own CQE; if that ordinary CQ slot is not available, no waiters are removed and the SQE remains pending like other uncompletable ring work. Stale waiters caused by thread or process generation mismatch are drained without writing to userspace, release their reserved completion credits, and do not count as successfully woken. If a valid waiter is still in a current or handoff CPU slot when the wake path removes it from the address-keyed table, the wake still counts that waiter as woken and records a pending PARK_WOKEN completion for scheduler retry.

Timeouts use the same monotonic time base as Timer. The kernel may convert nanoseconds to scheduler ticks internally, but the ABI remains nanoseconds. Finite deadlines post PARK_TIMED_OUT through the waiting process ring using the waiter’s reserved completion credit and wake the blocked thread if the thread generation still matches.

An explicit wake, timeout, cancellation, process exit, and unmap/revoke cleanup race must produce exactly one waiter completion or cleanup-consumption path. Once any path consumes the waiter record, the other racing paths must observe it as gone and must not post a second CQE or wake a later ThreadRef.

Process exit removes every park waiter whose pid/process generation matches the exiting process. Thread exit removes that thread’s own park waiter before the thread record can be retained for join observation. These cleanup paths must not allocate.

Unmap, mapping revoke, and address-space teardown remove or fail private waiters for the affected key/generation before the old virtual address range is made reusable for unrelated mappings. A wake or timeout racing with cleanup must either complete the old waiter under its original generation or observe that cleanup already consumed it; it must not post a completion to a new owner of the same numeric address.

Resource Accounting

Park waits are bounded by the process thread ledger. A thread can be in only one scheduler block reason, so live park waiters cannot exceed live threads. The first private ParkSpace implementation stores the wait node in thread-owned block state and links it into a fixed process-owned waiter table. That is valid only because private ParkSpace caps are process-local and the first key is the process address space plus user virtual address. Shared SharedParkSpace support must move to object-owned fixed buckets scoped to MemoryObject identity. Wait, wake, timeout, and process-exit cleanup must not allocate. Registering a blocking wait reserves one deferred CQE credit in the waiting process. Ordinary completion posting treats reserved credits as unavailable, so wake and timeout paths can always post the waiter completion without losing the waiter. If the kernel cannot reserve that credit, it must not enqueue or block the wait; it either leaves the SQE pending until capacity exists or posts a negative completion for the wait attempt without consuming a waiter slot.

ParkSpace creation is charged as ordinary process capability/table state. If the first implementation needs per-process bucket storage beyond the cap object itself, that storage must be reserved before the ParkSpace is published and released when the process exits or the cap is finally dropped.

In the first private implementation, the waiter table is process-owned and survives release of the ParkSpace handle. CAP_OP_RELEASE of the last capability handle removes submit authority but cannot free a parked waiter’s storage. A waiter can still receive a PARK_WOKEN CQE from a wake operation that already resolved the authority object, a PARK_TIMED_OUT CQE from a finite deadline, or a future PARK_INTERRUPTED CQE from an explicit cancellation path. Thread or process exit drains the wait node without posting a CQE to the exiting thread/process and releases the reserved completion credit. If a runtime drops the last ParkSpace while it has indefinite waiters, it can deadlock its own process, but it cannot create a use-after-free or leak authority outside that process. Future shared SharedParkSpace storage must use explicit non-cap-table waiter pins so object-owned buckets are not freed while parked waiters remain.

SharedParkSpace storage is charged to the MemoryObject-derived object when shared parking lands. It must not create a second unbounded resource path where a holder can allocate wait queues by touching many offsets.

Security Invariants

  • Holding a ParkSpace or SharedParkSpace authorizes blocking/waking, not memory access. Wait still requires a readable user word.
  • Private ParkSpace caps are process-local and non-transferable in the first implementation.
  • Shared park authority must be derived from MemoryObject identity and offset, not from another process’s virtual address.
  • Park wait blocks the current thread, not the whole process.
  • Park wait SQEs are thread-owned; a non-owner cap_enter leaves the SQE at the ring head instead of parking the wrong thread.
  • Park wake can only make generation-checked ThreadRef values runnable.
  • Park completions are posted to the waiting process ring using the waiter SQE’s user_data.
  • Blocking wait registration reserves one CQE credit for the eventual waiter completion, and wake must not remove a waiter unless that credit exists.
  • CAP_OP_PARK is dispatched only from syscall-context cap_enter and never from timer or interrupt-context ring polling.
  • A parked private ParkSpace waiter is stored in process-owned fixed storage; future shared SharedParkSpace waiters must pin the authority object backing their bucket table until wake, timeout, thread exit, or process exit removes the waiter.
  • One process ring still has at most one blocked cap_enter waiter in 7.2; park wait does not create an extra blocked ring waiter.
  • Private ParkSpace wait reads hold the process AddressSpace lock across validation and the user-word read. SharedParkSpace park-words remain blocked until MemoryObject mapping provenance or explicit object pins cover shared key derivation.

Measurement Handoff

4.5.4 measured failed wait and empty wake before real threads existed. That result chooses a compact capability-authorized operation as the starting ABI for 7.2 rather than a generic Cap’n Proto wait/wake method pair.

4.5.5 is closed for the first real thread-blocking path. It measures:

  • value-mismatch wait;
  • empty wake;
  • wait-to-block;
  • wake-to-runnable;
  • wake-to-resume through cap_enter.

The 2026-04-25 QEMU sample printed:

[thread-lifecycle] park path avg cycles: failed_wait=6778 empty_wake=6840 wait_to_block=55994326 wake_to_runnable=28219 wake_to_resume=28000684

The compact shape still holds for this slice: CAP_OP_PARK and CAP_OP_UNPARK remain the production runtime ABI target, while ParkBench remains measurement-only.

Implementation Order

  1. Add ParkSpace and SharedParkSpace marker interfaces plus compact opcode constants.
  2. Add a process-local ParkSpace grant path next to ThreadControl and ThreadSpawner; keep it non-transferable.
  3. Add thread-owned Park block state and fixed private waiter storage with no wait/wake allocation.
  4. Dispatch CAP_OP_PARK and CAP_OP_UNPARK against ParkSpace for private address-space keys.
  5. Add QEMU smoke coverage for mismatch, timeout, wake-one, wake-many, and handoff wake retry. Safe runtime park wrappers remain a later capos-rt slice.
  6. Run 4.5.5 blocked/resume measurements and fold the result into the final ABI decision.
  7. Drain or fail private waiters before the affected virtual address range can be reused. Anonymous VirtualMemory.unmap and VirtualMemory.decommit, plus MemoryObject.unmap for borrowed mappings, are covered; shared park-word cleanup and address-space generation teardown remain open.
  8. Add MemoryObject-derived SharedParkSpace support only after mapping provenance or object pins cover shared key derivation under the same validation/use discipline.

Validation

The thread-lifecycle proof creates multiple threads in one process, parks threads on a userspace park word, wakes them through the same ParkSpace, proves timeout and value-mismatch paths, and shows that process exit drains pending waits. make run-measure records failed-wait, empty-wake, wait-to-block, wake-to-runnable, and wake-to-resume timings for the implemented private path. Safe capos-rt park wrappers remain future runtime work.

Capability Model

How capabilities work in capOS.

What is a Capability

A capability in capOS is a reference to a kernel object that carries:

  • An interface (what methods can be called), defined by a Cap’n Proto schema
  • A permission (the object it references, enforced by the kernel)
  • A wire format (Cap’n Proto serialized messages for all invocations)

A process can only access a resource if it holds a capability to it. There is no ambient authority – no global namespace, no “open by path” syscall, no implicit resource access.

Identity Terms and Authority

capOS documentation uses identity terms as policy metadata, not as kernel authorization primitives. A user is human-facing prose. A principal is the stable identity metadata used by authentication, policy, audit, and ownership records. An account is planned durable local record state for a principal, including credential references, roles, attributes, storage-root references, and default profile names. A session is the live context that receives a concrete CapSet. Policy profiles and resource profiles select bundle fragments, approval eligibility, and quotas that a trusted broker may use when minting capabilities.

None of those terms is kernel authority: the kernel dispatches through generation-tagged CapId entries, not users, roles, accounts, groups, UIDs, or profile names. Account-store behavior, durable profile records, and broader quota policy remain future work tracked in the local users backlog.

Session-Bound Invocation Context

Services should not infer authority from caller-supplied identity fields. A request parameter such as user, principal, client, or role is data. The active model is one immutable session context per process plus explicit capabilities granted by a broker or supervisor.

The general pattern is:

  • authentication or admission creates a live SessionContext;
  • process spawn installs exactly one immutable session context in the child;
  • AuthorityBroker grants service roots/facets appropriate to that session;
  • endpoint calls carry privacy-preserving caller-session metadata by default;
  • subject details such as global principal id, display name, profile class, or external claims are disclosed only through explicit client disclosure and a matching broker/service disclosure scope. The current endpoint CALL path implements this as a disclosure request mask intersected with cap-held disclosure scope.

The kernel role is narrower. It verifies that a process holds a live cap-table entry, that the process session is live, and that transfer/spawn obey session scope. It may deliver an opaque service-scoped caller-session reference and freshness result to endpoint servers, but it must not disclose broader subject details by default. It does not decide that a process is Alice, an operator, a moderator, or an NPC. Those are policy facts maintained by session, broker, account, and application services.

Opaque receiver selectors may still exist in the IPC implementation and in historical service-object routing tests. A receiver selector is not identity metadata, not shell syntax, not a user field, not a disclosure channel, and not a role bit. New shared-service identity should use the caller session context and broker-granted service facets, not caller-selected numeric labels. The chat demo now follows this rule for membership: the server receives the endpoint caller metadata and keys member records by an opaque live caller-session reference, while chat handles remain request data and visible member labels are assigned by the service. The shared chat/adventure endpoint helper now exposes caller-session metadata through EndpointCaller instead of a badge field; the old badge-named user-data type remains only as a source-compatible alias. Terminal output and shell-serviced stdio bridges are also gated by live caller-session metadata.

Schema as Contract

Capability interfaces are defined in .capnp schema files under schema/. The schema is the canonical interface definition. Currently defined:

interface Console {
    write @0 (data :Data) -> ();
    writeLine @1 (text :Text) -> ();
}

interface TerminalSession {
    write @0 (data :Data) -> ();
    writeLine @1 (text :Text) -> ();
    readLine @2 (request :LineRequest) -> (status :LineStatus, line :Data);
}

interface FrameAllocator {
    allocFrame @0 () -> (handleIndex :UInt16);
    allocContiguous @1 (count :UInt32) -> (handleIndex :UInt16);
}

interface MemoryObject {
    info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
    map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @2 (addr :UInt64, size :UInt64) -> ();
    protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

interface Endpoint {}

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
}

interface BootPackage {
    manifestSize @0 () -> (size :UInt64);
    readManifest @1 (offset :UInt64, maxBytes :UInt32) -> (data :Data);
}

# Management-only introspection. Ordinary handle release uses the system
# transport opcode CAP_OP_RELEASE, not a method here.
interface CapabilityManager {
    list @0 () -> (capabilities :List(CapabilityInfo));
    revoke @1 (capId :UInt32) -> ();
    # grant is planned for a later Stage 6 management slice
}

Each interface has a unique 64-bit TYPE_ID generated by the Cap’n Proto compiler. TYPE_ID is the schema constant. interface_id is the runtime metadata used by CapSet/bootstrap descriptions and endpoint delivery headers. Method dispatch uses the interface assigned to the capability entry plus method_id; method_id selects a method inside that schema.

This is not capability identity. A CapId is the authority-bearing handle in a process table, analogous to an fd. Multiple capabilities can expose the same interface:

  • cap_id=3 -> serial-backed Console
  • cap_id=4 -> log-buffer-backed Console
  • cap_id=5 -> Console proxy served by another process

All three use the same Console TYPE_ID, but they are different objects with different authority. The manifest/CapSet should record the expected schema TYPE_ID as interface metadata for typed handle construction. Normal CALL SQEs do not need to repeat it because the kernel or serving transport can derive it from the target capability entry. CapSqe keeps reserved tail padding for ABI stability.

The kernel exposes the initial CapSet to each process as a read-only 4 KiB page mapped at capos_config::capset::CAPSET_VADDR and passes its address in RDX to _start. The page starts with a CapSetHeader { magic, version, count } and is followed by CapSetEntry { cap_id, name_len, interface_id, name: [u8; 32] } records in manifest declaration order. Userspace looks up caps by the manifest name rather than by numeric index (capos_config::capset::find), so grants can be reordered in system.cue without breaking clients. The mapping is installed without WRITABLE so userspace cannot mutate its own bootstrap authority map.

Security invariant: a CapTable entry exposes one public interface. If the same backing state must be available through multiple interfaces, mint multiple capability entries, each wrapping the same state with a narrower interface. Do not grant one handle that accepts unrelated interface_id values; that makes hidden authority easy to miss during review.

Invocation Path

Capabilities are invoked via a shared-memory capability ring (io_uring- inspired). Each process has a submission queue (SQ) and completion queue (CQ) mapped into its address space. Two invocation paths exist:

Caller builds capnp params message
    → serialize to bytes (write_message_to_words)
    → write CALL SQE to SQ ring (pure userspace memory write)
    → advance SQ tail
    → caller invokes cap_enter for ordinary capability methods
      (timer polling only runs explicitly interrupt-safe CALL targets)
    → kernel reads SQE, validates user buffers
    → CapTable.call(cap_id, method_id, bytes)
    → kernel writes CQE to CQ ring
    ... caller reads CQE after cap_enter, or spin-polls only for
        interrupt-safe/non-CALL ring work ...
    → caller reads CQE result

CapObject::call does not receive a caller-supplied interface ID. The cap table derives the invoked interface from the target entry before invoking the object. The SQE carries only the capability handle and method ID because each capability entry owns one public interface:

#![allow(unused)]
fn main() {
pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}
}

All communication goes through serialized capnp messages, even when caller and callee are in the same address space. This ensures the wire format is always exercised and makes the transition to cross-address-space IPC seamless.

The result buffer is supplied by the caller (the user-validated SQE result region). Implementations serialize directly into it and return the number of bytes written, so the kernel’s dispatch path does not allocate an intermediate Vec<u8> per invocation.

Capability Table

Each process has its own capability table (CapTable), created at process startup. The kernel also maintains a global table (KERNEL_CAPS) for kernel-internal use. Each table maps a CapId (u32) to a boxed CapObject.

CapId encoding: [generation:8 | index:24]. The generation counter increments when a slot is freed, so stale CapIds (from a previous occupant of the slot) are rejected with CapError::StaleGeneration rather than accidentally referring to a different capability.

Generation wrap must not resurrect old authority. The implemented table retires a slot permanently when its 8-bit generation would wrap from 255 back to 0; that slot is not returned to the free list. Heavy churn can therefore exhaust a table even when many retired slots are empty, but the failure mode is CapError::TableFull, not stale-cap revalidation. Future widening of CapId generation bits is an ABI change and belongs in the schema/ring ABI evolution track.

Operations:

  • insert(obj) – register a new capability, returns its CapId
  • get(id) – look up a capability by ID (validates generation)
  • remove(id) – revoke a capability, bumps slot generation
  • call(id, method_id, params) – dispatch a method call against the interface assigned to the capability entry

Every current boot manifest gives only initConfig.init a kernel-built capability table. The default system.cue manifest boots the standalone init binary, which reads BootPackage, validates initConfig.services, and spawns capos-shell, the remote-session CapSet gateway, and resident demo services through ProcessSpawner. The Telnet gateway fixture is retired with the kernel socket owner. Focused shell-led manifests such as system-smoke.cue and system-shell.cue still boot capos-shell directly as initConfig.init for narrow login/shell proofs. Focused init-executor manifests such as system-spawn.cue also boot the standalone init binary with Console, BootPackage, and ProcessSpawner for isolated ProcessSpawner coverage. Child capabilities are assembled from explicit spawn grants in declaration order: raw grants preserve the source capability metadata, legacy endpoint-client grants attenuate an endpoint owner or ProcessSpawner endpoint result source to a client facet while preserving delegated receiver metadata, and child-local Endpoint, FrameAllocator, and VirtualMemory grants are minted for the child’s process. Endpoint kernel grants return parent-side client facets as result caps; init uses those facets for later service imports and releases them before waiting on children. Kernel bootstrap now builds only initConfig.init kernel-sourced caps; CapSource::Service resolution stays in init’s BootPackage executor path. CapRef.source is structured CUE inside initConfig.services, not an authority string:

{
    name:                "client"
    expectedInterfaceId: 0xacf0c15a7b2e0041
    source: service: {
        service: "endpoint-server"
        export:  "client"
    }
}

The source selector chooses the object or authority to grant. The expectedInterfaceId value is a schema compatibility check against the constructed object, not the authority selector itself. This distinction matters because different objects can implement the same interface.

Transport-Level Capability Lifetime

Cap’n Proto applications do not usually model capability lifetime as an application method on every interface. The RPC transport owns capability reference bookkeeping.

The standard Cap’n Proto RPC protocol is stateful per connection. Each side keeps four tables: questions, answers, imports, and exports. Import/export IDs are connection-local, not global object names. When an exported capability is sent over the connection, the export reference count is incremented. When the importing side drops its last local reference, the transport sends Release to decrement the remote export count. Implementations may batch these releases. If the connection is lost, in-flight questions fail, imports become broken, and exports/answers are implicitly released. Persistent capabilities, when implemented, are a separate SturdyRef mechanism and should not be treated as owned pointers.

References:

This distinction matters for capOS:

  • close() is application protocol. A File.close() method can flush dirty state, commit metadata, or tell a server that a session should end.
  • Release / cap drop is transport protocol. It removes one reference from the caller’s local capability namespace and eventually lets the serving side reclaim the object if no references remain.
  • Process exit is bulk transport cleanup. Dropping the process must release all caps in its table, cancel pending calls, and wake peers waiting on those calls.

capOS therefore needs a system transport layer in the userspace runtime (capos-rt / later language runtimes), not just raw SQE helpers. That transport should own typed client handles, local reference counts, promise-pipelined answers, and broken-cap state. When the last local handle is dropped, it should queue a transport-level release operation that is flushed through the kernel ring at an explicit runtime boundary.

Ordinary handle release is a transport concern, not an application method. The target design: the generated client drops the last local handle (RAII / GC / finalizer), the runtime transport queues CAP_OP_RELEASE, an explicit runtime flush or later ring-client boundary submits it, and the kernel removes the caller’s CapTable slot with mutable access to that table. Encoding ordinary local release as a regular method call on CapabilityManager was rejected because it would mutate the same table used to dispatch the call; CapabilityManager is therefore management-only (list() plus child-scoped revoke(capId), later grant()), not the default release path. CAP_OP_FINISH remains reserved in the same transport opcode namespace for application-level “end of work” signals that the transport must deliver reliably, so the kernel can tell them apart from a truly malformed opcode.

Current status: the kernel dispatches CAP_OP_RELEASE as a local cap-table slot removal and fails closed for stale or non-owned cap IDs. capos-rt bootstrap handles remain explicitly non-owning, while adopted owned handles queue CAP_OP_RELEASE on final drop and expose Runtime::flush_releases() for callers that need to force the queued releases. Result-cap adoption validates the kernel-supplied interface ID before producing an owned typed handle. CAP_OP_FINISH remains reserved and returns CAP_ERR_UNSUPPORTED_OPCODE. Process exit remains the fallback cleanup path for unreleased local slots.

Queued release is not immediate revocation. A dropped runtime handle no longer provides local typed access in that runtime, but the kernel cap-table slot is removed only after the release SQE is flushed and processed, or during process exit cleanup. Security-sensitive flows that need to invalidate authority for other holders or peers must use explicit revoke/epoch semantics such as CapabilityManager.revoke, session expiry, object epochs, or service-specific close/revoke methods; they must not rely on destructor timing.

Session expiry is also not a substitute for every revocation shape. The target session lifecycle model has separate layers:

  • a mutable session liveness cell for live, logged_out, revoked, expired, and recovery_only state behind the immutable process SessionContext;
  • broker grant leases for bundle fragments and elevated or temporary caps;
  • object/facet epochs for invalidating a live target generation.

Renewal acts on the first two layers. It may extend session liveness or mint fresh grant leases, but it must not make old ordinary grants fresh merely because the session renewed. Object/facet revocation remains an independent target-side operation.

Service authors should make this distinction explicit in protocol design:

  • Use ordinary handle drop or runtime flush_releases() only to stop this process from using one local cap slot.
  • Use a service close method when the service must observe application-level shutdown, flush durable state, or publish an orderly end-of-session result.
  • Use CapabilityManager.revoke, session expiry, object epochs, or a service-specific revoke method when existing peers or delegated holders must lose authority before the service proceeds.
  • Treat destructor/finalizer timing as advisory cleanup. It is not a security boundary, and it is not proof that another process has stopped using a cap.

Stale-Handle and Revoke Patterns

Not all kernel cap families use the same model for handling stale or revoked capabilities. The correct pattern depends on the semantics of the object, not on a blanket epoch test. Using the wrong model produces incorrect tests or incorrect behavior expectations.

Category A — Exception-based stale guard

The cap exposes an ensure_*_live guard or an equivalent consumed-state check that returns a stable typed exception (not a silent success) on a stale or consumed cap.

  • UserSession (kernel/src/cap/user_session.rs): info()/auditContext() fail closed with a stable exception message after logout(); second logout is idempotent. Proved by run-ssh-public-key-session.
  • SchedulingContext, CpuIsolationLease: expose an explicit revoke method returning staleGeneration. Subsequent info, bind_caller_thread, activation_preflight, create, and drain_notifications calls fail closed on the staled cap. Proved by run-scheduling-context (demos/scheduling-context-smoke/src/main.rs:285-313, 1129-1141) and run-scheduler-cpu-isolation-lease (demos/cpu-isolation-lease-smoke/src/main.rs:201-237).
  • ThreadHandle (kernel/src/cap/thread_handle.rs): join (sched.rs:1038-1057) returns AlreadyJoined on the second call (hard fail, not silent success) and returns TargetNotLive (sched.rs:1371,1377,1385) if the thread record is absent post-cleanup. exitCode (sched.rs:1418-1420) is a non-consuming idempotent read. The join_or_register consumed-state check is the stale guard; the joined flag is the epoch. Proved by run-thread-lifecycle (demos/thread-lifecycle/src/main.rs:293-298).

Per-cap epoch tests are applicable only to Category A caps.

Category B — Idempotent-stale-target

The cap returns silent success (or a latched result) on a stale target. No ensure_*_live guard is present by design.

  • ProcessHandle (kernel/src/cap/process_spawner.rs): terminate on an already-exited process returns Complete(0); wait re-reads the latched exit code. Writing fail-closed tests for Category B caps would test the opposite of intended behavior.

Category C — Soft-EOF / zero-write

The cap uses v0 ExceptionType policy: closing one side causes the other to drain and receive EOF; writes return zero bytes rather than an error.

  • Pipe (kernel/src/cap/pipe.rs): close causes read to drain + EOF, write returns zero bytes (schema lines 2429-2433). No epoch test needed.

Category D — No revoke verb (kernel singletons)

These caps expose no revoke or close method in the schema. The backing object lives for the process lifetime.

CredentialStore, AuthorizedKeyStore, SshHostKey, EntropySource, SystemInfo, AuditLog, HardwareAuditLog, SessionManager, AuthorityBroker, RestrictedLauncher, BootPackage. Nothing to test for stale-handle behavior.

Category E — DDF caps with release/scrub semantics

These caps use internal handle epoch validation. The full stale-handle behavior for each requires targeted per-cap investigation when a behavior gap is identified.

DmaBuffer, DeviceMmio, Interrupt.

Open residuals

  • UserSession expiry path (Category A): the expiresAtMs/anonymousMs- driven expiry path is not yet covered by a focused smoke. run-ssh-public-key-session covers the explicit logout() close-side path. Note that run-session-context is flaking on TCG-only hosts — a stability fix is needed before that smoke can be strengthened.

Access Control: Interfaces, Not Rights Bitmasks

capOS deliberately does not use a rights bitmask (READ/WRITE/EXECUTE) on capability entries, despite this being standard in Zircon and seL4. The reason is that Cap’n Proto typed interfaces already serve as the access control mechanism, and a parallel rights system creates an impedance mismatch.

Why rights bitmasks exist in other systems: Zircon and seL4 use rights because their syscall interfaces are untyped – a handle is an opaque reference to a kernel object, and the kernel needs something to decide which fixed syscalls are allowed. capOS has typed interfaces where the .capnp schema defines exactly what methods exist.

capOS’s approach: the interface IS the permission. To restrict what a caller can do, grant a narrower capability:

  • Fetch (full HTTP) → HttpEndpoint (scoped to one origin)
  • Store (read-write) → Store wrapper that rejects write methods
  • Namespace (full) → Namespace scoped to a prefix

The “restricted” capability is a different CapObject implementation that wraps the original. The kernel doesn’t know or care – it dispatches to whatever CapObject is in the slot. Attenuation is userspace/schema logic, not a kernel mechanism.

Session transfer scope: capability holds now carry reference-level transfer scope. same_session caps cannot move into another process session through raw IPC, endpoint return, or spawn grants. cross_session_shareable caps may cross and then invoke under the receiver process session. service_regrant_only caps require a trusted fixed-session broker/launcher path. These meta-rights are about the reference, not the referenced object, and do not overlap with interface-level method access control.

Non-writable filesystem caps are forwardable to a same-session child; writable caps are not. Directory/File caps are minted Copy/same_session at the read-only and RAM mint sites, so a holder can forward an opened directory or file to a ProcessSpawner.spawn child within the same session – the kernel handoff that backs POSIX fd inheritance across fork/execve. The security argument is the same for all of them: the child gains no authority the parent does not already hold, same_session keeps the cap from escaping the session, and the spawn-grant epoch wrapper keeps a forwarded child cap from outliving a revoked parent. Two flavours exist:

  • Read-only views – the read-only filesystem (readonly_fs) and the packaged-image source (installable_image), plus their read_only_fs_root/ installable_image_source bootstrap roots. Their interfaces fail closed on every mutation, so forwarding shares a pure read view. Here the interface is the permission makes the share unambiguously benign.
  • The holder’s own RAM scratch namespace – the directory::transfer_result_cap results and the kernel:directory/kernel:file bootstrap sources (via boot_cap_hold). This Directory/File interface includes mutation methods, so the forwarded cap is shared read/write with the child, not a read view. It is still safe to forward because it is the parent’s own scratch tree shared within one session, not a privilege the parent lacked.

The disk-backed writable filesystem (writable_fs) is a distinct CapObject type minted NonTransferable: a writable cap carries the filesystem-wide single-writer claim, so forwarding it would let two processes hold that claim. The ProcessSpawner Raw/Move grant modes reject a NonTransferable source, so the single-writer policy is preserved by the mint-time mode rather than a separate check. Proven by make run-spawn-grant-directory.

TerminalSession is forwardable to a same-session child, parent-retained. The bootstrap TerminalSession cap is minted Copy/same_session (matching Console) in boot_cap_hold, so a holder can forward its terminal-backed stdout/stderr to a ProcessSpawner.spawn child without losing its own terminal. TerminalSessionCap is a stateless unit struct: write/writeLine dispatch onto the shared kernel terminal and readLine resolves the caller’s session context per call (requires_live_caller_session stays true), so there is no per-session ownership state to strip on a forward. The child gains no terminal authority the parent did not already hold, and same_session keeps the cap from escaping the session. This is the non-destructive capability-model realization of POSIX “all children share the controlling tty”; the prior Move/service_regrant_only mint was a policy default, not a state-ownership requirement, and a destructive Move would have stripped a shell of its terminal on its first child spawn under full fd inheritance. Two writers reaching the same terminal serialize at the shared kernel UART; sub-line interleaving between a parent and a child writing concurrently is an accepted research-surface limitation, not an authority leak. Proven by make run-posix-terminal-forward.

See research survey for the cross-system analysis that led to this decision (§1 Capability Table Design).

Planned Enhancements (from research)

Tracked in Roadmap Stages 5-6:

  • Legacy badge / receiver selector – the current storage field is a u64 per capability hold edge, delivered to endpoint servers on invocation. Existing code still calls it a badge because it began as seL4-style client identity metadata. The active model keeps that field out of service identity: new service capability should use one immutable process session, broker-granted service roots/facets, privacy-preserving endpoint caller-session metadata, and explicit subject disclosure plus a matching disclosure scope when a service needs more than an opaque service-scoped session reference.
  • Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.

Current Limitations

  • Process-ring blocking remains process-level; private ParkSpace waits are per-thread. cap_enter(min_complete, timeout_ns) processes pending SQEs and can block one admitted thread per process until enough CQEs exist or a finite timeout expires. That ring wait is still process-owned and does not make the capability ring itself a per-thread completion queue. Separately, the implemented private ParkSpace path provides process-local per-thread wait/wake on userspace words through compact CAP_OP_PARK/CAP_OP_UNPARK operations. SharedParkSpace park-words and runtime safe park clients remain future work.
  • No persistence. Capabilities exist only at runtime.
  • Capability transfer is implemented for Endpoint CALL/RECV/RETURN. Transfer descriptors on the capability ring let callers and receivers copy or move transferable local caps through IPC messages. Delivery also enforces the cap hold’s session transfer scope; an unsupported cross-session transfer fails with CAP_ERR_TRANSFER_NOT_SUPPORTED and is reported to the caller instead of being requeued to the endpoint. See Storage and Naming “IPC and Capability Transfer” for the full design.
  • Transfer ABI (3.6.0 draft). Sideband transfer descriptors are defined in capos-config/src/ring.rs as CapTransferDescriptor:
    • cap_id is the sender-side local capability-table handle.
    • transfer_mode is either CAP_TRANSFER_MODE_COPY or CAP_TRANSFER_MODE_MOVE.
    • xfer_cap_count in CapSqe is the descriptor count.
    • For CALL/RETURN, descriptors are packed at addr + len after the payload bytes and must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
    • Result-cap insertion semantics are defined by CapCqe: result reports normal payload bytes, while cap_count reports how many CapTransferResult { cap_id, interface_id } records were appended immediately after those payload bytes in result_addr when CAP_CQE_TRANSFER_RESULT_CAPS is set. User space must bound-check result + cap_count * CAP_TRANSFER_RESULT_SIZE against its requested result_len.
    • Future promise pipelining must target that sideband result-cap namespace: pipeline_dep names a process-local promised answer, and pipeline_field is a zero-based CapTransferResult record index in that answer’s completion. It is not a Cap’n Proto schema field number; the kernel must not traverse opaque result payload bytes to find a capability.
    • Transfer-bearing SQEs are fail-closed:
      • unsupported transfer scope or object class: CAP_ERR_TRANSFER_NOT_SUPPORTED,
      • malformed descriptor metadata (invalid mode, reserved bits, non-zero _reserved0, misalignment, overflow): CAP_ERR_INVALID_TRANSFER_DESCRIPTOR,
      • all other reserved-field misuse remains CAP_ERR_INVALID_REQUEST.
  • Revocation propagates through object epochs. CapabilityManager.revoke invalidates child-local grant copies for the revoked object, and the ring maps revoked ordinary and endpoint use to typed Disconnected exceptions where a result buffer exists. Broader supervision/restart policy remains future work.
  • MemoryObject is the mapped bulk-data substrate. FrameAllocator returns owned MemoryObject result caps instead of raw physical addresses. The object exposes metadata plus caller-local map/unmap/protect operations for page-aligned ranges. File I/O, networking, GPU data planes, and zero-copy IPC still need service-level SharedBuffer operations built on this substrate. See Storage and Naming “Shared Memory for Bulk Data” for the broader interface design.

Future Directions

  • Broader capability-bearing services. Endpoint CALL/RECV/RETURN already carry copy/move sideband transfer descriptors and install result caps in the receiver’s local table. Remaining work is to use that transport in higher service layers: capability-bearing naming and persistence services, Directory/File and Namespace-style object models, promise pipelining over result-cap indexes, and policy for durable references. See Storage and Naming.
  • Persistence. Persistent object references should be restored through a capability-bearing naming or persistence service that can authorize the request and mint a fresh live object. Do not serialize local cap-table handles, endpoint generations, receiver selectors, or server cookies as durable authority.
  • Network transparency. Remote capability transport should use connection-local export/import tables and explicit disconnect semantics. A remote Console capability can expose the same typed interface as a local one, but the portable authority is the live object reference, not a global URL or serialized local routing selector.

ABI Evolution Policy

This policy governs externally visible capOS ABIs:

  • Cap’n Proto schema in schema/capos.capnp.
  • Generated schema bindings checked by make generated-code-check.
  • Ring and bootstrap ABI constants and layouts in capos-config/src/ring.rs, capos-config/src/capset.rs, and capos-abi/src/lib.rs.
  • Debug/log formats only when a document explicitly declares them stable.

The current project is still a research tree, not a released platform with a public compatibility promise. Even so, schema and ring changes must follow this policy before external clients, host tools, or out-of-tree runtimes depend on them.

Design Grounding

This policy is grounded in current capOS docs and the checked-in prior-art notes that apply to schema and transport evolution:

  • docs/architecture/capability-ring.md for the implemented process-wide ring, fixed 64-byte CapSqe, fixed 32-byte CapCqe, opcode boundary, and current completion semantics.
  • docs/proposals/ring-v2-smp-proposal.md for the undecided future per-thread-ring version-negotiation shape.
  • docs/proposals/error-handling-proposal.md for the transport/application error split and unsupported-operation behavior.
  • docs/trusted-build-inputs.md for generated-code drift checks and pinned Cap’n Proto tooling.
  • docs/design-risks-register.md for the prior open ABI compatibility and Ring v2 compatibility questions.
  • docs/research/capnp-error-handling.md for Cap’n Proto exception and schema error-model precedent. OS scheduling, filesystem, networking, and hardware prior-art research does not directly change this schema/ring ABI policy.

Compatibility Classes

Every ABI change must name one class in its task, review, or commit message.

ClassMeaningRequired handling
Compatible additionExisting clients keep working without recompilation or behavior change.Add tests or generated-code drift evidence. Update docs when semantics matter.
Compatible tighteningExisting malformed or previously unspecified inputs fail earlier or more specifically.Document the rejected shape and expected error. Add hostile coverage when reachable from userspace.
Soft deprecationOld shape still works, but new callers should stop using it.Mark the field/method/opcode as deprecated in docs and keep a replacement path live through the deprecation window.
Breaking changeExisting valid clients can fail, observe different semantics, or require regenerated code.Requires a proposal or backlog plan, migration notes, compatibility proof or explicit break decision, and task/risk updates when relevant.
Internal-onlyNot visible outside one crate or generated artifact and not serialized, mapped, or invoked across a boundary.Normal code review; do not label serialized or mapped data as internal-only.

Cap’n Proto Schema Rules

Schema interface IDs, method ordinals, struct field ordinals, enum discriminants, union tags, and named constants are stable once checked in.

Allowed compatible changes:

  • Add a new field with a new ordinal and a default value that old readers can safely ignore.
  • Add a new method with a new ordinal when old clients do not need it.
  • Add a new result union arm only when old clients already treat unknown or unsupported domain outcomes as a controlled failure.
  • Add a new interface or struct with a fresh ID/name.
  • Add documentation that narrows previously undocumented behavior without changing wire compatibility.

Disallowed without a breaking-change plan:

  • Reuse a removed field, method, enum, or union ordinal.
  • Change the meaning, type, units, authority, or lifetime of an existing field.
  • Rename a schema item when generated code or logs expose the old name as a public integration surface.
  • Make an optional/defaulted field mandatory for existing callers without a versioned fallback.
  • Replace a schema result union with a transport error or vice versa without an error-layer migration note.

Removed schema space stays reserved. If a field or method is retired, leave a comment at the old ordinal explaining why it is reserved and where the replacement lives.

Ring ABI Rules

The ring ABI is a fixed-layout shared-memory contract. CapSqe, CapCqe, ring header fields, opcodes, flags, transfer descriptor layout, CQE result codes, and fixed virtual addresses are kernel/userspace ABI.

Rules for the current process-wide ring:

  • Do not change the size, alignment, byte order, or meaning of an existing ring struct field without a breaking-change plan.
  • Preserve objective layout checks for current ABI structs. At minimum, capos-config/src/ring.rs must keep compile-time checks for CapSqe, CapCqe, CapTransferDescriptor, endpoint caller-session metadata, endpoint message headers, and ring capture records. Any new negotiated ring layout must add equivalent checked constants for SQE size, CQE size, transfer descriptor size, ring header offsets, SQE/CQE array offsets, and feature/version fields.
  • Do not change SQE_ARRAY_OFFSET, CQE_ARRAY_OFFSET, SQ_ENTRIES, CQ_ENTRIES, RING_VADDR, or fixed SQE/CQE sizes by arithmetic side effect. A change to any of those values is a layout change and must name its compatibility class.
  • Reserved SQE fields must be rejected unless the opcode explicitly defines them. New meanings for reserved fields require hostile tests that old kernels fail closed.
  • New opcodes must start as reserved or unsupported. A reserved opcode should return CAP_ERR_UNSUPPORTED_OPCODE; malformed non-reserved opcodes should return CAP_ERR_INVALID_REQUEST.
  • New flags must specify whether old kernels reject them, ignore them, or treat them as malformed. Silent ignore is allowed only for flags that cannot carry authority or resource effects.
  • New negative CQE result codes must be appended as new constants. Existing negative result codes cannot be renumbered or repurposed.
  • Capability transfer descriptors must continue to reject unknown reserved bits until a documented transfer mode consumes them.

Ring v2 or per-thread-ring work must declare whether it is:

  • a negotiated compatible extension to the current ring page;
  • a new ring layout selected by boot/runtime version negotiation; or
  • an intentional ABI break.

That decision belongs in the Ring v2 proposal/backlog before implementation.

Version Negotiation

When an ABI cannot be evolved by compatible addition, introduce an explicit version gate instead of inferring compatibility from struct size or accidental behavior.

Acceptable gates include:

  • manifest or boot-package schemaVersion fields;
  • a future runtime boot-info field that names ring layout and feature bits;
  • interface methods that return a structured unsupported-version result;
  • manifest/tooling checks that reject unsupported data versions before boot.

Unsupported versions must fail closed with a stable, documented error. A client must not need to parse debug text to distinguish “unsupported version” from “malformed input”.

Deprecation Window

Before external consumers exist, a deprecation may be removed after the replacement path, docs, and smokes land in main.

After external consumers are declared for an ABI, deprecated schema or ring surfaces must remain for at least one full selected milestone after the replacement is documented and tested. Removing them earlier is a breaking change and must be called out as such.

Deprecation notes must name:

  • the old field, method, opcode, flag, or constant;
  • the replacement;
  • the last proof target that still exercises the old shape;
  • the planned removal condition.

Review Gates

Schema or ring ABI changes must include the relevant checks:

  • make generated-code-check for schema/capos.capnp changes.
  • cargo test-config for manifest/schema validation changes.
  • cargo test-ring-loom for ring queue protocol changes.
  • Compile-time layout assertions and host tests for ring struct size, alignment, offsets, entry counts, and fixed virtual addresses when a ring layout changes.
  • cargo test-lib for CapTable/capability transfer semantics.
  • A focused QEMU smoke when a userspace-visible behavior changes.
  • make docs for policy or manual changes.

Reviewers should reject ABI changes that lack a compatibility class, migration notes for breaking behavior, or an unsupported-version/error story for new version gates.

Current Open ABI Decisions

  • Ring v2 backward compatibility remains undecided. Until it is decided, do not claim per-thread rings are compatible with the current process-wide ring.
  • Production release reproducibility remains separate from ABI compatibility. Final ISO, manifest, and embedded ELF checksums are tracked in docs/trusted-build-inputs.md and relevant task records.

Capability Ring

The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.

The current error model is documented in Error Handling. Ring CQE status values report transport failures; typed capability exceptions and ordinary schema result unions sit above that transport layer.

Current Behavior

Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page contains a volatile header, a 16-entry submission queue, and a 32-entry completion queue. Userspace writes CapSqe records, advances sq_tail, and uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.

sequenceDiagram
    participant U as Userspace runtime
    participant R as Ring page
    participant K as Kernel ring dispatcher
    participant C as Capability object
    U->>R: write CapSqe and advance sq_tail
    U->>K: cap_enter(min_complete, timeout_ns)
    K->>R: read sq_head..sq_tail
    K->>K: validate SQE fields and lock AddressSpace for user buffers
    K->>C: call method or endpoint operation
    C-->>K: completion, pending, or error
    K->>R: write CapCqe and advance cq_tail
    K-->>U: return available CQE count
    U->>R: read matching CapCqe

Timer polling also processes each current process’s ring before preemption, but only non-CALL operations and CALL targets that explicitly allow interrupt dispatch may run there. Ordinary CALLs wait for cap_enter.

Why ordinary CALL waits for cap_enter: Submitting a CALL SQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects. cap_enter is the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited to exit and cap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.

Design

CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table slot and method ID plus parameter/result buffers. CAP_OP_RECV and CAP_OP_RETURN implement endpoint IPC. CAP_OP_RETURN normally returns successful result bytes to the original caller; with CAP_SQE_RETURN_APPLICATION_EXCEPTION, its payload is a serialized CapException and the original caller completes with CAP_ERR_APPLICATION_EXCEPTION or the truncated application-exception code. CAP_OP_RELEASE removes a local cap-table slot through the transport. CAP_OP_CANCEL (opcode 6) cancels a pending endpoint receive posted by the same process on the same endpoint cap; pipeline_dep carries the receive SQE’s user_data. CAP_OP_NOP measures the fixed ring path. CAP_OP_PARK_BENCH (opcode 7) is a measurement-only compact opcode dispatched only by kernels built with the measure feature; normal kernels reject it as malformed. CAP_OP_FINISH is ABI-reserved and currently returns CAP_ERR_UNSUPPORTED_OPCODE.

CAP_OP_RELEASE is deliberately scoped to local transport cleanup. It removes one holder’s cap-table slot after the SQE is processed, or as part of process exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or stand in for an application close method. Services that need security-visible invalidation must use an explicit control path such as CapabilityManager.revoke, session expiry, object epochs, or a service-specific close/revoke protocol. Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or queued release flushing as local-cleanup claims, not revocation claims.

Opcode boundary: Ring opcodes are kernel ABI, not a loophole around the syscall surface. cap_enter and exit remain the CPU trap entrypoints, but every accepted authority-bearing or resource-mutating CAP_OP_* still adds distinct kernel semantics that must pass the capability method / ring opcode / syscall decision graph. No-authority diagnostics such as CAP_OP_NOP are still kernel ABI and must stay side-effect-free and review-visible, but they are not resource authority paths. CAP_OP_PARK and CAP_OP_UNPARK are justified because blocking wait mutates scheduler state, must be thread-owned on the process ring, reserves completion credit for later wake/timeout delivery, and needs compact capability-authorized hot-path framing. They are not a precedent for moving ordinary object methods into the opcode table for convenience.

CAP_OP_CALL may set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id. If another thread drains the shared process ring first, the kernel leaves that SQE at the head instead of consuming it and returns a distinct owner-head cap_enter result instead of blocking the non-owner behind it. This is limited to context-sensitive self-thread operations such as ThreadControl.exitThread; ordinary runtime submissions leave call_id = 0.

CAP_OP_PARK and CAP_OP_UNPARK are compact capability-authorized operations for process-local ParkSpace. Wait SQEs must set CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id; a non-owner cap_enter leaves the SQE at the head just like a thread-owned CALL. They reject promise-pipeline fields and run only from syscall-context ring dispatch, not timer polling. A blocking wait consumes the SQE but posts no caller CQE immediately; instead it reserves one waiter CQE credit, parks the current thread, and later completes with a non-negative park status. Ordinary CQE posting treats reserved park credits as unavailable so wake and timeout delivery cannot lose waiter completions.

The kernel copies user params into preallocated per-process scratch, dispatches capability methods, copies serialized results into caller-provided result buffers, and posts CapCqe. Current-process user copies and transfer-descriptor loads hold the caller’s AddressSpace mutex across permission validation and the actual HHDM-backed copy/read. A successful method returns non-negative bytes written. Transport failures are negative CAP_ERR_* codes. Application exceptions are serialized CapException payloads with CAP_ERR_APPLICATION_EXCEPTION. Ordinary capability implementation errors and live endpoint CALL/RETURN target errors use this application-exception path once a valid target cap or accepted endpoint relationship has been identified; malformed ring metadata, bad user buffers, lookup failures, and endpoint rollback/transfer failures stay in the transport namespace.

Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records after the params/result payload. Successful result-cap transfers append CapTransferResult records after normal result bytes.

Promise-pipelined CALLs remain rejected by current kernels. When that flag is enabled, pipeline_dep names a process-local promised-answer identifier, and pipeline_field selects a zero-based CapTransferResult record from that answer’s completion. It is not a Cap’n Proto schema field number or payload path. The kernel resolves dependencies only through the sideband result-cap records it already owns; normal result bytes stay opaque to the transport.

Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.

Choosing A Capability Method, Ring Opcode, Or Syscall

New kernel functionality should default to a normal typed capability method. The small syscall surface is only the trap surface; the ring opcode table is also a reviewed kernel ABI and must stay narrow. The decision tree below is a full-page reference in the PDF because the branches are easier to read at diagram scale than as compressed prose.

flowchart TD
    Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
    Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
    Trap -- yes --> Syscall[Consider a syscall]
    Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
    Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
    CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
    CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
    Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
    Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
    RingSpecific -- no --> Method
    RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
    Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
    Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]

Use a normal capability method when the operation is control plane, policy driven, service-specific, infrequent, or naturally represented by Cap’n Proto params/results. Process spawning, credential checks, storage naming, shell or network policy, virtual-memory control-plane calls, and most device-specific commands belong here unless measurement and design review prove otherwise.

Consider a compact ring opcode only when all of these are true:

  • The operation is a hot path or scheduler path where generic Cap’n Proto framing is materially wrong.
  • The operation has a small, stable field layout that fits the existing SQE/CQE model without per-interface ad hoc extensions.
  • It needs ring-specific behavior such as thread ownership, reserved completion credit, CQ ordering/backpressure, asynchronous completion delivery, or interaction with the process ring head.
  • It remains authorized by a held capability in cap_id, not by ambient process identity or guessed kernel object names.
  • It cannot be handled as a normal capability method plus a future generated fast client without losing an essential scheduler or transport invariant.

Consider a new syscall only when the operation is about entering or leaving the kernel execution context itself and cannot sensibly be authorized by a capability already available to the process. That bar is intentionally higher than the opcode bar. Ordinary resource operations should not become syscalls just because they are common.

Full-SMP Direction

The current process-wide ring is not the target ABI for full SMP. Once sibling threads in one process can run on different CPUs, a shared process CQ would force userspace to serialize completion consumption or the kernel to invent specific-wait state on top of circular-buffer slots.

The selected future direction is per-thread ring ownership, documented in Ring v2 For Full SMP. In that model, cap_enter(min_complete, timeout_ns) keeps its current aggregate wait shape, but the aggregate is the current thread’s CQ. Completion paths post by generation-checked ThreadRef, while result-cap transfers and authority still belong to the process cap table.

The first Ring v2 implementation should use kernel-chosen child-thread ring mappings. The initial fixed RING_VADDR mapping becomes a compatibility special case backed by the same RingEndpoint lifetime and waiter rules as child-thread rings. Runtime-supplied ring address ranges are deferred until VirtualMemory can reserve a ring arena without racing ordinary mappings.

The initial Phase C multi-CPU scheduler proof may continue to use the current process-wide ring as long as userspace serializes ring consumption. Ring v2 is the target for full SMP with sibling threads from one process running and waiting independently on different CPUs.

A runtime reactor can bridge the current process-wide ring for multithreaded runtimes before Ring v2: one runtime-owned drainer consumes the process CQ, matches completions by user_data, and wakes waiting threads through ParkSpace. That bridge is not the full-SMP kernel ABI.

Invariants

  • SQ and CQ sizes are powers of two and fixed by the ABI.
  • Unknown opcodes fail closed; FINISH is reserved, not silently accepted.
  • Reserved fields must be zero for currently implemented opcodes, except CAP_SQE_THREAD_OWNED CALL and PARK SQEs may carry the owning thread id in call_id.
  • Park PARK/UNPARK SQEs must keep unsupported fields zero and must not be dispatched from timer context.
  • cap_enter rejects min_complete > CQ_ENTRIES.
  • User-buffer validation and copy/read must hold the owning process AddressSpace mutex for CALL params/results, RECV result buffers, RETURN payloads, transfer descriptors, and deferred same-process completions.
  • Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
  • Per-dispatch SQE processing is bounded by SQ_ENTRIES.
  • Transfer descriptors must be aligned, valid, and bounded by MAX_TRANSFER_DESCRIPTORS.
  • Promise-pipelined dependency resolution must use sideband CapTransferResult ordinals, never general Cap’n Proto result traversal in the kernel.

Code Map

  • capos-config/src/ring.rs - shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.
  • kernel/src/cap/ring.rs - kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.
  • kernel/src/arch/x86_64/syscall.rs - cap_enter syscall.
  • kernel/src/sched.rs - timer polling, cap-enter blocking, direct IPC wake.
  • kernel/src/process.rs - ring page allocation and mapping.
  • capos-rt/src/ring.rs - runtime ring client, pending calls, transfer packing, result-cap parsing.
  • capos-rt/src/entry.rs - single-owner runtime ring client token and release queue flushing.
  • capos-config/tests/ring_loom.rs - bounded producer/consumer model.

Validation

  • cargo test-ring-loom validates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.
  • make run exercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.
  • make run-measure exercises measurement-only counters, dispatch segment cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic comparison, and the real ParkSpace blocked/resume timing path.
  • cargo test-config covers shared ring layout and helper invariants.
  • make capos-rt-check checks userspace runtime ring code under the bare-metal target.

Open Work

  • Implement CAP_OP_FINISH as part of the system Cap’n Proto transport.
  • Implement promise pipelining using the reserved pipeline_dep answer ID and pipeline_field result-cap ordinal mapping.
  • Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
  • Add runtime-level ParkSpace wrappers and completion demultiplexing on top of the compact opcodes.
  • Add the runtime reactor bridge for multithreaded use of the current process ring, then replace it as the kernel fast path with per-thread Ring v2 completion ownership.
  • Add SQPOLL after SMP gives the kernel a spare execution context.

Error Handling

capOS uses three error layers for capability invocation. Keeping the layers separate prevents malformed transport state from looking like a service-domain decision, and prevents ordinary business outcomes from becoming generic kernel exceptions.

Current Model

LayerCarrierUse
Transport statusNegative CapCqe.result codesRing, opcode, lookup, buffer, transfer, and dispatch failures where no safe typed payload boundary exists.
Capability exceptionSerialized CapException plus CAP_ERR_APPLICATION_EXCEPTION or CAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDCapability-level infrastructure failures after a target capability or accepted endpoint relationship exists.
Schema result unionInterface-specific result payloadExpected service or domain outcomes such as not-found, denied-by-policy, conflict, invalid domain input, or accepted/rejected business results.

Transport failures are intentionally small and mechanical. Examples include a bad SQE layout, an invalid params or result buffer, an unsupported opcode, a malformed transfer descriptor, or a capability lookup that fails before a live target object is identified.

Capability exceptions are for infrastructure failures at a valid capability boundary: target gone, target overloaded, method unimplemented, argument value rejected by the documented capability contract, or a target-side invariant failure. The exception message is diagnostic and must not carry kernel pointers, secret bytes, or unrelated process-private state.

Schema result unions are the normal application surface. A filesystem notFound, service-level permissionDenied, ordinary conflict, or accepted conditional rejection belongs in the interface result, not in CapException.

Current Transport Namespace

The ring transport uses signed 32-bit completion results. Non-negative values are opcode-specific successes. Negative values are defined in capos-config/src/ring.rs:

CodeNameMeaning
-1CAP_ERR_INVALID_REQUESTMalformed request metadata or a non-reserved opcode value.
-2CAP_ERR_INVALID_PARAMS_BUFFERParams buffer is unmapped, out of range, or unreadable.
-3CAP_ERR_INVALID_RESULT_BUFFERResult buffer is unmapped, out of range, or unwritable.
-4CAP_ERR_INVOKE_FAILEDLookup or dispatch failed before a successful typed result was produced.
-5CAP_ERR_UNSUPPORTED_OPCODEOpcode is reserved but not dispatched by this kernel.
-6CAP_ERR_TRANSFER_NOT_SUPPORTEDTransfer mode or descriptor layout is recognized but unsupported.
-7CAP_ERR_INVALID_TRANSFER_DESCRIPTORTransfer descriptor layout is malformed or carries reserved bits.
-8CAP_ERR_TRANSFER_ABORTEDTransfer transaction failed without committing partial capability state.
-9CAP_ERR_APPLICATION_EXCEPTIONA structured CapException was written to the result buffer.
-10CAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDAn exception occurred, but no complete detail fit in the result buffer.

Capability Exceptions

schema/capos.capnp defines ExceptionType and CapException. The current exception kinds are Failed, Overloaded, Disconnected, Unimplemented, and the capOS-specific InvalidArgument.

The kernel serializes ordinary capability implementation errors through kernel/src/cap/ring.rs. capos-rt/src/client.rs decodes application-exception CQEs into ClientError::Application(ApplicationException). The runtime treats Disconnected as a broken local handle.

A path should produce CapException only when all of these are true:

  • a live target capability was identified, or an endpoint operation is acting on an already accepted call, receive, or return relationship;
  • the failure is attributable to capability semantics rather than malformed ring metadata;
  • the affected caller supplied a result buffer large enough to receive the serialized exception, otherwise the result is the truncated exception code.

Endpoint RETURN

Endpoint RETURN is asymmetric because the result belongs to the original caller, not the returning receiver. A server can set CAP_SQE_RETURN_APPLICATION_EXCEPTION on CAP_OP_RETURN to return a serialized CapException to the caller. The server’s own RETURN completion reports only whether the return transport succeeded.

Revoked endpoint RETURN also reports Disconnected to the original caller when that caller supplied a result buffer. Receiver-side lookup and CQ-space failures that cannot be tied to the caller’s result buffer remain transport failures.

Code Map

  • capos-config/src/ring.rs - transport error constants, SQE/CQE layout, and endpoint transport flags.
  • schema/capos.capnp - ExceptionType, CapException, and per-interface result unions.
  • kernel/src/cap/ring.rs - exception serialization, ring dispatch, endpoint RETURN exception handling, and InvalidArgument sentinel mapping.
  • kernel/src/cap/endpoint.rs - endpoint queue, in-flight call, and revoked endpoint state.
  • capos-rt/src/client.rs - runtime decoding into ClientError.
  • docs/architecture/capability-ring.md - ring ABI and opcode dispatch rules.
  • docs/architecture/ipc-endpoints.md - endpoint CALL/RECV/RETURN transport.

Validation

  • make run-spawn covers cross-process endpoint RETURN propagation for Failed, Overloaded, and Unimplemented, plus reserved opcode and no-result-buffer exception paths.
  • make run-smoke covers same-process endpoint use and revoked-cap behavior.
  • cargo test-lib covers cap-table stale-slot and transfer rollback behavior that the transport error paths depend on.
  • cargo test-ring-loom covers ring queue behavior that completion delivery depends on.

Open Work

  • Promise pipelining and future multishot/link/drain ring behavior must carry the same three-layer error split.
  • Long-lived services should prefer stable result-union variants over generic text errors for ordinary domain outcomes.
  • Future external clients need compatibility rules for exception taxonomy evolution once the ABI is treated as cross-version or separately released.

Design Grounding

The archival decision record is Error Handling. Relevant research notes are Cap’n Proto Error Handling and OS Error Handling.

IPC and Endpoints

Endpoints let one process serve capability calls to another process without adding a separate IPC syscall surface. The same ring transport carries ordinary kernel capability calls and cross-process endpoint calls.

Current Behavior

An Endpoint is a kernel capability object with queues for pending client calls, pending server receives, and in-flight calls awaiting RETURN. A service that owns the raw endpoint can receive and return. Importers receive a ClientEndpoint facet that can CALL but cannot RECV or RETURN.

sequenceDiagram
    participant Client
    participant ClientRing as Client ring
    participant Endpoint
    participant ServerRing as Server ring
    participant Server
    Server->>ServerRing: submit RECV on raw endpoint
    Client->>ClientRing: submit CALL on client facet
    ClientRing->>Endpoint: deliver params and caller result target
    Endpoint->>ServerRing: complete RECV with EndpointMessageHeader and params
    ServerRing-->>Server: cap_enter returns completion
    Server->>ServerRing: submit RETURN with call_id and result
    ServerRing->>Endpoint: take in-flight target
    Endpoint->>ClientRing: post caller CQE with result and receiver metadata
    ClientRing-->>Client: wait returns matching completion

If a CALL arrives before a RECV, the endpoint queues bounded params. If a RECV arrives before a CALL, the endpoint queues the receive request. Delivered calls move into the in-flight queue until the server returns or cleanup cancels them.

Design

Endpoint IPC is capability-oriented. The manifest can export a raw endpoint from one service; importers get a narrowed client facet. This keeps server-only authority out of clients without introducing rights bitmasks.

CALL and RETURN may carry sideband transfer descriptors. Copy transfers insert a new cap into the receiver while preserving the sender. Move transfers reserve the sender slot, insert the destination, then remove the source on commit. RETURN-side transfers append result-cap records after the normal result payload. Cross-session delivery is additionally checked against the cap hold transfer scope: same-session caps fail closed, cross-session-shareable caps may cross, and service-regrant-only caps need a trusted fixed-session regrant path. CALL SQEs may also request field-granular session disclosure. The kernel intersects that request with the invoked cap’s disclosure scope before delivering any subject fields, so a request without scope or scope without a request exposes only the default opaque caller-session metadata.

Legacy receiver metadata is stored on cap-table hold edges and delivered to servers with endpoint invocation metadata, so one endpoint can distinguish transitional callers without one object per caller. Some ABI structs still name this field badge; that name is compatibility state, not the normal shared-service authority model. Session-bound invocation context is the replacement model for normal workload paths: every normal process has one immutable session context, endpoint calls expose privacy-preserving caller-session metadata by default, and shared services derive user-facing state from broker-granted capabilities plus service-scoped session references. See Session Context.

Delegated Client Relabeling Containment

The Gate 0 containment rule is narrow: a process that holds an imported ClientEndpoint may delegate that same client identity, but it may not mint a sibling identity by setting another legacy badge during spawn. Endpoint owners and explicit trusted mint paths remain transitional mechanisms for low-level tests. Normal shared services use broker-granted roots/facets plus session-bound invocation context instead of service-object badges.

Normal capos-shell help and smoke expectations must therefore omit arbitrary badge N launch examples. Omitted shell badge syntax preserves the source identity instead of selecting badge zero. Legacy badge syntax may remain reachable only as a debug or hostile-test input, and QEMU coverage for the Telnet blocker must prove both explicit client @name badge N and low-level legacy badge-zero relabel encodings from a nonzero delegated client facet fail closed.

Shell-serviced stdio bridges now bind the active child wait to the first opaque live caller-session reference seen on the bridge endpoint. A later call from a different live caller session is answered with an empty result and the child is terminated; transferred caps are released before either normal transfer rejection or caller-session rejection returns. Normal StdIO.close is treated as a clean child close rather than a security rejection.

Future IPC should add notification objects for lightweight signaling and promise pipelining for Cap’n Proto-style dependent calls.

Invariants

  • Only raw endpoint holders may RECV or RETURN.
  • Imported endpoint caps are ClientEndpoint facets and must reject RECV and RETURN from userspace.
  • Delegating an imported client facet must preserve its server-visible object identity. Only endpoint owners or explicit trusted mint paths may create sibling client identities, and normal services should not treat that identity as user/session authority.
  • Endpoint queues are bounded by call count, receive count, in-flight count, per-call params, and total queued params.
  • Each in-flight call has a kernel-assigned non-zero call_id.
  • CALL delivery copies params into kernel-owned queued storage before the caller can resume.
  • Move transfer commit must not leave both source and destination live.
  • Transfer rollback must preserve source authority if destination insertion or result delivery fails.
  • Process exit must cancel queued state involving that pid and wake affected peers when possible.

Code Map

  • kernel/src/cap/endpoint.rs - endpoint queues, client facet, call IDs, cancellation by pid.
  • kernel/src/cap/ring.rs - endpoint CALL/RECV/RETURN dispatch, result copying, deferred cancellation CQEs.
  • kernel/src/cap/transfer.rs - transfer descriptor loading and transaction preparation.
  • capos-lib/src/cap_table.rs - cap-table transfer primitives and rollback.
  • kernel/src/cap/mod.rs - manifest export resolution and client-facet construction.
  • capos-config/src/ring.rs - EndpointMessageHeader, transfer descriptors, transfer result records, endpoint opcodes.
  • demos/capos-demo-support/src/lib.rs - endpoint, IPC, transfer, and hostile IPC smoke routines.
  • demos/endpoint-roundtrip, demos/ipc-server, demos/ipc-client - QEMU smoke binaries.
  • demos/ipc-zerocopy-producer, demos/ipc-zerocopy-consumer - QEMU smoke for the multi-message shared-buffer zero-copy IPC pattern.

Validation

  • make run-smoke validates same-process endpoint RECV/RETURN, cross-process IPC, endpoint exit cleanup, legacy badged calls, transfer success/failure paths, and clean halt.
  • make run-spawn validates init-spawned endpoint-roundtrip, server, and client processes.
  • make run-memoryobject-shared validates a one-shot shared-buffer handoff over an endpoint cap transfer.
  • make run-ipc-zerocopy validates the multi-message zero-copy IPC pattern at the substrate level: the producer transfers one MemoryObject to the consumer and then exchanges four record payloads through the shared mapping while endpoint CALLs carry only sequence numbers and checksums. The demo drives raw SQE/CQE construction through capos-demo-support rather than a typed runtime client and uses an ad-hoc seq+checksum framing because the typed SharedBuffer ABI, ring-shaped producer/consumer metadata, and notification primitives are still pending; production services (File.readBuf, BlockDevice.readBlocks, NIC RX/TX rings) will reuse the same MemoryObject substrate through that future surface, not the demo’s framing.
  • cargo test-lib covers cap-table transfer preflight, provisional insertion, commit, rollback, stale generation, and slot exhaustion cases.
  • cargo test-ring-loom covers ring queue behavior that endpoint IPC depends on for completion delivery.

Open Work

  • Add notification objects for signal-style events.
  • Add Cap’n Proto promise pipelining after endpoint routing can resolve dependent answers.
  • Add a typed SharedBuffer capability surface (ring-shaped producer/consumer metadata, completion signaling, lifetime/quota rules) on top of the raw MemoryObject substrate exercised by make run-ipc-zerocopy.
  • Add epoch-based revocation if broad authority invalidation becomes necessary.

Authority Graph and Resource Accounting for Transfer

This document defines the authority graph and resource-accounting contract originally tracked as Security Verification Track S.9 in docs/proposals/security-and-verification-proposal.md. It covers:

  • capability transfer (xfer_cap_count, copy/move, rollback)
  • ProcessSpawner prerequisites (spawn quotas and result-cap insertion)

Security Verification Track S.9 is complete when this design contract is concrete enough to guide implementation. The invariants and acceptance criteria below are implementation gates for capability transfer, ProcessSpawner, Security Verification Track S.8, and Security Verification Track S.12 follow-up work, not requirements for declaring the Security Verification Track S.9 design artifact complete. Current capability-semantics follow-up items live in docs/backlog/stage-6-capability-semantics.md.

Current Implementation and Target Contract

The current implementation defines ResourceLedger fields in capos-lib/src/cap_table.rs for capability slots, outstanding calls, scratch bytes, frame-grant pages, and virtual-reservation pages. Cap-slot and frame/virtual page reservations are wired into current reservation paths. Outstanding-call and scratch-byte counters are present ledger fields but are not yet fully wired into reservation/preflight paths. Endpoint queue quota, diagnostic log-rate accounting, and CPU token-bucket accounting below are target contract fields for future implementation work, not current ResourceLedger members.

1. Authority Graph Model

Authority is modeled as a directed multigraph:

  • Nodes:
    • Process(Pid)
    • Object(ObjectId) (kernel object identity, independent of per-process CapId)
  • Edges:
    • Hold(Pid -> ObjectId) with metadata:
      • cap_id (table-local handle)
      • interface_id
      • badge
      • transfer_mode (copy, move, non_transferable)
      • origin (kernel, spawn_grant, ipc_transfer, result_cap)

Security invariant A1: all authority is represented by Hold edges; no operation can create object authority outside this graph.

Security invariant A2: each process mutates only its own CapTable edges except through explicit transfer/spawn transactions validated by the kernel.

Security invariant A3: for every live Hold edge there is exactly one cap_id slot in one process table referencing the object generation.

2. Per-Process Resource Ledger and Quotas

Each process owns a kernel-maintained ResourceLedger. For wired reservation paths, enforcement is fail-closed at reservation time (before side effects). The target contract completes enforcement for present-but-unwired fields and extends the ledger with endpoint queue, diagnostic log, and CPU budget counters.

ResourceLedger {
  // Current ledger fields.
  cap_slots_used / cap_slots_max
  outstanding_calls_used / outstanding_calls_max
  scratch_bytes_used / scratch_bytes_max
  frame_grant_pages_used / frame_grant_pages_max
  virtual_reservation_pages_used / virtual_reservation_pages_max

  // Target/future fields.
  endpoint_queue_used / endpoint_queue_max
  log_bytes_window_used / log_bytes_per_window (token bucket)
  cpu_time_us_window_used / cpu_budget_us_per_window (token bucket)
}

Initial quota profile for Stage 6/5.2 bring-up (tunable by kernel config):

  • cap_slots_max: 256
  • outstanding_calls_max: 64
  • scratch_bytes_max: 256 KiB
  • frame_grant_pages_max: 4096 pages (16 MiB at 4 KiB pages)
  • virtual_reservation_pages_max: kernel-configured virtual reservation budget
  • Future target fields: endpoint_queue_max 128 messages, log_bytes_per_window 64 KiB/sec with 256 KiB burst, and cpu_budget_us_per_window 10,000 us per 100,000 us window.

Security invariant Q1: no counter may exceed its max.

Security invariant Q2: every resource reservation has a matched release on all success, error, timeout, process-exit, and rollback paths.

Security invariant Q3: quota checks for transfer/spawn happen before mutating sender or receiver capability state.

3. Diagnostic Rate Limiting and Aggregation

Repeated invalid ring/cap submissions are aggregated per process and error key.

  • Key: (pid, error_code, opcode, cap_id_bucket)
  • Buckets:
    • cap_id_bucket = exact cap id for stale/invalid cap failures
    • cap_id_bucket = 0 for structural ring errors
  • Per-key token bucket: allow first N=4 emissions/sec, then suppress.
  • Suppressed counts are flushed once per second as one summary line:
    • pid=X invalid submissions suppressed=Y last_err=...

Security invariant D1: invalid submission floods cannot consume unbounded serial bandwidth or scheduler time in log formatting.

Security invariant D2: suppression never hides first-observation diagnostics for a new (pid,error,opcode,cap bucket) key.

4. Transfer and Rollback Semantics

Transfers (xfer_cap_count > 0) use a kernel transfer transaction (TransferTxn) scoped to a single SQE dispatch. The current ring ABI does not provide kernel-owned SQE sequence numbers or a durable transaction table, so userspace replay of a copy-transfer SQE is repeatable: each replay is treated as a new copy grant. Move-transfer replay fails closed after the source slot is removed or reserved by the first successful dispatch.

Future exactly-once replay suppression requires transaction identity scoped to (sender_pid, call_id, sqe_seq) and a monotonic transfer epoch. Until that exists, exactly-once claims apply only within one dispatch attempt, not across malicious rewrites of shared SQ ring indexes.

Sensitive interfaces must choose their transfer mode deliberately:

Transfer modeSemanticsSuitable forRequired negative tests
copyRepeatable grant; sender keeps authority and replaying the same copy-transfer SQE can mint another receiver hold.Stateless or explicitly shareable caps where duplicate receivers are acceptable and audited.Replay mints only allowed duplicate holds; quota exhaustion fails closed; copy across forbidden session/transfer scope is rejected.
moveSingle authority handoff; sender loses the source hold after successful destination insertion. Replay fails closed after source reservation/removal.Linear resources, accepted sockets, terminal sessions, one-shot result caps, and authority that should have one active owner.Replay after success fails; rollback restores sender on partial failure; receiver cannot observe authority before commit.
non_transferableNo IPC/spawn transfer.Process-local control caps, raw spawn/network/device authority, private keys, and caps whose authority depends on caller-local state.IPC/spawn transfer attempts fail closed and leave sender/receiver tables unchanged.

Copy-transfer replay is therefore acceptable only for caps whose interface contract says repeated receivers are safe. Sensitive caps must be move-only or non-transferable until the interface has an explicit replay threat model and hostile tests.

Phases:

  1. Prepare:
    • validate SQE transport fields and xfer_cap_count
    • validate sender ownership/generation/transferability for each exported cap
    • reserve receiver quota (cap_slots, outstanding_calls, scratch if needed)
    • pin sender entries in txn state (no sender table mutation yet)
  2. Commit:
    • insert destination edges exactly once
    • for copy: increment object refcount/export ref
    • for move: remove sender slot only after destination insertion succeeds
    • publish completion/result
  3. Finalize:
    • release transient reservations
    • mark txn terminal (committed or aborted)

On any error before Commit, rollback is full:

  • receiver inserts are not visible
  • sender slots/refcounts unchanged
  • reservations released
  • CQE returns transfer failure (CAP_ERR_TRANSFER_ABORTED / subtype)

On error during Commit, kernel executes compensating rollback to preserve exactly-once visibility: either all inserts are visible with matching sender state transition, or none are visible.

Security invariant T1: each transfer descriptor is applied at most once within a single SQE dispatch attempt.

Security invariant T2: move transfer is atomic from observer perspective; no state exists where both sender and receiver lose authority due to partial apply.

Security invariant T3: copy-transfer SQE replay is explicitly repeatable until kernel-owned transaction identity exists. Move-transfer replay fails closed after source removal or source reservation.

Security invariant T4: CAP_OP_RELEASE removes one local hold edge only from the caller table and decrements remote export refs exactly once.

5. Integration with 3.6 Capability Transfer

3.6 implementation must consume this design directly:

  • CALL and RETURN validate all currently-reserved transfer fields fail-closed when unsupported.
  • xfer_cap_count path is wired through TransferTxn (no ad hoc direct inserts).
  • Badge propagation is explicit in transfer descriptors and copied into destination edge metadata.
  • CAP_OP_RELEASE uses the same authority ledger and refcount bookkeeping.

3.6 acceptance criteria:

  1. Copy transfer produces one new receiver edge and retains sender edge.
  2. Move transfer produces one new receiver edge and deletes sender edge atomically.
  3. Any transfer failure leaves sender and receiver CapTables unchanged.
  4. Copy replay is an explicit repeatable-grant policy until a kernel-owned transaction identity is added; move replay fails closed after source removal or reservation.
  5. CAP_OP_RELEASE on stale/non-owned cap fails closed without mutating other process tables.

6. Integration with 5.2 ProcessSpawner Prerequisites

5.2 must use the same accounting and transfer machinery:

  • spawn() preflights child quotas (cap_slots, outstanding_calls, scratch, frame_grant_pages, endpoint queue baseline) before mapping child memory or scheduling.
  • Parent-provided CapGrant entries are inserted via the same transfer transaction semantics (copy for initial grants in 5.2.2).
  • Returned ProcessHandle is inserted through the standard result-cap insertion path and accounted as a normal cap slot.
  • Child setup rollback must unwind:
    • address space mappings
    • ring page
    • CapSet page
    • kernel stack
    • allocated frames
    • provisional capability edges/reservations

5.2 acceptance criteria:

  1. Spawn failure at any step leaves no child-visible process and no leaked ledger usage.
  2. Successful spawn accounts all child bootstrap resources within quotas.
  3. Parent and child cap-table accounting remains balanced under repeated spawn/exit cycles.
  4. ProcessHandle.wait and exit cleanup release outstanding-call/scratch/frame usage deterministically.

7. Implementation Notes for Verification Tracks

This design unblocks:

  • Security Verification Track S.8 hostile-input tests for quota and invalid-transfer failures.
  • Security Verification Track S.12 Kani bounds refresh for ledger and transfer invariants.
  • Target 12 in docs/proposals/security-and-verification-proposal.md with explicit allocator hooks and fail-closed exhaustion behavior.

Userspace Runtime

The userspace runtime owns the repeated mechanics that every service needs: bootstrap validation, heap initialization, typed capability lookup, ring submission, completion matching, application exception decoding, and handle lifetime.

  • Go VirtualMemory Contract defines the caller-buffer reserve, commit, and decommit methods allocator paths need.
  • Programming Languages summarizes current native Rust support and planned language-runtime tracks.
  • Memory Management documents the implemented kernel VirtualMemory and MemoryObject behavior.
  • Go Runtime is the owning language runtime proposal; LLVM Target records the Go runtime OS hooks that drive this work.

Current Behavior

Runtime-owned _start receives (ring_addr, pid, capset_addr), initializes a fixed heap, validates the ring address, reads the read-only CapSet page, installs an emergency Console panic path when available, calls capos_rt_main(runtime), and exits with the returned code.

The Runtime lends out at most one RuntimeRingClient at a time. The client wraps the raw ring page, keeps request buffers alive until completions are matched, handles out-of-order completions, packs copy-transfer descriptors, and parses result-cap records. Owned runtime handles queue CAP_OP_RELEASE when the last local reference is dropped; the release queue flushes when a ring client is borrowed or dropped, or when code calls Runtime::flush_releases() explicitly. Promise placeholders are currently bookkeeping only; their future SQE coordinates map AnswerId.raw() to pipeline_dep and a result-cap record index to pipeline_field.

Design

The runtime separates non-owning bootstrap references from owned local handles. CapSet entries produce typed Capability<T> values only when the interface ID matches the requested type, and the same manifest-order CapSet entries remain available for diagnostic and shell surfaces that need to list or inspect what a process was actually granted. Result-cap adoption performs the same interface check before producing OwnedCapability<T>.

Typed clients are thin wrappers over the ring client. They encode Cap’n Proto params, submit CALL SQEs, wait for a matching CQE, decode transport errors, and decode kernel-produced CapException payloads into client errors. Endpoint servers can use submit_endpoint_return_exception() to return a serialized CapException to the original caller over the same endpoint RETURN path. The handwritten TimerClient exposes monotonic now reads and sleep calls over the same completion-matching path. The handwritten VirtualMemoryClient exposes map, reserve, commit, decommit, unmap, and protect calls for runtime heap/arena allocation over anonymous user pages. It has both the ordinary allocation-backed async methods and synchronous caller-buffer methods for allocator growth paths that cannot allocate while asking the kernel for more memory. This matches the reserve/commit/decommit surface specified in Go VirtualMemory Contract. The handwritten ThreadControlClient exposes current-process FS-base reads and updates for runtimes that need to swap a language-managed TLS base after process startup.

The 7.1.0 threading contract keeps one process ring and the runtime’s single-owner ring-client invariant for the first in-process threading implementation. Future multi-threaded runtimes must serialize blocking ring entry through capos-rt until a runtime reactor or Ring v2 lands. The reactor bridge uses one runtime-owned CQ drainer plus ParkSpace-backed wait records; the full-SMP kernel target is per-thread rings, where cap_enter waits on the current thread’s CQ. After 7.2, the existing ThreadControlClient methods apply to the current thread’s FS base rather than to a process-wide saved FS base. ThreadControl.exitThread and the raw exit(code) syscall both terminate the current thread; the process exits when its last live thread exits.

The 7.2.3 park slice adds a process-local ParkSpace marker type and compact CAP_OP_PARK / CAP_OP_UNPARK operations. capos-rt should expose those operations as runtime synchronization primitives in a later slice; the current thread-lifecycle proof uses raw SQEs so the runtime does not prematurely claim the park user_data namespace. Blocking park wait is not an ordinary RuntimeRingClient call: the wait SQE must be thread-owned for the current thread, and the runtime must reserve park user_data values, write the wait SQE under its ring-submission lock, release that lock before cap_enter, and demultiplex park CQEs into runtime-owned wait slots so a sibling thread can still submit the wake. The temporary single-thread park fallback remains only as the pre-thread runtime checkpoint proof.

Future generated clients should preserve this split: transport lifetime and completion matching belong in the runtime, while interface-specific encoding belongs in generated or handwritten client wrappers.

Invariants

  • ring_addr must equal RING_VADDR; runtime bootstrap rejects any other address.
  • The CapSet header magic/version must validate before lookup.
  • CapSet handles are non-owning unless explicitly adopted.
  • Only one runtime ring client may be live at a time for a process.
  • Until Ring v2, multithreaded generic client waits must flow through a runtime reactor/demux path rather than letting multiple threads consume the process CQ directly.
  • Park wait must not hold the live runtime ring client while the kernel parks the current thread.
  • Request params and result buffers must outlive their matching CQE.
  • A result cap can be consumed only once and only with the expected interface ID.
  • Promise placeholders must map to sideband result-cap record indexes, not schema field paths.
  • Dropping the final owned handle queues exactly one local CAP_OP_RELEASE; Runtime::flush_releases() forces queued releases and reports rejected kernel release results.
  • Release flushing treats stale or already-removed caps as non-fatal cleanup.

Code Map

  • capos-rt/src/entry.rs - _start, Runtime, bootstrap validation, single-owner ring token, release queue flushing.
  • capos-rt/src/alloc.rs - fixed userspace heap initialization.
  • capos-rt/src/capset.rs - typed CapSet lookup and manifest-order iteration wrappers.
  • capos-rt/src/ring.rs - ring client, pending calls, completion matching, copy-transfer packing, result-cap parsing.
  • capos-rt/src/client.rs - Console, TerminalSession, BootPackage, ProcessSpawner, ProcessHandle, VirtualMemory, Timer, ThreadControl, ThreadSpawner, and ThreadHandle clients, and exception decoding.
  • capos-rt/src/lib.rs - typed capability marker types and owned handle reference counting.
  • capos-rt/src/panic.rs - emergency Console output path.
  • capos-rt/src/syscall.rs - raw syscall instructions and public syscall wrappers, including the hostile smoke probe for the removed ambient write syscall.
  • targets/x86_64-unknown-capos.json - userspace target specification.
  • tools/check-userspace-runtime-surface.sh - source check that keeps runtime primitives owned by capos-rt.
  • init/src/main.rs, capos-rt/src/bin/smoke.rs, and shell/src/main.rs - current runtime users.

Validation

  • make capos-rt-check builds the runtime smoke binary against targets/x86_64-unknown-capos.json, matching the booted userspace target.
  • make init-capos-build, make demos-capos-build, make shell-capos-build, and make capos-rt-capos-build expose focused custom-target build wrappers for the current userspace crates and runtime smoke binary.
  • tools/check-userspace-runtime-surface.sh verifies init, demos, and shell do not define _start, panic handlers, global allocators, raw syscall instructions, or entry-point macros outside capos-rt.
  • make run-smoke validates runtime entry, typed Console calls, exception decoding, owned handle release, result-cap parsing through IPC, and clean process exit.
  • make run-spawn validates ProcessSpawnerClient, ProcessHandleClient, VirtualMemoryClient, TimerClient, ThreadControlClient, ThreadSpawnerClient, ThreadHandleClient, result-cap adoption, and release behavior under init spawning. The single-thread-runtime child proves the first runtime-shaped checkpoint over caller-buffer VirtualMemory calls and Timer; the thread-lifecycle child proves in-process create, self-join rejection, join, detach, last-thread exitThread, and private ParkSpace wait/wake correctness.
  • make run-shell validates CapSet iteration, capability inspection, typed application-error decoding, guest session metadata, exact-grant spawning, ProcessHandle waits, and stale-handle release behavior in the focused shell-launch proof manifest.
  • make run-terminal validates TerminalSessionClient writes, bounded line reads, hidden-echo input handling, and structured cancellation in the focused terminal proof manifest.
  • cd capos-rt && cargo test --lib --target x86_64-unknown-linux-gnu covers host-testable runtime invariants when run explicitly.

Open Work

  • Add generated client bindings after the schema surface stabilizes.
  • Implement promise/answer transport semantics beyond current placeholders.
  • Add typed ParkSpace clients with runtime-owned user_data demultiplexing.
  • Define release behavior for queued handles when a process exits before the release queue flushes.

Memory Management

Memory management gives the kernel controlled ownership of physical frames, separates user processes, enforces page permissions, and exposes memory authority only through explicit capabilities.

Current Behavior

The frame allocator builds a bitmap from the Limine memory map, marks all non-usable frames as used, reserves frame zero, and reserves its own bitmap frames. The heap is initialized separately for kernel allocation.

Paging initialization builds a new kernel PML4, remaps kernel sections with section-specific permissions, copies upper-half mappings with NX applied and user access stripped, switches CR3, then enables page-global support. SMEP/SMAP are enabled after those mappings are active.

Each user AddressSpace owns its lower-half page tables and clones the kernel’s upper-half mappings. Dropping an address space walks the user half and frees mapped frames, committed anonymous frames retained behind VM_PROT_NONE, and page-table frames. VirtualMemory lets a process reserve anonymous address ranges, commit and decommit physical backing, unmap reservations, and protect committed pages. Anonymous reservations charge the process virtual reservation ledger. Committed anonymous pages charge ResourceLedger::frame_grant_pages.

FrameAllocator allocation methods return a MemoryObject result capability, not a physical address. The normal result payload carries the result-cap index, and the CQE transfer-result record carries the local cap id plus MemoryObject interface id. MemoryObject.info exposes page count and size; MemoryObject.map maps page-aligned object ranges into the caller address space, MemoryObject.unmap removes those borrowed mappings, and MemoryObject.protect updates their page-table flags. Held MemoryObject caps charge the holder’s frame_grant_pages ledger, and final CAP_OP_RELEASE or process exit frees the owned frames once no borrowed address-space mapping still holds the backing alive.

Design

The kernel keeps physical allocation host-testable by placing bitmap logic in capos-lib and wrapping it with kernel HHDM access in kernel/src/mem/frame.rs. Page-table manipulation stays in the kernel because it is architecture-specific.

ELF loading and VirtualMemory both use page-table flags to preserve W^X: non-executable data gets NX, writable mappings are explicit, and userspace pages must be USER_ACCESSIBLE. The CapSet and ring bootstrap pages occupy reserved virtual pages; VirtualMemory rejects ranges that overlap either one.

User-buffer validation for process-owned buffers uses the process AddressSpace mutex. The kernel checks that user pointers stay below the user address limit, verifies page-table permissions for the requested read/write access, and copies through the HHDM mapping while holding the same address-space lock. This keeps validation and use tied to one stable page-table view. The legacy current-CR3 validator remains only for callers that already provide an equivalent page-table stability guarantee.

Committed VirtualMemory pages and held MemoryObject caps use the same per-process frame-grant ledger, with quota checks before frame allocation or mapping side effects. Anonymous reservation consumes a separate virtual page quota, so guard ranges and Go-style sysReserve arenas do not spend physical commit budget. Held MemoryObject caps charge for the backing they keep reachable, and each live borrowed MemoryObject mapping reserves frame-grant pages until it is unmapped. This prevents a process from mapping an object, releasing the cap to drop the cap-slot charge, and keeping the backing pinned without quota. The address space records borrowed pages separately from sparse anonymous reservations so teardown and unmap can distinguish anonymous pages from object-backed pages. Future file/network/DMA resources should reuse that authority ledger instead of adding one-off counters per cap.

Invariants

  • Frame addresses are 4 KiB aligned.
  • The frame bitmap’s own frames are never returned as free frames.
  • Upper-half kernel mappings are not user-accessible.
  • Kernel text is RX, rodata is read-only NX, and data/bss are RW NX.
  • User address spaces own only lower-half page-table frames.
  • Process frame-grant usage covers committed anonymous VM pages, held MemoryObject caps, and live borrowed MemoryObject mappings.
  • Process virtual-reservation usage covers reserved anonymous VM pages whether or not they are committed.
  • Committed VM_PROT_NONE pages retain their frames and data while exposing no present user PTE; reserved uncommitted pages consume no frame-grant quota.
  • Object-backed user mappings are tracked as borrowed pages and hold the MemoryObject backing alive until unmapped or address-space teardown.
  • MemoryObject unmap/protect only succeeds for borrowed pages backed by the same object.
  • VirtualMemory caps are bound to one address space and are not valid cross-process service exports.
  • CapSet is read-only/no-execute; ring is writable/no-execute.
  • VirtualMemory cannot reserve, map, commit, decommit, unmap, or protect the ring or CapSet pages.
  • VirtualMemory commit/decommit/protect/unmap only succeeds for ranges covered by anonymous reservations owned by the cap’s address space.
  • Capability-ring CALL/RECV/RETURN buffers, transfer descriptors, process and thread wait completions, and private ParkSpace word reads must validate and copy/read while holding the target process AddressSpace lock.

Code Map

  • capos-lib/src/frame_bitmap.rs - host-testable physical frame bitmap core.
  • capos-lib/src/cap_table.rs - capability holds and per-process ResourceLedger frame-grant accounting.
  • capos-lib/src/frame_ledger.rs - bounded frame-grant helper retained for host tests.
  • kernel/src/mem/frame.rs - Limine memory-map integration and global frame allocator wrapper.
  • kernel/src/mem/heap.rs - kernel heap setup.
  • kernel/src/mem/paging.rs - kernel remap, AddressSpace, page mapping, VM-cap page tracking, user copy helpers.
  • kernel/src/mem/validate.rs - user-address bounds and legacy current-CR3 validation helper.
  • kernel/src/cap/frame_alloc.rs - FrameAllocator capability and cleanup.
  • demos/memoryobject-shared-parent/ and demos/memoryobject-shared-child/ - QEMU shared MemoryObject smoke.
  • tools/qemu-memoryobject-shared-smoke.sh - transcript checks for the shared MemoryObject smoke.
  • kernel/src/cap/virtual_memory.rs - VirtualMemory capability.
  • kernel/src/spawn.rs - ELF, stack, and TLS user mappings.
  • kernel/src/arch/x86_64/smap.rs - SMEP/SMAP setup and legacy direct user access guard.

Validation

  • cargo test-lib covers frame bitmap, frame ledger, ELF parser, and cap-table pure logic.
  • cargo miri-lib runs host-testable capos-lib tests under Miri when installed.
  • make kani-lib proves the bounded mandatory frame-bitmap, stale-handle, cap-slot/frame-grant accounting, and transfer preflight fail-closed invariants when Kani is installed.
  • make run-smoke validates ELF mapping, process teardown, TLS, and clean shell-led halt.
  • make run-spawn validates MemoryObject-backed FrameAllocator cleanup, VirtualMemory reserve/commit/decommit/VM_PROT_NONE/quota/release smoke, and runtime spawn checks.
  • make run-memoryobject-shared validates a parent allocating and mapping a MemoryObject, transferring it to a child, observing a child write through the same backing pages, unmapping both sides, and halting cleanly.
  • make run-ipc-zerocopy validates the multi-message shared point-to-point buffer pattern at the substrate level: a producer transfers one MemoryObject to the consumer and then exchanges four record payloads through the shared mapping while endpoint CALLs carry only sequence numbers and checksums. This is a substrate proof, not the production data-plane shape: typed SharedBuffer with explicit producer/consumer ring metadata, notification primitives, and consuming service APIs (File.readBuf, BlockDevice.readBlocks, NIC RX/TX rings) are tracked under Open Work.
  • make run-spawn validates ELF load failure rollback and frame exhaustion handling through ProcessSpawner.

Open Work

  • Extend frame-grant accounting only if future DMA pinning or service-owned shared-buffer pools need authority beyond held MemoryObject caps and live borrowed mappings.
  • Define page-pinning or mapping-identity rules for future shared WaitSet, DMA, and service-owned shared-buffer paths that must keep physical backing stable beyond a single locked copy/read. The owning planning track is Memory Authority Model.
  • Add file, block, network, and DMA service APIs that use MemoryObject-backed SharedBuffer caps for zero-copy data paths.
  • Add DMA isolation and device memory capability boundaries before userspace drivers.
  • Add huge-page handling only with explicit ownership and teardown rules.

Scheduling

Scheduling decides which thread runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process-owned execution resources.

Current Behavior

The scheduler stores shared process/thread metadata in Scheduler::processes: BTreeMap<Pid, Process>. Dispatch-owned runnable state lives in SchedulerDispatch: a per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] array ordered ascending by Thread.virtual_finish_ns, per-CPU current and handoff_current slots, idle-thread slots, the direct-IPC target preference, run-queue reservation accounting, and deferred drop/stack release slots. Each live thread has at most one queued owner across all per-CPU queues combined, and every per-CPU queue reserves capacity up to the live runnable-capable thread count before a new thread is published as runnable, so later timer, unblock, requeue, and steal-requeue paths do not allocate. The shared live-reservation count is released when processes or threads exit or when pre-publication reservation is rolled back. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes.

Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d). The accepted state is the WFQ scheduler described here: per-thread weights and latency classes are mutated only through SchedulingPolicyCap, each per-CPU runnable queue is ordered by freshly derived virtual_finish_ns, migration preserves virtual_runtime_ns, and bounded stealing selects the most-overdue runnable sibling candidate. The controlled Task 6 benchmark pair on capos-bench recorded capOS 1-to-4 work/total speedups 3.088x / 2.700x versus the previous single-global-queue baseline 1.566x / 1.538x; the matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The host harness enforced the configured 1-to-2 work/total gates; the 1-to-4 row was manually accepted from recorded diagnostics. Phase E SchedulingContext is the next scheduler authority phase; EEVDF is a follow-on ordering-policy evaluation rather than a Phase D blocker.

Phase D Task 3 (2026-05-07) restored the per-CPU runnable queues that the 2026-05-02 collapse retired and gave them the WFQ ordering Task 2’s virtual_finish_ns was prepared for. Newly created processes and threads publish onto the creating scheduler CPU’s per-CPU queue; the bounded steal path balances the queues when other CPUs run out of local work. The publish-time placement is intentionally simple in this slice — “place locally, let steal balance” — and a more sophisticated caller-aware spread or least-loaded scan is a milestone-gate follow-up, not a Task 3 acceptance requirement. Wake policy carries WakePolicy::QueueCpu(u32) for endpoint, timer, park, process-wait, thread-join, and process-spawn completions so the wake target matches the queue placement, and DirectTarget keeps its original direct-IPC handoff role. The transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny fallback has been removed before Phase E SchedulingContext schema work.

wake_idle_scheduler_cpus_locked first probes the placement target when the policy is QueueCpu, then walks eligible idle scheduler CPUs and wakes the first that accepts a fresh reschedule IPI, skipping CPUs that already have a pending IPI so a burst of ready work cross-wakes more than one neighbor instead of stranding the rest behind one already-targeted CPU.

Ring SQ Consumer Ownership

Each ring endpoint has kernel-owned SQ-consumer metadata outside the writable userspace ring page. cap_enter and the bounded timer-side current-thread ring service both acquire a syscall-mode owner lease before calling process_ring(). The lease carries a nonzero generation and owner identity; process_ring() verifies that generation before flushing deferred ring work or advancing SQ head, and stale owners return StaleSqConsumer without consuming the head SQE. Duplicate owners fail closed as a retryable busy cap_enter status.

CQ publication remains independent of SQ ownership. Already accepted completions stay visible through CQ head/tail even after the SQ owner releases, and thread/process teardown releases any live SQ owner before ring unmapping or record drop without clearing accepted CQEs.

Bounded SQPOLL ring mode

Phase F adds a bounded SQPOLL mode for the caller thread’s ring through CpuIsolationLease with allowedMode = kernelSqpoll and namedRing = callerThread. The transition is explicit: syscall-owned dispatch may request SQPOLL start while it still owns the SQ, then releases its generation-checked owner; the poller finalizes into SqpollRunning, may publish NEED_WAKEUP and enter SqpollSleeping, wakes back to running when a producer publishes a new SQ tail, and stops or rolls back on lease revoke, cap release, teardown, or failed start. Timer-side syscall-mode ring service fails closed while SQPOLL owns the same endpoint, so no second SQ consumer can advance the SQ head.

The Phase F poller runs from the periodic scheduler service path and from a bounded current-thread syscall service entry used for SQPOLL producer wakes and explicit syscall kicks. Both entries borrow the SQPOLL owner lease rather than acquiring syscall SQ ownership. The current default admits two SQEs per selected SQPOLL worker, and a worker is not reselected again in the same periodic service pass or syscall service entry. Poller elapsed time is charged to the admitted scheduler ledger or scheduling-context target. The wake/sleep protocol uses a shared ring flag: the poller publishes NEED_WAKEUP, performs a full ordering barrier, and rechecks SQ tail before sleeping; producers publish initialized SQEs, store SQ tail with a barrier, and enter the kernel if NEED_WAKEUP is visible. A cap_enter producer wake that finds SQPOLL already owns SQ head can run one bounded SQPOLL batch, return visible CQ availability when the requested threshold is satisfied, preserve ordinary blocked-current-thread and thread-owned-head results, and otherwise fail closed as a retryable busy result. Stale owner generations fail before deferred ring work or SQE start. If teardown requests stop after a live owner has already accepted a SQE, the poller still publishes SQ head for that accepted SQE before releasing ownership, preserving accepted CQEs without leaving work replayable by syscall mode. The focused make run-scheduler-generic-sqpoll-nohz proof admits this explicit ring-coupled shape into SQPOLL nohz, drives producer wake and bounded service progress without depending on a periodic tick, then rolls back on stale owner/lease revoke. Policy-service automatic nohz, broader userspace-poller/device-queue admission, and production realtime admission remain future work.

Per-CPU run queue ordering structure

Each per-CPU VecDeque<ThreadRef> is kept ordered ascending by Thread.virtual_finish_ns. Enqueue performs an ordered insert via a linear scan from the front; selection scans the queue by index for the first destination-Runnable entry (via pop_first_runnable_local_locked), removes Drop entries it walks past, and leaves RetryLater entries undisturbed for the next scheduler pass. Because the queue is ordered ascending, the first Runnable hit is also the lowest-virtual_finish_ns candidate the destination CPU can accept (the most overdue against fair share that this CPU is allowed to run). Linear-scan insert is O(n) per enqueue; with SCHEDULER_CPUS = 4 and bounded thread counts in this slice the constant is small enough to defer a smarter structure (sorted bucket arrays, intrusive trees) until benchmark evidence shows it dominates scheduler-lock hold time. Promoting to a smarter structure is a follow-up under this plan if the Task 6 milestone gate proves the need.

virtual_finish_ns is recomputed on every enqueue from the thread’s current virtual_runtime_ns, weight, and latency_class; it is never carried as committed state across blocking, and migrations between per-CPU queues recompute it at the destination so the destination’s view of fair-share progress applies. The derivation rule per latency class is documented in capos-abi/src/scheduler.rs and the “Latency-class semantics for Phase D” section of docs/proposals/scheduler-evolution-proposal.md.

Bounded steal path

When a CPU’s local queue has no immediately runnable entry the scheduler walks sibling per-CPU queues. For each sibling queue the scan walks indices ascending and selects that queue’s first entry that the destination CPU considers Runnable; because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit is also the lowest virtual_finish_ns candidate available to the destination on that source queue. The steal then picks the source queue whose first-Runnable candidate has the lowest virtual_finish_ns overall, with ties broken by lower CPU id. The chosen entry is removed from its current position in the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the source’s front and stay there), the WFQ tag is recomputed at the destination, and the entry is inserted at the destination’s ordered position. The destination queue is reserved to the full live-thread count, so the steal-requeue is allocation-free. The scan walks at most SCHEDULER_CPUS * max_queue_len entries, but in practice each sibling scan stops at the first Runnable candidate per queue.

RetryLater semantics in the local scan

The local pop scan walks the per-CPU queue by index instead of popping the front and re-pushing RetryLater candidates. Re-pushing a RetryLater entry whose virtual_finish_ns has not changed would ordered-insert it back at the same head position, so a naive pop-then-requeue loop would re-pop the same RetryLater head every iteration and starve runnable entries behind it. The index scan removes Drop entries in place, leaves RetryLater entries undisturbed for the next scheduler pass to re-evaluate, and returns the first Runnable candidate it finds. The bounded steal path uses the same index scan on the destination queue after a steal so a stolen RetryLater entry does not get re-popped in the same dispatch pass.

Phase E preflight fallback cleanup

The one-bisect-cycle CAPOS_SCHED_DISABLE_WFQ=1 opt-out has been removed. Enqueues always target the selected per-CPU WFQ queue, and wake-up sites always carry WakePolicy::QueueCpu(slot) for queued work. Phase E SchedulingContext work therefore starts from the accepted Phase D WFQ behavior rather than from a source-level single-global-queue fallback.

Phase E Task 1: scheduling-context object shape

The first SchedulingContext slice is info-only: schema, config, runtime, and kernel code expose SchedulingContext.info() and a bootstrap grant shape, but no dispatcher enforcement, replenishment, donation/return, depletion notification, realtime island, SQPOLL, or nohz behavior. SchedulingContextSpec.cpuMask uses the canonical little-endian bitset defined in schema/capos.capnp: CPU n maps to bit n % 8 of byte n / 8, with bit 0 as the least-significant bit of that byte. Empty data means no CPUs are selected rather than all CPUs. Producers omit trailing zero bytes, so the all-zero set’s canonical form is empty and any non-empty canonical mask ends with a nonzero byte.

Phase E Task 2: bind, revoke, and generation identity

The second SchedulingContext slice adds the first bounded authority lifecycle. SchedulingContext.create() creates a same-interface result cap for a validated spec, bindCallerThread() records one caller-thread binding for the current context generation, and revoke() advances the generation and clears the matching thread metadata binding. Bootstrap-granted contexts and contexts returned by create() use the same non-wrapping context-id allocator; the binding identity remains (contextId, generation), but distinct cap objects no longer share bootstrap ids. Stale caps report staleGeneration and cannot create, bind, or revoke scheduler metadata for a new generation; already-revoked contexts report revoked. Release cleanup clears only a thread metadata binding that matches the released cap identity.

Phase E: SchedulingContext budget enforcement

make run-scheduling-context is the focused Phase E QEMU proof. It starts one process with two independently granted bootstrap contexts, verifies their identities cannot alias, adopts a created result cap, drives bind/revoke and stale-generation calls, confirms release cleanup by rebinding after the released cap drops, and now checks the first dispatcher budget behavior. bindCallerThread() installs a fixed budget ledger in the caller thread’s scheduler metadata. Runtime charge decrements that ledger at the same scheduler-lock-contained points that update per-thread runtime/vruntime. Runnable selection replenishes elapsed periods and treats exhausted bound contexts as RetryLater until their next period, leaving the queued owner in place rather than allocating or moving emergency-path state. Stale or revoked contexts still fail closed before mutating scheduler metadata or accounting.

The current enforcement granularity is the existing periodic scheduler tick: a running thread may overshoot its budget by the current tick quantum before the next dispatch charge throttles it. The smoke therefore proves bounded dispatcher behavior, not nohz/SQPOLL activation or hard realtime admission. It prints dispatch_effect=budgetEnforced, visible budget charge, replenishment to full budget after a period, and a throttled wall-clock window.

Phase F: CpuIsolationLease and automatic nohz activation

CpuIsolationLease is a separate authority surface from SchedulingContext CPU-time budget enforcement. The scaffold records owner identity, allowed CPU set, allowed isolation mode, live accounting target reference, housekeeping exclusions, maximum revocation latency, and generation identity. It rejects stale generations, duplicate or overlapping active leases, fabricated or stale SchedulingContext accounting targets, malformed CPU masks, and lease sets that would leave no online scheduler housekeeping CPU outside the globally admitted active lease CPUs.

The scheduler-side preflight reports a bounded nohz activation/deactivation decision surface: lease identity, target CPU mask, target runnable entity count, active housekeeping CPU availability after subtracting all active lease CPUs, selected housekeeping CPU mask, deferred cleanup, timer/deadline, network polling, IRQ-affinity, accounting-target, monotonic clocksource/accounting readiness, one-SQ-consumer, revocation latency, rollback, and periodic-fallback labels. The accepted QEMU proof uses -smp 4 so an active lease can report ready housekeeping CPUs outside the target CPU, selected housekeeping placement, and exactly one runnable caller on that target CPU.

The clockevent/deadline substrate uses a calibrated TSC-backed monotonic clocksource on normal QEMU/x86_64, with the periodic LAPIC tick disciplining the TSC epoch so QEMU guest halt windows cannot stall wall-clock progress. Timer.sleep, finite cap_enter, and park timeouts store absolute monotonic deadline_ns values, and the LAPIC clockevent backend can program a bounded one-shot deadline and restore periodic mode.

Automatic nohz activation state machine

When the preflight finds every proof obligation satisfied – a single runnable entity on the target CPU, a ready housekeeping CPU outside the lease, no local deferred-cleanup/timer dependency, a valid accounting target, a live monotonic clocksource, a non-stale one-SQ-consumer when a ring is named, a bounded revocation latency, and the lease’s allowedCpuMask naming exactly one scheduler-owned CPU – it performs real per-CPU periodic-tick suppression for that narrow single-runnable window. The target CPU may be the CPU running the preflight call (local activation) or a different scheduler CPU (remote-CPU activation via a reschedule IPI – see Remote-CPU activation below). The single-runnable shape differs by target: a local activation requires the caller itself to be that single entity (exactly-one-runnable-caller); a remote activation requires the target CPU’s single runnable entity to be some thread pinned there, not the caller (which runs on a different CPU – exactly-one-runnable-remote-target).

  • Admission gates. Two lease shapes can be admitted for tick suppression: a pure namedRing = none compute lease, and a ring-coupled allowedMode = kernelSqpoll lease whose bound ring is being actively driven by a live SQPOLL consumer.
    • Compute lease (namedRing = none). Declares no local network/IRQ dependency, so the read-only network-polling and IRQ-affinity admission gates pass.
    • Ring-coupled SQPOLL lease (allowedMode = kernelSqpoll, namedRing = callerThread). The lease’s declared kernel-polled work IS the bounded SQPOLL ring poller, which the scheduler keeps progressing through cap_enter/producer-wake even while the periodic tick is masked. The preflight admits it only when the bound ring is in SQPOLL running/sleeping mode with a non-stale Sqpoll owner; the one-SQ-consumer label is then blocked-sqpoll-owner (the worker owns the ring). The preflight ring-state read is a best-effort hint – it never takes the per-ring lock inside the scheduler lock (it uses try_lock, and a contended snapshot does not admit activation). The decisive disqualifier is the IPI/timer re-check below.
    • A namedRing = callerThread lease that is not kernelSqpoll (compute-with-ring) keeps the conservative refusal until network polling and IRQ affinity are routed to a housekeeping CPU, as does any device-owning mode. The kernel still services virtio RX/TX and Interrupt waiters inline from the periodic scheduler path.
  • Activate. The preflight masks the periodic LAPIC timer on the current CPU and arms a one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). The CPU now runs on a bounded one-shot deadline instead of the periodic tick. The eligible lease generation is registered so revoke/cleanup paths can stale it.
  • Re-check. On every timer interrupt and on every reschedule IPI the handler re-checks the activation window before the scheduler picks the next thread. The reschedule-IPI handler also drains any pending remote-CPU activation request parked for this CPU (the IPI vector is shared with the remote-activation path – see Remote-CPU activation below), and the periodic timer handler drains it too as a backstop. An unchanged eligible window re-arms the bounded one-shot deadline; a reschedule IPI (the prompt signal that another CPU woke runnable work onto this CPU) drives an immediate rollback. The re-check runs in interrupt context and uses try_lock to avoid deadlocking against a held scheduler lock. Armed-timer invariant: the masked-periodic one-shot does not auto-rearm, so a timer-interrupt re-check NEVER returns leaving a tickless CPU without an armed timer – on scheduler-lock contention it arms a bounded minimum-delta fallback one-shot (or restores the periodic tick) before returning. A lock-free per-CPU nohz-active bitmask lets the contention path distinguish a tickless CPU (the consumed timer was the nohz one-shot and must be replaced) from a normal CPU (the periodic tick auto-rearms). A reschedule IPI does not consume the one-shot, so its contention skip is safe – the still-armed one-shot bounds the next re-check.
  • Rollback. Any disqualifying change rolls the CPU back to the periodic LAPIC tick first, before any further ordinary work: a stale lease generation (explicit revoke, process exit, service replacement, session logout), a second runnable entity or stealable sibling work on the target CPU, a local deferred-cleanup dependency, a direct-IPC target becoming runnable, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline. For a ring-coupled SQPOLL activation the re-check also carries a sqpoll-ring-mode-changed-or-owner-staled disqualifier (the bound ring leaving SQPOLL running/sleeping mode or its owner staling); that re-check runs under the scheduler lock and uses try_lock on the per-ring lock, so a contended ring is treated as disqualifying (fail-closed – restore the periodic tick rather than keep a CPU tickless on an unverifiable ring). That SQPOLL ring-mode branch is defense-in-depth, currently subsumed by lease-generation staling: every reachable SQPOLL-stop path today (stop_sqpoll_for_lease / stop_sqpoll_if_owned) is a revoke/cleanup-path caller that also stales the lease, and stale-lease-generation is checked first – so the lease-generation stale is the load-bearing SQPOLL rollback trigger in practice. The SQPOLL ring-mode branch becomes independently load-bearing, and would then need its own proof, only if a future change introduces a SQPOLL-stop path that keeps the lease live. Runtime accounting stays boundary/counter driven and monotonic, so suppressing the tick never strands SchedulingContext budget charging.
Remote-CPU activation

Masking the periodic LAPIC tick and arming the one-shot deadline are per-CPU operations – only the target CPU can program its own LAPIC timer. When the preflight runs on CPU A but the lease’s single-CPU allowedCpuMask targets a different CPU B, the kernel does not refuse: it parks a bounded remote-activation request in CPU B’s per-CPU slot and sends a reschedule-style IPI to CPU B. CPU B drains the request from its IPI handler (and from its periodic timer handler as a backstop), re-runs the full disqualification check locally under its own scheduler-lock acquisition, and only then arms its own one-shot deadline. A remote activation is never trusted blind – the preflight’s eligibility snapshot was taken on a different CPU and may be stale by the time the IPI is drained, so the target CPU re-checks before committing. The relevant invariants:

  • Bounded request slot, no nesting. The pending-request store is a fixed [Option<_>; SCHEDULER_CPUS] array – one single-entry slot per CPU, so it can never grow unbounded. If a slot already holds an undrained request, a new preflight fails closed (rejected) rather than queuing behind it. The IPI-context drain never nests the scheduler lock: it takes only the small per-CPU slot mutex, then calls the activation in try_lock mode.
  • Contention retry. If the IPI-context drain finds the scheduler lock contended, it leaves the request parked and returns; the target CPU’s next periodic timer tick (still live – the tick has not been suppressed) retries the drain. Progress is bounded by the periodic tick the same way the existing local re-check contention path is.
  • Fail-closed IPI ordering. A remote rollback (rollback_nohz_for_lease) stales the lease generation before clearing the activation record. The drain re-checks the generation before arming, so a rollback that races the drain fails closed (the request is dropped, the periodic tick stays live). If the drain already committed before the rollback cleared the record, the target CPU’s next nohz_recheck sees the nohz-active bit set with no record and restores its periodic tick. Either ordering converges on the periodic tick.
  • Compute-only. Remote-CPU activation is limited to namedRing = none compute leases in this slice. A ring-coupled SQPOLL lease whose target differs from its ring owner’s CPU is not an admitted shape; it fails closed.

Generic full-nohz admission for ordinary budgeted compute threads is available only through an explicit SchedulingContext-targeted compute lease and the same fail-closed placement gates described above. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, single-consumer, and bounded by producer wake/deadline rollback. Broader userspace-poller/device-queue admission, automatic CPU-isolation issuance, and production realtime island admission remain future work; auto_nohz stays disabled. Timeout-based auto-revoke landed 2026-05-30 15:22 UTC: a CpuIsolationLease created with leaseLifetimeNs > 0 records an absolute expiry deadline, auto-revokes through the existing generation-advancing cleanup on first observation past it (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by maxRevocationLatencyNs. A leaseLifetimeNs of 0 preserves the prior revoke/cleanup-only lifecycle. The current SQPOLL-driven activation is the bounded case: tick suppression for a ring-coupled kernelSqpoll lease on the CPU running the preflight, rolled back through lease-generation staling on revoke/cleanup, with the SQPOLL ring-state re-check as defense-in-depth for any future SQPOLL-stop path that does not stale the lease.

Lease revocation and cleanup are generation-aware. Explicit revoke, process exit, service replacement through process termination, and session logout stale the matching generation so old caps cannot keep isolation eligibility alive, and rolling the matching lease’s active nohz window back to the periodic tick is part of the same cleanup path. make run-scheduler-cpu-isolation-lease is the broad QEMU proof for grant, info, revoke, cleanup, real nohz activation and fail-closed rollback, bounded SQPOLL start/sleep/stop, rollback labels, generic full-nohz, and SQPOLL nohz. make run-scheduler-generic-sqpoll-nohz is the focused SQPOLL proof for eligible ring admission, producer wake, SQPOLL service, rollback, and stale owner rejection.

Phase E: endpoint donation and return

Synchronous endpoint delivery now carries a bounded internal donation token when a caller thread with a bound active SchedulingContext delivers a CALL to a receiver thread that has no scheduling context of its own. Donation is strictly passive-server shaped: receivers that already have a scheduling context keep their own authority, unbound callers donate nothing, and callers that receive a donation token are blocked from returning to userspace until the in-flight endpoint call returns or is canceled.

At delivery, the scheduler charges pre-donation caller runtime before moving the context ledger to the receiver. While the receiver handles the endpoint message, normal dispatcher runtime charging decrements the donated context. When endpoint RETURN commits the caller completion, the scheduler first charges receiver runtime since dispatch, then returns the remaining budget and next-replenishment state to the caller’s thread metadata and rebinds the SchedulingContext record to the caller. Return preflight failures leave the in-flight donation in place, while application-exception RETURN, invalid-result RETURN errors, delivery failure, return cancellation, endpoint teardown, process/thread exit, and stale-caller cleanup return or clear the donation before waking the caller and without allocating new emergency-path storage. Nested donation of an already donated context is rejected; supporting stacked donation is deferred until it has an explicit return-token stack design.

make run-scheduling-context proves the behavior with a same-process endpoint round trip. The caller binds a fresh context, burns CPU immediately before CALL, the passive server burns CPU while servicing the endpoint CALL and again immediately before RETURN, and after RETURN the caller observes the reduced budget restored. The same smoke covers application-exception RETURN, oversized-result RETURN under donation, and deterministic rejection of A-to-B-to-C nested donation. It also submits a delivered donated CALL and then uses cap_enter(0, 0) while the server delays RETURN, proving the donor cannot continue outside the donated ledger. A fast-return variant covers the race where the receiver returns before the caller commits to the donation-blocked scheduler state. The smoke prints endpoint_donation=ok, endpoint_return=ok, endpoint_exception_return=ok, endpoint_invalid_return=ok, endpoint_nested_rejected=ok, endpoint_donor_block=ok, endpoint_donor_fast=ok, endpoint_donation_server, endpoint_donation_after, endpoint_exception_return_after, endpoint_invalid_return_after, endpoint_nested_after, endpoint_donor_block_elapsed_ns, endpoint_donor_block_after, endpoint_donor_fast_elapsed_ns, and endpoint_donor_fast_after.

Phase E: SchedulingContext notifications

Every SchedulingContext now owns fixed notification storage allocated at context creation or bootstrap. The storage has two coalescing slots: budgetDepleted and deadlineOrTimeout. Each slot records context id/generation, a saturating sequence, a saturating coalesced-event count, the last holder thread, remaining budget, the next replenishment/deadline timestamp, and whether the holder was using an endpoint-donated context. Runtime charge records depletion when remaining budget transitions to zero and records deadline/timeout expiry against the same context generation. Failed bind attempts do not arm a new budget/deadline window.

SchedulingContext.drainNotifications() returns typed observer results: ok drains the matching fixed cells, revoked reports the current revoked generation, and staleGeneration reports an old observer generation without draining the current record. Explicit revoke() records an explicitRevoke lifecycle event. These notifications explain already-enforced scheduler state; they do not donate budget, reorder runnable entities, bypass throttling, publish result caps, append unbounded queues, allocate on scheduler hard paths, or imply auto-nohz/SQPOLL/tickless behavior. A pre-armed observer waiter/wakeup path remains a future extension.

make run-scheduling-context proves the notification slice by repeatedly draining a depleted context after coalescing, observing deadline expiry, recording explicit revoke and stale-observer labels, and confirming that endpoint-donated runtime records notification state on the donated context. The smoke prints notification_coalescing=ok, deadline_notification=ok, revoke_notification=explicitRevoke, stale_notification=staleGeneration, and endpoint_donated_notification=ok.

Phase E: session logout lifecycle hook

UserSession.logout() now notifies the scheduler after the session liveness cell transitions from live to logged out. That covers explicit UserSession.logout() calls, including the remote DTO gateway logout command and connection-teardown path because those paths already call the same kernel UserSession.logout() method. The hook scans scheduler-owned process/thread metadata for live processes whose immutable SessionContext shares the logged out liveness cell, removes each non-donated matching thread binding from the scheduler ledger, and asks the bound SchedulingContext record to advance its generation and mark itself revoked. Old ordinary SchedulingContext grants therefore report stale generation through info() with zero visible remaining budget and InfoOnlyNoDispatchChange. The focused session-context smoke also proves stale bindCallerThread() does not rebind, stale create() does not publish a result cap, stale revoke() does not mutate the current metadata generation, and stale notification draining reports a stale observer result.

The hook intentionally does not use session code as a second scheduling-context ledger: session lifecycle code only flips liveness and notifies the scheduler, and the scheduler owns the scan and binding removal. The scan takes one binding at a time under the scheduler lock, drops that lock, then calls the SchedulingContextExitCleanup record hook so it does not invert the existing SchedulingContext record-lock to scheduler-lock order used by bindCallerThread().

In-flight endpoint donation uses a conservative counted/skipped logout policy. If the logged-out session owns a receiver thread that currently holds a donated context, the logout hook records that the donated binding was skipped rather than returning donor budget while the endpoint call remains in flight. The focused session-context smoke proves the donor remains blocked in cap_enter(0, 0) until the receiver returns, the hook reports donation_inflight_skipped=1, and endpoint RETURN removes the receiver binding while restoring only the reduced remaining budget to the donor. This does not add a new logout-triggered cancellation semantic. Local owner-shell exit now calls the held UserSession.logout() before clean shell process exit, so the same scheduler hook observes shell logout with stale_marked=0 donation_inflight_skipped=0 in the shell smoke. The ordinary bound-context stale proof remains the focused session-context smoke, because the normal shell does not hold a bound SchedulingContext. Process and thread exit cleanup already have their own stale-context coverage and are unchanged.

Realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future Phase F/G work.

Phase D Task 4: migration fairness invariants

Phase D Task 4 (2026-05-08) made three migration-fairness invariants explicit:

  • virtual_runtime_ns travels with the thread. It lives on Thread.cpu_accounting, not on a per-CPU slot, so a migration from CPU A to CPU B preserves the thread’s accumulated weighted-fair share. The accounting field was promoted out of cfg(measure) in Task 2 and continues to advance through charge_runtime regardless of which CPU charges the quantum.
  • virtual_finish_ns is derived per enqueue, never committed. Every enqueue site – the initial publish in enqueue_ready_thread_on_slot_locked, the post-block requeue in enqueue_unblocked_thread_on_slot_locked, and the steal-insert in steal_from_sibling_queues_locked – routes through refresh_virtual_finish_ns_locked, which reads thread.weight, thread.latency_class, and thread.cpu_accounting.virtual_runtime_ns fresh and recomputes the WFQ ordering tag. The field is never carried as committed state across blocking and is never carried with the thread on migration; the destination CPU’s view of weight, latency class, and quantum decides the new tag.
  • Steal recomputes at the destination. The pop-from-source step in steal_from_sibling_queues_locked is followed by refresh_virtual_finish_ns_locked against the destination slot before the ordered insert, so a SchedulingPolicyCap.setWeight that landed between source enqueue and steal takes effect at the steal itself.

Migrations counter shape

ThreadCpuAccounting.migrations is cfg(feature = "measure")-gated and remains a benchmark-only operator-observability counter; it is not load-bearing for ordering and is not exposed through SchedulingPolicyCap.snapshot. Phase D Task 4 moved the increment from the dispatch-time scheduled_measure path to two enqueue-time arms in kernel/src/sched.rs:

  • Placement-time spread (record_placement_spread_migration_locked) fires from push_reserved_run_queue_locked when the enqueue target slot differs from the thread’s previously dispatched CPU (ThreadCpuAccounting.last_cpu). A thread that has never been dispatched (last_cpu == None) does not register a migration on first publish; otherwise placement spread is counted exactly once per enqueue.
  • Steal (record_steal_migration_locked) fires from steal_from_sibling_queues_locked after the source-queue removal and before the destination-queue insert. The steal scan skips the destination slot, so the counter increments unconditionally each time the steal arm is reached.

scheduled_measure still maintains last_cpu so the placement-spread check has the previous CPU available; only the migrations++ moved. The pre-collapse counter shape is preserved in steady state – a thread that runs on a different CPU than its previous run still records exactly one migration – but the increment is now attributed to the enqueue decision (placement spread or steal) rather than the dispatch that follows it.

The aggregate process-wide thread_placement counter family in kernel/src/measure.rs (migrations, migration_to_cpu0..3, consumed by tools/qemu-thread-scale-harness.sh) is a separate measurement device. It is incremented from account_thread_selected_locked at dispatch time and continues to observe “thread ran on a different CPU than its previously dispatched CPU” rather than the per-thread Task 4 enqueue-time shape, so the thread-scale harness regex does not need to change. The per-thread ThreadCpuAccounting.migrations field and the aggregate thread_placement counter intentionally measure different events at different points in the scheduling pipeline; both stay behind cfg(feature = "measure").

Phase H: per-thread saturation status surface

The Phase H AutoNoHz placement heuristic (a future policy-service feature) needs to read per-thread saturation observation in the normal dispatch build, not only under cfg(feature = "measure"). The non-measure per-thread saturation status surface (2026-05-30) promoted the inputs it consumes into ordinary ThreadCpuAccounting state and exports them through SchedulingPolicyCap.snapshot @2:

  • voluntary_blocks and preemptions moved out of cfg(feature = "measure"). They are charged at the same sites as before – voluntary_blocks when a thread blocks itself (cap_enter wait, park, endpoint scheduling-context donation) and preemptions when the timer requeues a still-runnable running thread – so the measure build’s counts are unchanged; only the cfg gate was removed. A low voluntary_blocks count distinguishes a CPU-saturating thread from an IPC/IO-bound one.
  • runnable_accumulated_ns is a new always-built cumulative counter of runnable-but-not-running time. It is charged at the scheduler-lock-held enqueue/select boundary: push_reserved_run_queue_locked stamps a monotonic runnable_since_ns when a thread is published to a per-CPU run queue without being selected (idempotent across re-publish, so the whole runnable span is counted once), and account_thread_scheduled accumulates the monotonic delta and clears the stamp when the thread is next selected. The stamp/accumulate pair nets to zero for a thread selected at the same monotonic instant it becomes runnable. The clock is monotonic_ns() only (no wall-clock, no rewind), matching charge_runtime’s discipline, and the stamp respects the runnable-ownership rules above (a thread holds a live stamp only between enqueue and selection).

migrations stays measure-gated; it is a placement diagnostic, not a saturation input. The surface exports raw cumulative counters only – windowing, smoothing, and the saturation decision are policy-service choices, never kernel state (see docs/proposals/tickless-realtime-scheduling-proposal.md). Proof: make run-thread-fairness reads the extended snapshot on the weighted workers and asserts the CPU-bound hog reports high runtime_ns with voluntary_blocks at or near zero while at least one preempted lower-weight worker reports nonzero preemptions and runnable_accumulated_ns.

Weight-change-while-enqueued contract

SchedulingPolicyCap.setWeight writes the validated weight directly to Thread.weight through Process::set_thread_weight and does not clear Thread.virtual_finish_ns. A weight change observed while the thread is blocked, running, or already queued takes effect on the next dequeue and re-enqueue because every enqueue site refreshes virtual_finish_ns from current weight/latency_class/ virtual_runtime_ns. The kernel proves the contract two ways:

  • By construction. Process::refresh_thread_virtual_finish_ns reads each input field fresh on every call; there is no cached derivation between enqueues. The function bears a doc-comment asserting the contract.
  • By debug_assert!. Inside the same function, a debug assertion verifies that the recomputed virtual_finish_ns is at or beyond the current virtual_runtime_ns – a future deadline, never a past one. The assertion catches any future regression where the formula could underflow or where a stale cache could drift below the current vruntime.

The focused QEMU smoke that drives setWeight and verifies the post-block dispatch picks up the new weight landed under Phase D Task 5: make run-thread-fairness-weight-change (manifest system-thread-fairness-weight-change.cue, demo demos/thread-fairness/). Two competing child threads run a fixed wallclock window: a baseline worker stays at DEFAULT_WEIGHT, while a heavy worker self-calls SchedulingPolicyCap.setWeight(weight=128) and then blocks on Timer.sleep so it leaves the run queue before the contention window opens. Each worker snapshots its scheduler state at wake and at window end via SchedulingPolicyCap.snapshot, and the parent verifies three independent properties: (1) the heavy snapshot reads weight == 128 and the baseline snapshot reads weight == DEFAULT_WEIGHT; (2) the observed runtime_ns ratio matches the weight ratio inside a configured tolerance; (3) the heavy worker’s virtual_runtime_ns advances at roughly half the rate of its runtime_ns (vruntime/runtime ~= 0.5 for weight=128, ~= 1.0 for DEFAULT_WEIGHT). A scheduler that re-enqueued or dispatched the heavy worker using a stale virtual_finish_ns derived from DEFAULT_WEIGHT would not show the weight-proportional CPU share, and a scheduler that held a stale weight inside charge_runtime would yield heavy vruntime/runtime ~= 1.0 instead of ~= 0.5; the smoke trips on either regression. The capability is bound to CapCallContext::caller_thread (Phase D Task 2 decision), so same-thread self-mutation is the only authorized shape for this proof; cross-thread weight authority remains a Phase H privileged scheduler-policy service concern.

The thread-scale benchmark was repaired before accepting the milestone. The old 1 MiB/spinning-parent shape was not a valid four-core reference because the matching Linux pthread baseline also failed at four workers. The accepted benchmark shape uses a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64. The formal accepted-evidence pair is the capos-bench 2026-05-02 21:38 UTC 5-run pair pinned to physical-core logical CPUs 0,1,2,3 against main commit 374f8556: capOS work 1.883x and total 1.787x clear the configured 1.6x gates, while the matching Linux pthread baseline records 1.988x/1.987x. Its 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy: capOS 1.566x/1.538x versus Linux 3.963x/3.858x, a clear bottleneck in the then-current single-global-queue scheduler. Phase D’s WFQ evidence on 2026-05-10 manually accepted the recorded 1-to-4 diagnostic with capOS 3.088x/2.700x and matching Linux 3.974x/3.850x on the same host/CPU pin set. The harness still enforced only the configured 1-to-2 work/total speedup gates. Historical pre-collapse 1-to-2 (1.828x/1.687x) and the post-collapse 3-run diagnostic on capos-bench 2026-05-02 10:42 UTC (1.890x/1.792x, 1.504x/1.436x) remain in docs/benchmarks.md for reference. Four-worker capOS scaling was a follow-up rather than a completed claim under the pre-collapse model: the unsuppressed diagnostic recorded 1-to-4 work/total speedups 3.029x/2.386x, while suppressing scheduler switch logs recorded 3.272x/2.303x; remaining guest-measure evidence pointed at global Scheduler lock contention plus exit/join/block/schedule overhead, and normal scheduler-owned execution is still capped at temporary CPU slots 0-3. Each process currently owns one or more Thread records; each thread owns its saved CPU context, kernel stack, FS base, block state, and – since Phase D Task 2 – the WFQ ordering inputs weight: u16, latency_class: LatencyClass, and virtual_finish_ns: u64. The Phase D constants in capos-abi/src/scheduler.rs set the defaults weight = DEFAULT_WEIGHT and latency_class = LatencyClass::Normal, so unmodified workloads observe no behavior change versus the pre-Phase-D scheduler. virtual_finish_ns is recomputed on every enqueue (Task 2 ships the derivation; Task 3 will consume it for ordered insertion) and is not meaningful while the thread is blocked.

Phase D Task 2 split the per-thread CPU accounting record so the WFQ-load- bearing fields are available in the normal qemu build: runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional; context_switches, preemptions, voluntary_blocks, migrations, last_cpu, and the *_runtime_stable_observed and blocked/exited bookkeeping stay behind the measure feature because they are pure operator-observability counters that do not participate in dispatch ordering and need a separate operator snapshot path. runtime_ns advances 1:1 with elapsed CPU time, while virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight so per-thread weight changes the cumulative WFQ share rather than only the enqueue tag. The runtime-charge path is invoked when a current thread stops running through timer preemption, blocking cap_enter or park, thread/process exit, or direct switch/handoff paths that select another current thread; the wrapping helpers in kernel/src/sched.rs route through Process::charge_thread_runtime / Process::account_thread_scheduled unconditionally now.

The SchedulingPolicyCap cap surface mutates these per-thread fields through the caller-thread fallback binding selected in Phase D Task 2: every method (setWeight, setLatencyClass, snapshot) routes to CapCallContext::caller_thread, so a holder can only mutate or observe its own running thread. Cross-thread or cross-process authority is reserved for the Phase H privileged scheduler policy service. The SchedulingPolicyCap.snapshot reply intentionally exposes only the four fields promoted out of the measure feature gate; context_switches/preemptions/voluntary_blocks/migrations are benchmark-only and a future operator-observability slice may add them through a separate cap. The BSP scheduler tick normally arrives through the local APIC timer on vector 48 with LAPIC EOI after calibrating the LAPIC initial count against PIT channel 2; if LAPIC setup or calibration is unavailable, the kernel falls back to the legacy PIT/PIC IRQ0 path on vector 32. On each user-mode timer tick (kernel-mode ticks bypass the scheduler entirely through kernel_timer_interrupt_handler, as described under Design), the kernel wakes timed-out or satisfied cap_enter and park waiters, processes the current thread’s ring endpoint in timer mode, saves the current thread context, picks the next ready thread from the single global run queue (the earlier per-CPU local-first / steal scan was retired with the queue collapse), switches CR3 when needed, updates the current CPU’s kernel-entry stack through the per-CPU hook, restores FS base, mirrors the next ThreadRef into the current PerCpu, and returns to the next user context.

When APs are online and their LAPIC timers start, scheduler CPU slots 0-3 can temporarily own scheduler/user execution. The earlier AP-owner proof kept the BSP in kernel idle; the current same-process scaling slice allows sibling threads with distinct ring endpoints to run on different scheduler CPUs while processes that hold broad launch/authority caps or live endpoint objects remain pinned to the legacy single-owner CPU. Additional APs beyond CPU 3 stay in kernel idle until a later scheduler-owner policy replaces the temporary CPU mask. The runnable queues are a per-CPU array of VecDeque<ThreadRef> shared by the scheduler-owned CPUs under the global scheduler lock and ordered ascending by virtual_finish_ns; process/thread metadata remains shared under that lock. A bounded steal path migrates the most overdue sibling candidate (each sibling queue’s first entry that the destination CPU considers Runnable) when a CPU’s local queue has no runnable entry.

Syscall entry initializes kernel GS with swapgs, saves the user RSP through the GS-relative PerCpu.user_rsp slot, and switches to the GS-relative PerCpu.kernel_rsp slot. Normal syscall returns swap back before sysretq. Blocking cap_enter, process exit, and ThreadControl.exitThread paths that leave through scheduler iretq restore use restore_context_after_syscall so GS ownership is returned to userspace before the next user context resumes.

Timer.sleep records a bounded scheduler waiter keyed by caller ThreadRef, user data, and an absolute monotonic deadline_ns. Due sleeps validate the thread generation, post an empty completion directly to the caller’s CQ, and then flow through the same blocked cap_enter wake scan as other completions. Each process has a separate sleep waiter quota, so one Timer holder cannot fill the global sleep queue by itself.

ThreadControl.setFsBase validates runtime-provided FS bases as user-canonical addresses, updates the caller thread’s saved FS base, and writes the CPU FS base immediately when the caller is the running thread. There is no process-global FS base; context switch treats FS base as per-thread state.

The initial thread still uses the compatibility ring at RING_VADDR, while each spawned child thread receives a kernel-chosen ring mapping in the process ring arena. Run queues, per-CPU current, direct IPC handoff, Timer sleep waiters, process/terminal waiters, endpoint caller/receiver records, and deferred cancellation CQEs store generation-checked ThreadRef values and route completions to the target thread’s ring endpoint. Process-owned thread and kernel-stack ledger limits are enforced by ThreadSpawner.create before additional thread records become runnable. The frozen contract is in In-Process Threading. Park wait uses a separate Blocked(Park { ... }) reason and park timeout/wake completions use reserved CQE credits before marking generation-checked waiter threads runnable. The authority and ABI contract is in Park Authority.

cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If the requested completion count is not available and the timeout permits blocking, the current thread enters Blocked(CapEnter { ... }) and the syscall entry path switches to another runnable thread.

The LAPIC user-timer path enters sched::schedule() unconditionally on every tick. An earlier slice carried a bounded user-mode continuation fast path with a per-CPU one-skip budget and a release/acquire slow-path-required summary; that path has been retired (see docs/backlog/scheduler-evolution.md “Cleanup: Retire Benchmark-Driven Scaffolding Before Phase D”). The fast path saved at most one scheduler entry every other tick on an uncontended single-CPU-effective scheduler while paying for shadow-state publication on every slow-path exit, so the simpler always-schedule shape is preferred until a future Phase D or Phase F slice ships an evidence pair where the fast path measurably reduces scheduler-lock hold time on a contended SMP run.

When endpoint delivery satisfies a blocked server RECV, the scheduler can set a direct IPC target. The next scheduling decision runs that server before ordinary round-robin work when it is ready and its ThreadRef generation still matches the captured direct target. When the direct slot is unavailable, endpoint completions fall back to the queued path with WakePolicy::QueueCpu(slot) targeting the current CPU’s per-CPU queue, so the wake scan probes the placed CPU first.

Design

The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next thread. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.

There is no longer a slow-path-required summary or a per-CPU skip budget for the user-mode timer path. Every user-mode LAPIC timer tick enters sched::schedule(), which services run-queue entries, direct IPC targets, deferred process termination/drop and thread-stack cleanup, Timer sleep waiters, and blocked threads with timer-backed cap_enter or Park timeouts under the scheduler lock. Those timeout paths compare absolute monotonic deadlines, but periodic ticks still decide when the checks run. Ring SQEs and ordinary cap waiters run on the same per-tick cadence. Kernel-mode timer ticks (e.g., on AP cores parked in the kernel idle loop) still go through kernel_timer_interrupt_handler, which sends EOI without entering the scheduler. The shared advance_bsp_tick helper still increments the compatibility TICK_COUNT only on CPU 0; normal runtime accounting and timeout comparisons use monotonic_ns() instead. Future per-CPU fair-share slices may reintroduce a continuation path under explicit Phase D or Phase F authority; until then the always-schedule shape keeps the scheduler’s authority over thread metadata and runnable ownership single-source.

The runnable queues keep a single-owner contract behind the global scheduler lock. A live generation-checked ThreadRef may have at most one runnable dispatch owner across per-CPU current/handoff_current slots, the per-CPU run queues, and the single direct_ipc_target preference slot. Blocked waiters, sleep waiters, park waiters, endpoint state, process waiters, and join waiters are not runnable owners; they may make a thread ready only after liveness and generation checks succeed.

Migration between per-CPU queues is represented as a scheduler-lock- contained transfer, not as a second published owner. The source owner is removed or popped first and the ThreadRef is then inserted in the destination queue at the position determined by a freshly recomputed virtual_finish_ns, or selected as the next running thread. virtual_runtime_ns travels with the thread; virtual_finish_ns is recomputed at every enqueue and never carried as committed state, so weight or class mutations applied while the thread was blocked take effect on the next dequeue and re-enqueue. Retry paths requeue the candidate after dropping duplicate queued copies. Direct IPC keeps its preference slot only while the target remains live and runnable; if the direct target cannot run immediately, it falls back through the normal queued-owner path on the current CPU’s per-CPU queue.

Idle-to-runnable wake targeting reuses the same ownership boundary. A thread that becomes ready through endpoint completion, timer sleep, park wake, process wait, or thread join is pushed to the placement target’s per-CPU run queue, and wake_idle_scheduler_cpus_locked first probes the placement target when the policy is QueueCpu, then walks eligible idle scheduler CPUs to wake the first that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking the scan, so a burst of ready work cross-wakes more than one neighbor instead of stranding the rest behind one already-targeted CPU. Direct IPC uses the same path. Measurement builds expose aggregate and per-phase counters for wake scans, eligible idle CPUs, targeted CPUs, IPIs sent, already-pending IPI skips, not-ready target skips, missing LAPIC targets, and send failures.

Each per-CPU run queue is reserved up to the live runnable-capable thread count before publication; the shared live reservation count is released on process/thread exit or pre-publication rollback. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes. Timer preemption, unblock, direct- IPC fallback, requeue, and steal-requeue paths therefore must not allocate while the thread is already live.

Process and thread exit cleanup proves the removal side of that ownership contract at the cleanup site. After removing queued owners and clearing a matching direct IPC target, the scheduler lock remains held while the kernel scans every per-CPU runnable queue and the direct target slot; any stale exiting process or thread reference is a kernel assertion failure. The focused spawn smoke asserts the corresponding serial proof markers on exercised process and thread exit paths.

The Phase C migration order is constrained by hardware state, not only by scheduler data structures. The first gate moved syscall entry/exit off BSP-symbol-relative PerCpu fields and onto KernelGsBase/swapgs on user syscall paths, including blocking cap_enter, exit, and ThreadControl.exitThread paths that leave through iretq rather than the normal sysretq epilogue. The second gate added xAPIC initialization, a PIT-calibrated BSP LAPIC timer tick, LAPIC EOI routing, AP LAPIC initialization, a LAPIC spurious-vector handler, and an IPI vector plus bounded vector-49-only fixed IPI send primitive. The third gate added address-space resident CPU masks, per-CPU pending full-TLB flush generations, completion waits, and a vector-49 TLB shootdown handler for user page-table map, unmap, and protect. The fourth gate split current-thread tracking into per-CPU slots, registers AP PerCpu records for current-thread and syscall stack mirrors, updates AP TSS.RSP0 on context switches, and hands the single scheduler-owner role to AP cpu=1 when it is online with a programmed LAPIC timer.

The LAPIC slice replaces the BSP-oriented PIT/PIC scheduler tick on supported QEMU and hardware paths. kernel/src/arch/x86_64/idt.rs keeps vector 32 for the PIT/PIC fallback, reserves vector 48 for LAPIC timer delivery plus vector 49 for cross-CPU requests, and installs vector 255 for LAPIC spurious interrupts. pic.rs can remap and mask all legacy IRQs once LAPIC ticks are active, and context.rs sends LAPIC EOI or PIC EOI according to the active timer source. The IPI vector now handles TLB shootdown requests and bounded reschedule requests for AP idle-to-runnable handoff.

The TLB slice wraps user page-table mutations that can affect an address space resident on another CPU. AddressSpace::map, AddressSpace::unmap, and AddressSpace::protect still perform the local x86_64 mapper flush, then call the architecture shootdown helper with the address space’s resident CPU mask. The helper records pending full-TLB flush generations for online resident CPUs other than the caller, sends vector-49 IPIs, and returns a completion token. Capability handlers drop the address-space guard and enqueue completion work; cap_enter and timer polling drain that queue after ring dispatch releases the cap-table and scratch locks. This keeps a remote syscall that is contending on the same process locks from blocking maskable IPI delivery forever. Capability handlers reserve fixed-size deferred queue slots before page-table mutation, so full queues fail closed as capability overload errors instead of surfacing after rollback, unmap, or protect has already changed state. Drains flush the current CPU before waiting so a CPU that is itself in the target mask cannot wait on its own pending generation. Target CPUs drain the generation in the IPI handler, at syscall entry, or before returning to userspace from syscall, timer, and scheduler restore paths. Generation counters avoid losing overlapping shootdowns while a target CPU is already draining a prior request. This relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Callers include VirtualMemoryCap dispatch through parse_map, parse_unmap, and parse_protect, plus MemoryObjectCap::{map,unmap,protect} in kernel/src/cap/frame_alloc.rs. Scheduler CR3 handoff now marks the selected address space resident on the current CPU, including AP cpu=1 during the AP scheduler-owner proof.

Idle paths

There are two distinct idle paths, and both run genuine CPL0 (kernel-mode) idle. There is no user-mode idle process: when no real work is runnable a CPU runs the kernel idle code at CPL0 on the kernel PML4. The two paths differ only in how the CPU got there.

The cooperative CPL0 kernel-mode idle path is the boot/AP path. start (BSP), start_ap (APs), and the start_current_cpu loop call next_start_context; when that returns no real runnable work they fall into idle_current_cpu_once, which hlts at CPL0 on the per-CPU kernel stack with interrupts enabled (no CpuContext, no restore_context — the same way start_current_cpu itself runs). A kernel-CPL timer tick or reschedule IPI taken during that hlt runs the kernel-mode handler (kernel_timer_interrupt_handler / handle_reschedule_ipi, both of which call nohz_recheck), so the nohz one-shot deadline is preserved and re-armed across the hlt; control then returns to the loop, which re-checks for work. idle_current_cpu_once increments the KERNEL_IDLE_HLT_ENTRIES counter and emits a bounded cpu-isolation: kernel-idle hlt cpu=… idle_path=cooperative-cpl0 … nohz_active=… timer_source=… log line so this path is observable from the kernel log; the run-scheduler-cpu-isolation-lease smoke asserts it is reached. Once any dispatch path restore_contexts into a real thread, the start_current_cpu frame is abandoned.

The steady-state CPL0 idle-thread path is reached from the four interrupt/syscall-return dispatch call sites — schedule() (timer), capos_block_current_syscall, exit_current, and exit_current_thread. When choose_next_locked falls through to this CPU’s idle thread, each site builds the dispatch tuple from the per-CPU CPL0 idle-thread context. The dispatch call sites hand a CpuContext to assembly that restore_contexts (or, for the timer path, return a context pointer plus a CR3 the timer handler loads), so they need a schedulable context when no real work is runnable; the CPL0 idle context is that context.

CPL0 idle-thread context infrastructure. arch::smp::init_idle_kernel_stacks allocates one dedicated CPL0 idle kernel stack per scheduler CPU slot from fresh contiguous frame ranges, so they do not overlap the boot kernel stacks, the per-thread kernel stacks, or the IST slots. CpuContext::new_cpl0_idle builds a kernel-shaped context (kernel-code/kernel-data selectors, rip = kernel_idle_entry, rsp into the idle kernel stack). sched::sched_init, called from kmain, constructs and stores one CpuContext per CPU slot in CPL0_IDLE_CONTEXTS and then calls register_idle_process_locked to seed the slot-0 synthetic idle Process record before the scheduler runs (this keeps the BSP idle process’s low PID and the init-process PID ordering stable); the remaining per-CPU slots are registered lazily by current_cpu_idle_thread_locked the first time their CPU reaches idle. sched_init panics on OOM, as does the lazy path: the CPL0 idle contexts and the synthetic idle records are scheduler idle infrastructure and there is no fallback idle path, so a failure to build them is unrecoverable. The idle kernel stack is sized as a full per-thread kernel stack (PROCESS_THREAD_KERNEL_STACK_PAGES), not an IST slot, because kernel_idle_entry runs the deep service_periodic_work() call chain on it (see periodic-service parity below).

Synthetic idle process records. The idle thread is never a runnable user-mode process. The synthetic idle Process (Process::new_idle) maps no user code, no user stack, and no cap ring, and carries an empty cap table. It exists only so the idle ThreadRef resolves through sched.processes and the scheduler’s ThreadRef-centric bookkeeping — set_thread_state, account_thread_selected_locked, current-thread tracking, and the is_idle_thread guard predicate used pervasively across the scheduler — keeps working unchanged. Its address_space is a bare page-table root with nothing user-mapped; it is required by the Process struct but is never loaded as CR3. Every idle dispatch site routes the CPU onto the kernel PML4 via the CPL0 idle context, so the synthetic idle AddressSpace is never made resident and never participates in resident_cpu_mask or TLB-shootdown idle-residency handling.

Dispatch-tuple rewire. After choose_next_locked returns, when the chosen thread is idle_threads[current_cpu_slot()], each dispatch site builds the dispatch tuple from the CPL0 context pointer, the dedicated idle kernel stack top, the kernel PML4 CR3, and the current FS base (no FS-base change). sched_init builds one CPL0 idle context per scheduler CPU slot or panics, so cpl0_idle_context(slot) is infallible at every dispatch site. The schedule() timer path does not route through a dedicated CR3-loading restore helper: the existing timer_interrupt_handler already loads the tuple’s CR3 with write_cr3 before the privilege-agnostic five-element iretq. The three syscall-path sites (capos_block_current_syscall, exit_current, exit_current_thread) keep their restore_context_after_syscall restore tail: they are entered via syscall_entry (which already executed swapgs), so the exit swapgs is required to leave the CPL0 idle thread running with the user GS base — the same GS-base state the timer path’s CPL0 idle thread runs with. Each site emits a distinct marker: sched: dispatch idle cpu=N idle_path=cpl0-dispatch-timer (timer), …cpl0-dispatch-block (blocking syscall), and …cpl0-dispatch-exit (both exit_current and exit_current_thread). debug_assert!s guard the CPL0 dispatch tuple: context cs/ss are the kernel selectors and their RPL bits are 0.

CPL0 idle periodic-service parity. schedule()’s timer Phase 2 runs periodic service work on every tick — deferred process drops, pending terminations, wake_cap_waiters, service_sqpoll_workers(), drain_pending_endpoint_cancellations(), terminal_session::poll_input(), virtio::poll_scheduler(), and the network / pipe / interrupt poll_waiters() calls. A CPL0 idle thread’s timer ticks are kernel-mode and go through kernel_timer_interrupt_handler, which never enters schedule() — so without explicit parity handling that servicing would be stranded whenever a CPU is parked on the CPL0 idle thread. That work is factored into a single service_periodic_work() function with one lock discipline: the scheduler lock is taken only for the bounded deferred-drop / thread-stack-release / wake_cap_waiters / pending-termination extraction, then dropped before drop_pending_process / finish_terminated_process and the lock-free poll block. schedule() calls it after ring dispatch; kernel_idle_entry is its own cooperative loop that, each iteration, runs service_periodic_work(), then next_start_context(false) to re-dispatch a real runnable thread the moment one appears (allow_idle = false so it never re-selects the idle thread), then idle_current_cpu_once() to hlt. The re-dispatch is required: without it a kernel-mode timer tick taken during the idle hlt returns through kernel_timer_interrupt_handler, which does not re-enter schedule(), so the CPU would be stranded. service_periodic_work() and next_start_context() run with interrupts disabled in that loop — the CPL0 idle context is built IF=1 so the periodic tick can preempt the hlt, so the loop must cli before the deep service call; otherwise a CPL0 timer tick taken during service_periodic_work() nests a kernel_timer_interrupt_handler frame onto the idle kernel stack (same-privilege interrupts do not switch stacks). idle_current_cpu_once re-enables interrupts only across its enable_and_hlt and disables them again before returning. There is no double-service: a CPU running a real thread gets the service block via schedule(), a CPU on the CPL0 idle thread gets it via the kernel_idle_entry loop, and a given tick on a given CPU is CPL3 (schedule()) xor CPL0-idle (the loop). nohz cadence stays honest because the loop iterates at the timer/IPI cadence — when the periodic tick is suppressed the re-armed one-shot still wakes the hlt, so service_periodic_work() still runs.

iretq CPL0 restoration invariant and CPL0 idle-thread prerequisites

This subsection records the load-bearing x86-64 architectural invariant that any future CPL0 idle-thread context migration must satisfy, along with the prerequisites the implementation will need to meet.

Authoritative reference: Intel 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volume 2A, IRET/IRETQ instruction reference, “Operation” pseudocode (the IF OperandSize = 64 / 64-bit-mode path), and Volume 3A, Section 6.14.3 “Returning from an Exception or Interrupt Procedure.” The description below applies to IRETQ in 64-bit long mode; the legacy 32-bit IRET paths behave differently and are called out explicitly where it matters.

iretq frame layout and the 64-bit unconditional five-element pop. iretq in 64-bit long mode unconditionally pops five 64-bit (8-byte) values from the top of the current kernel stack, in order: RIP, CS, RFLAGS, RSP, SS. This is true regardless of whether the privilege level changes — both a CPL0→CPL3 return and a CPL0→CPL0 return consume the same five-element frame and load RSP:SS from it. AMD deliberately removed the legacy conditional stack switch for long mode: the “skip SS:ESP on a same-privilege return” behavior exists only in the legacy 32-bit IRET operand-size paths, never in IRETQ.

  • CPL0 → CPL3 (privilege change, ring exit): The target CS has RPL=3, which differs from the current CPL=0. The CPU installs RIP, CS, and RFLAGS from the frame, then loads RSP and SS from the same frame and transfers to the user-space instruction at RIP on the user stack.
  • CPL0 → CPL0 (same-privilege, no ring change): The target CS has RPL=0, matching the current CPL=0. iretq still pops all five elements: it installs RIP, CS, and RFLAGS, and also loads RSP and SS from the frame, exactly as in the CPL3 case. There is no same-privilege short-circuit in 64-bit mode. The practical consequence for a CPL0 restore is the opposite of the legacy intuition: the frame’s rsp and ss fields are load-bearing and must carry a valid kernel stack pointer and a valid RPL=0 stack selector, because the CPU will load them.

Current code. restore_context (kernel/src/arch/x86_64/context.rs lines 311–328) sets RSP to the supplied CpuContext pointer, pops all fifteen caller-saved and callee-saved GPRs (lines 315–327), and executes iretq (line 328). The CpuContext struct (context.rs lines 133–155) places rip, cs, rflags, rsp, and ss at the high end of the struct (lines 150–154), matching the hardware interrupt-frame layout that the CPU pushes when it enters the timer interrupt handler. The comment at line 149 (“Pushed by CPU on interrupt from Ring 3”) reflects how every CpuContext is populated today, but the five-element iretq frame itself is not CPL3-specific — iretq consumes the same five elements for any target CPL.

User-thread contexts. Every user-thread CpuContext is built by Thread::new_user (kernel/src/process.rs), which sets cs = sel.user_code.0 as u64 (RPL=3, value 0x23) and ss = sel.user_data.0 as u64 (RPL=3, value 0x1B). Every iretq issued by restore_context or restore_context_after_syscall into a user thread is therefore a CPL0→CPL3 privilege change into a fully user-shaped context.

CPL0 idle contexts coexist with user contexts. The blocker for a CPL0 target is not iretq frame arithmetic: iretq pops the same five elements for a CPL0 target as for a CPL3 target, so a frame carrying kernel selectors and a valid kernel rsp iretqs correctly. The real requirements are in the surrounding dispatch plumbing, all of which the CPL0 idle path satisfies:

  • CR3. The dispatch call sites set CR3 to the kernel PML4 for the CPL0 idle path, not to any user AddressSpace page table. The synthetic idle Process’s AddressSpace is never loaded as CR3.
  • swapgs / GS-base. A CPL0 idle context was never entered through the syscall path. The schedule() timer path reaches it through the timer handler’s own CR3 load and the privilege-agnostic iretq tail (no swapgs in that path at all). The three syscall-path sites (capos_block_current_syscall, exit_current, exit_current_thread) keep their restore_context_after_syscall tail: those sites were entered via syscall_entry (which already swapgsed), so the exit swapgs is required to undo it — leaving the CPL0 idle thread running with the user GS base, the same state the timer path produces.
  • Kernel-code and kernel-data selectors. A CPL0 CpuContext uses cs = sel.kernel_code.0 as u64 (RPL=0, value 0x08) and ss = sel.kernel_data.0 as u64 (RPL=0, value 0x10). Because iretq loads ss unconditionally in 64-bit mode, ss must be a valid RPL=0 stack selector; the GDT data-selector privilege checks require an RPL=0 ss to be paired with an RPL=0 cs, so the whole context (cs, ss, rsp, CR3, GS base) is kernel-shaped together.
  • Idle kernel stack. Each CPL0 idle thread has its own dedicated kernel stack (arch::smp::init_idle_kernel_stacks) that does not overlap any IST slot, any per-thread kernel stack, or the BSP/AP boot stacks. Because iretq loads rsp from the frame, the context’s rsp points into this dedicated stack. It is sized as a full per-thread kernel stack because kernel_idle_entry runs the deep service_periodic_work() call chain on it.
  • No user AddressSpace residency. The synthetic idle Process’s AddressSpace is never made resident and never participates in resident_cpu_mask, so TLB shootdown never stalls waiting for an idle CPU.
  • No blocking, no exit. The idle thread never calls cap_enter, parks, blocks on any waiter, or exits. The Invariants section entry “The idle thread must never block in cap_enter or exit” carries forward unchanged.

CpuContext::new_cpl0_idle builds the kernel-shaped context, sched::kernel_idle_entry is the entry point, and sched::sched_init wires the per-CPU CPL0 idle contexts and seeds the slot-0 synthetic idle process record (the remaining slots’ records are registered lazily by current_cpu_idle_thread_locked). All four dispatch call sites — schedule(), capos_block_current_syscall, exit_current, exit_current_thread — route idle dispatch onto the CPL0 idle context: the timer path returns the CPL0 context pointer plus the kernel PML4 CR3 in its dispatch tuple and relies on the existing timer_interrupt_handler CR3-load; the three syscall-path sites keep their restore_context_after_syscall tail so the syscall-entry swapgs is undone. The CPL0 contexts are kernel-shaped across cs, ss, rsp, and CR3 together.

Measurement Policy

Design grounding for this policy: this document’s scheduler invariants, docs/backlog/scheduler-evolution.md, docs/proposals/scheduler-evolution-proposal.md, docs/research/future-scheduler-architecture.md, docs/research/out-of-kernel-scheduling.md, docs/research/nohz-sqpoll-realtime.md, and docs/research/completion-ring-threading.md. In particular, docs/research/future-scheduler-architecture.md keeps the always-on versus benchmark-only scheduler telemetry split as an open scheduler question, and the current answer is intentionally conservative.

The current kernel/src/measure.rs counters are benchmark instrumentation, not normal operator observability. They stay behind the measure feature and CAPOS_THREAD_SCALE_GUEST_MEASURE=1 because they add atomics, cycle-counter reads, phase bookkeeping, and in some cases sampled user RIP values to hot scheduler, timer, TLB, ring, and serial paths. Normal QEMU and dispatch builds must not depend on those counters being present.

The per-thread runtime-accounting ledger is split. The WFQ load-bearing core fields, runtime_ns, virtual_runtime_ns, and last_started_ns, are unconditional normal-build state on ThreadCpuAccounting: WFQ ordering, SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend on them outside cfg(feature = "measure"). The diagnostic fields (context_switches, preemptions, voluntary_blocks, migrations, last_cpu, blocked/exited stability probes, placement buckets, and per-phase attribution counters) stay behind the measure feature. Permanent operator observability is still separate work: it should expose low-rate, non-symbolic snapshots derived from the unconditional ledger plus event counters such as runnable queue depths or high-water marks, reschedule IPI sent/failed/pending counts, TLB shootdown request/failure counts, and scheduler policy admission or denial counts. Those counters must not allocate, log, read raw user PCs, or perform cycle-timing in timer, unblock, direct-IPC fallback, requeue, or steal-requeue paths.

Benchmark-only attribution stays in measure: per-phase thread-scale checkpoints, guest cycle timings for ring/capnp/method/scheduler segments, scheduler-lock wait and hold cycles, scheduler-lock site attribution, serial byte attribution, timer-mode breakdown, CR3/TLB event totals, thread-placement selection/migration buckets, raw user-PC samples, logging-suppression A/B evidence, and workload/cacheline diagnostics. The publish-placement publish/caller-aware buckets were retired with the per-CPU run-queue collapse. Phase D shipped the fair-share enqueue policy but did not reintroduce those placement counters. A future branch may promote a specific event count only by adding the normal-build storage/API and proving the same emergency-path constraints; it should not simply remove the current cfg(feature = "measure") boundaries from the benchmark module.

The publish-placement publish/caller-aware buckets are still retired; Phase D Task 3 brought back per-CPU placement semantics but does not re-emit the publish counters. Re-instate them through a separate operator-observability slice that proves the same emergency-path constraints, not by removing the existing cfg(feature = "measure") boundary on the historical buckets.

Tickless idle is enabled only for true idle. A scheduler-owned CPU may mask the periodic LAPIC tick when it is running the CPL0 idle context, has no runnable non-idle work, has no active CpuIsolationLease nohz record, has no local deferred cleanup, has no cap-enter polling dependency, and the one-shot clockevent plus non-tick-derived monotonic clocksource are available. The replacement one-shot is bounded by the nearest Timer/ParkSpace deadline or a 100 ms idle housekeeping floor, and the scheduler restores periodic mode before non-idle dispatch, reschedule-IPI wake, or rollback. Cap-enter polling waiters, including the current terminal shell path, and ready threads paused in a SchedulingContext retry window keep the periodic tick until those dependencies move behind explicit deadlines or housekeeping placement.

Generic full-nohz for ordinary budgeted compute threads carries the clockevent/deadline substrate into the CPU-isolation state machine and suppresses ticks only after network polling, IRQ affinity, accounting, deadline, lifetime, and rollback obligations pass. SQPOLL nohz applies the same substrate to explicitly leased caller-thread rings once the SQPOLL worker is live and the single-consumer, owner-lease, wake, and rollback gates pass. Automatic policy issuance and broader SQPOLL userspace-poller/device-queue admission remain separate later CPU-isolation features; see Tickless and Realtime Scheduling and NO_HZ, SQPOLL, and Realtime Scheduling.

Exit switches to the kernel PML4 before tearing down the exiting address space, releases capability authority, completes process waiters, defers final process teardown until the scheduler is running on another kernel stack, and then releases remaining thread kernel stacks through the scheduler-owned OffStackToken path before the Process value is dropped.

Invariants

  • The idle thread must never block in cap_enter or exit.
  • Ring dispatch must not hold the scheduler lock.
  • Timer dispatch copies current-process user buffers through that process’s locked AddressSpace; it must not rely on a raw current-CR3 validate/use window.
  • Blocked cap_enter waiters wake when enough CQEs are available or their finite timeout expires.
  • Timer sleep waiters must be bounded per process, tied to the caller ThreadRef generation, and removed when the caller process exits.
  • Runtime-controlled FS bases must stay in user canonical space.
  • Direct IPC handoff is a scheduling preference, not a bypass of process liveness, generation, or state checks.
  • The scheduler must update TSS.RSP0 and the per-CPU syscall kernel RSP through percpu::set_kernel_entry_stack on each switch.
  • Each PerCpu.current_thread mirrors that CPU’s scheduler current slot; the scheduler lock remains the authority for current-thread and queue ownership even though dispatch/runnable state is now separate from shared process and thread metadata.
  • Each live ThreadRef may appear in the per-CPU runnable queues at most once across all queues, and every per-CPU queue’s capacity must be reserved up to the live runnable-capable thread count before a new process or thread becomes runnable.
  • A live generation-checked ThreadRef must have at most one runnable dispatch owner across per-CPU current/handoff_current slots, the per-CPU runnable queues, and the direct IPC target.
  • Queue migration (including the bounded steal path) must be a scheduler-lock-contained remove-before-publish transfer; no path may publish the same ThreadRef twice into any queue or leave a stale direct target after exit. Migration must recompute virtual_finish_ns at the destination and never carry the source’s WFQ tag as committed state.
  • Each per-CPU run queue must remain ordered ascending by virtual_finish_ns after every enqueue, requeue, or steal-requeue. Local selection scans the queue by index for the first destination-Runnable entry; RetryLater entries are left in place for the next scheduler pass. The bounded steal path scans each sibling queue’s indices ascending for that queue’s first Runnable-for- destination entry — because each queue is ordered ascending, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source — then picks the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally, with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head).
  • Process and thread exit cleanup must assert, before releasing the scheduler lock, that the exiting process or thread has no remaining entry in any per-CPU runnable queue and no remaining direct IPC target slot.
  • Timer, unblock, direct-IPC fallback, requeue, and steal-requeue paths must use reserved run-queue capacity and avoid allocation.
  • Runtime accounting must use the normal monotonic clocksource, not benchmark-only cycle counters, and must charge only running intervals.
  • FS base is saved and restored across context switches for TLS.
  • Thread records remain generation-checked ThreadRef identities; exited records are retained only while a live handle, pending join, or unjoined status can still observe them.
  • The final teardown of an exiting process must not release thread kernel stacks until another kernel stack is active, and the implicit Thread::Drop path must not free kernel-stack frames.
  • A scheduler CPU must never run the same generation-checked ThreadRef twice at once; same-process siblings may run on different scheduler CPUs only when their completions route through distinct per-thread ring endpoints.
  • Park waiters must be keyed by generation-checked ThreadRef values, reserve one waiter CQE credit, and must not allocate in wait, wake, timeout, or process-exit cleanup paths.

Code Map

  • kernel/src/sched.rs - shared process table plus SchedulerDispatch ownership of the per-CPU runnable queues (ordered ascending by virtual_finish_ns), per-CPU current/handoff slots, idle-thread slots, direct IPC target, run-queue reservation accounting, pending drops, and pending stack releases; also blocking, wakeups, Timer sleep waiters, the bounded steal path, and exit.
  • kernel/src/arch/x86_64/context.rs - CPU context layout, timer entry/restore, tick counter.
  • kernel/src/arch/x86_64/idt.rs - timer and IPI interrupt handler wiring.
  • kernel/src/arch/x86_64/lapic.rs - xAPIC MMIO setup, PIT-calibrated LAPIC timer, LAPIC EOI, spurious-vector handling, and fixed-IPI send primitive.
  • kernel/src/arch/x86_64/tlb.rs - serialized vector-49 TLB shootdown request, pending flush generations, completion token, and interrupt/user-return drain path.
  • kernel/src/arch/x86_64/pic.rs and kernel/src/arch/x86_64/pit.rs - legacy PIC remap and PIT fallback setup.
  • kernel/src/arch/x86_64/gdt.rs - BSP/AP TSS and kernel stack storage.
  • kernel/src/arch/x86_64/syscall.rs - blocking syscall transition for cap_enter.
  • kernel/src/arch/x86_64/percpu.rs - per-CPU syscall stack registry, TSS.RSP0 update hook, and current thread storage.
  • kernel/src/arch/x86_64/tls.rs - FS base save/restore.
  • kernel/src/process.rs - process state, kernel stacks, the synthetic idle process record, and per-thread CPU accounting storage/accessors.

Validation

  • make run-smoke validates timer preemption, ring fairness, direct IPC handoff, blocked cap_enter wakeups, process exit, and clean halt.
  • make run-spawn validates process wait blocking and child exit completion through ProcessHandle.wait, Timer monotonic now/sleep completion through timer-smoke, per-process sleep quota isolation through timer-flood, and thread/park lifecycle behavior through thread-lifecycle.
  • make run-measure validates the post-thread park blocked/resume timing path and process exit while a park waiter is parked.
  • cargo build --features qemu verifies QEMU-only scheduler and halt paths.
  • QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.

Open Work

  • Prove SQPOLL/poller progress that does not depend on periodic scheduler ticks before automatic nohz activation. Then implement tickless idle only for no-runnable-work CPU idle. Keep runnable contention on periodic preemption until the activation proof closes the remaining network polling, IRQ affinity, and housekeeping dependencies.
  • Keep SMP behind per-CPU scheduler state and review of any path that needs page pinning beyond the AddressSpace-locked copy/read contract.
  • Implement the remaining SMP Phase C slices: split shared scheduler metadata, replace the temporary scheduler-owner mask, and collect accepted benchmark evidence.
  • Add priority or policy scheduling only after the current authority and IPC semantics remain stable.
  • Add service restart policy outside the static boot graph.

Programming Languages

capOS currently supports native Rust programs that are written for the capOS userspace runtime. Other languages are design tracks, not implemented platform support. The main rule is simple: a language runtime may expose familiar APIs, but authority still comes from the process CapSet and typed capability calls.

Current Support

Language or runtimeStatusPath
Rust, capOS-nativeImplemented baseline#![no_std], alloc, capos-rt, static ELF, x86_64-unknown-capos. Phase D best-effort fair scheduling closed at commit 77caafc0 (2026-05-10 19:39 UTC): per-thread weighted vruntime, per-CPU WFQ run queues, bounded steal/migration, and SchedulingPolicyCap weight/latency-class authority.
Rust stdNot implementedFuture Rust standard-library or adapter work over capabilities
CPhase 0 in tree (libcapos C-substrate v0 + libcapos-posix v0)libcapos.a exposes capos-rt syscalls, ring CALL, CapSet lookup, a heap shim, typed Console.writeLine, Timer.now, EntropySource.fill, VirtualMemory wrappers, and native ProcessSpawner.createPipe / Pipe wrappers through extern "C"; make run-c-hello proves baseline C wrappers and make run-c-pipe proves a C binary can create a Pipe, write/read a marker, close the writer, and observe EOF without using the POSIX adapter. libcapos_posix.a adds the POSIX adapter v0 surface above libcapos: per-process static fd table (32 fds), TLS errno via __errno_location(), historical UDP socket/sendto/recvfrom/close wrappers over the retired qemu-only kernel UdpSocket cap, clock_gettime(CLOCK_MONOTONIC, ...), gettimeofday(&tv, NULL), time, nanosleep, and sleep over the kernel Timer cap, fail-closed signal stubs, pipe/read/write/dup/dup2 over the kernel Pipe cap, and fork/execve/waitpid/_exit/posix_inherit_stdio plus a direct posix_spawn successor via the recording-shim Move-grant path through ProcessSpawner.createPipe / ProcessSpawner.spawn. See the POSIX adapter row for shipped smokes; the old DNS smoke is retired until resolver networking is rebuilt on the userspace stack.
C++Future experimentDepends on C startup, ABI choices, allocator, exceptions/RTTI policy, and a useful freestanding subset
GoFuture designCustom GOOS=capos per Go Runtime proposal; a separate Phase W.8 path (docs/proposals/wasi-host-adapter-proposal.md Task 9) targets a TinyGo / upstream Go GOOS=wasip1 CUE evaluator binary that runs inside the WASI host adapter against a future ScriptPackage cap
PythonFuture designNative CPython or MicroPython through a POSIX-style adapter; WASI/Emscripten for sandboxed or compute-only use
LuaPhase 1 in tree (L.3 deterministic memory release)demos/lua-smoke/ runs a hand-written Lua-subset interpreter that exercises three capability-aware host bindings: console:write_line, timer:now, and L.3 memory:{alloc,write,read,size,release} over capos-rt::VirtualMemoryClient (kernel-mapped address never crosses to Lua; every byte access is bounds-checked host-side; release unmaps the exact rounded region and marks the userdata dead). PUC Lua dialect compatibility is deferred to the future C/libcapos port. See Lua Scripting proposal.
JavaScript / TypeScriptFuture designQuickJS-style native runner or WASI-hosted engine; not a browser JS shell
WASI / WebAssemblyPhase W.5 landed 2026-05-17 05:42 UTC (Phase W.4 closed 2026-05-07 20:09 UTC; Phase W.3 closed 2026-05-07 18:25 UTC; Phase W.2 closed 2026-05-07 10:53 UTC)Host imports backed by capabilities; useful for sandboxed code and portable tools. W.1 vendored upstream wasmi (v1.0.9) at vendor/wasmi-no_std/wasmi-1.0.9/ and shipped the capos-wasm/ standalone crate that exposes a Runtime value (wasmi Engine + Store<HostState>). W.2 sub-slice 1 added the wasm-host userspace binary in capos-wasm/src/bin/wasm-host.rs, the system-wasm-host.cue focused-proof manifest, and make run-wasm-host, which still asserts the empty-instantiation regression. W.2 sub-slice 2 grew the same binary with the Preview 1 import resolver in capos-wasm/src/wasi/preview1.rs: 46 wasi_snapshot_preview1 imports land on the wasmi linker; clock_time_get(CLOCKID_MONOTONIC) is backed by the manifest-granted Timer cap; proc_exit exits via capos_rt::syscall::exit; fd_write(1, …) / fd_write(2, …) route through the manifest-granted Console cap with a fixed 4 KiB iov-total scratch ceiling and a 1 KiB per-call chunk that matches the kernel Console cap’s MAX_SERIAL_CAP_WRITE_BYTES; everything else (including random_get, which Phase W.4 promotes against EntropySource) returns ERRNO_NOSYS. A 114-byte hand-encoded probe module imports random_get, calls it once, stores the returned errno in an exported global, and the host refuses to print the [wasm-host] preview1 imports linked: clock_time_get, fd_write, proc_exit, args/environ empty; nosys=52 proof line unless that errno equals ERRNO_NOSYS. W.2 sub-slice 3 added demos/wasi-hello-rust/ (a one-liner println! Rust crate built for the upstream wasm32-wasip1 target), system-wasi-hello-rust.cue (now grants console, timer, and the optional boot (BootPackage) cap), tools/qemu-wasi-hello-rust-smoke.sh, and make run-wasi-hello-rust. The wasm-host binary keeps running the sub-slice 1 + 2 regression first; when the manifest grants boot, it also reads the manifest blob through BootPackage, decodes binaries[] via raw capnp readers (new capos_wasm::payload module), instantiates the wasi-payload wasm, explicitly invokes the _start export (wasmi’s instantiate_and_start runs the WebAssembly start section, NOT WASI’s _start), and lets the payload’s println! reach the kernel Console cap through Preview 1 fd_write. capos-rt grew narrow re-exports (capos_capnp and default_reader_options) so capos-wasm keeps a single direct path-dep on capos-rt and the vendored wasmi tree. The slice also kept the W.2 sub-slice 1 userspace-image budget bump (USER_STACK_BASE 0x100_0000) for wasmi’s ~3 MiB BSS. W.2 sub-slice 4 closed Phase W.2 by adding demos/wasi-hello-c/ (a single printf("Hello, wasi from capOS C\n") C main() built directly with system clang-18 against the Ubuntu wasi-libc + libclang-rt-18-dev-wasm32 apt packages: clang --target=wasm32-wasi --sysroot=/usr -O2 -Wall -Wextra produces a ~46 KiB wasm32-wasi module), system-wasi-hello-c.cue, tools/qemu-wasi-hello-c-smoke.sh, and make wasi-hello-c-build / make run-wasi-hello-c. C runs on capOS without any libcapos/POSIX work in tree because the wasm-host payload-load path landed in sub-slice 3 carries the C .wasm payload through the same wasm-host binary unchanged. Phase W.3 backed args_get / args_sizes_get with the manifest-supplied initConfig.init.wasiArgs text grant: the wasm-host walks the field through raw capnp readers in capos_wasm::payload::read_wasi_args, validates against WASI_ARGS_MAX_COUNT = 32 / WASI_ARGS_MAX_ARG_BYTES = 4096 / WASI_ARGS_MAX_TOTAL_BYTES = 8192 (rejecting interior NUL bytes), packs the bytes into a per-instance HostState argv buffer, and reflects them through Preview 1 to the wasm guest. A 2026-05-13 bounded environment grant mirrors that path for initConfig.init.wasiEnv: the wasm-host walks capos_wasm::payload::read_wasi_env, validates against WASI_ENV_MAX_COUNT = 32 / WASI_ENV_MAX_ENTRY_BYTES = 4096 / WASI_ENV_MAX_TOTAL_BYTES = 8192 (rejecting interior NUL bytes), packs KEY=value entries into a per-instance environment buffer, and reflects them through Preview 1 environ_get / environ_sizes_get; absent grants remain empty. The W.2 sub-slice 2 “args/environ empty” proof line stays byte-identical because the regression module passes empty argv and no environment. The new demos/wasi-cli-args/ Rust smoke (println! of argv[1]), system-wasi-cli-args.cue, tools/qemu-wasi-cli-args-smoke.sh, and make wasi-cli-args-build / make run-wasi-cli-args close the per-instance argv plumbing; demos/wasi-env/, system-wasi-env.cue, tools/qemu-wasi-env-smoke.sh, and make wasi-env-build / make run-wasi-env prove one granted environment value reaches a Rust wasm32-wasip1 payload. Schema/schema/capos.capnp is unchanged because initConfig is already a CueValue and unknown sub-fields under initConfig.init are ignored by the existing manifest decoder. Phase W.4 wires Preview 1 random_get through the kernel EntropySource cap. The wasm-host (capos-wasm/src/bin/wasm-host.rs) looks up an optional per-instance EntropySource cap from the CapSet under the well-known name random and installs the typed EntropySourceClient on HostState AFTER the W.2 sub-slice 2 probe regression has run, keeping the closed-fail nosys=52 proof line byte-identical. Preview 1 random_get (capos-wasm/src/wasi/preview1.rs) drains arbitrary wasm-supplied byte ranges through EntropySourceClient::fill_wait, chunked at the kernel cap’s MAX_ENTROPY_FILL_BYTES = 64 ceiling and capped per Preview 1 invocation at RANDOM_GET_MAX_BYTES = 65_536. RDRAND-unavailable / truncated kernel responses surface as ERRNO_IO; oversized requests as ERRNO_INVAL; out-of-bounds wasm pointer writes as ERRNO_FAULT. Manifests without the grant keep returning ERRNO_NOSYS from the closed-fail refusal branch which never enters the kernel, so an instance without an EntropySource grant cannot leak entropy. Wall-clock support stays deferred until capOS has a typed WallClock/RealTimeClock cap; clock_time_get(CLOCKID_REALTIME) keeps the W.2 sentinel ERRNO_NOSYS. The new demos/wasi-random/ Rust smoke (raw Preview 1 random_get binding reading N=64 bytes), system-wasi-random.cue (granted), system-wasi-random-ungranted.cue (ungranted), tools/qemu-wasi-random-smoke.sh, tools/qemu-wasi-random-ungranted-smoke.sh, make wasi-random-build, make run-wasi-random, and make run-wasi-random-ungranted close Phase W.4. A 2026-05-13 compatibility-import smoke adds demos/wasi-stdio-fd/, system-wasi-stdio-fd.cue, tools/qemu-wasi-stdio-fd-smoke.sh, and make run-wasi-stdio-fd; it directly imports clock_res_get(MONOTONIC), sched_yield, fd_fdstat_get(1/2), and fd_seek(1/2) and requires every promoted import to return a non-ERRNO_NOSYS result without granting filesystem, socket, or stdin authority. A 2026-05-13 harness-hardening smoke adds demos/wasi-preview1-refusals/, system-wasi-preview1-refusals.cue, tools/qemu-wasi-preview1-refusals-smoke.sh, and make run-wasi-preview1-refusals; it directly imports path_open, fd_prestat_get, fd_read, sock_send, and sock_recv and asserts the documented fail-closed errno when no Namespace/File/Store/socket authority exists. Phase W.5 (2026-05-17 05:42 UTC) wires the Preview 1 preopened-directory filesystem against the kernel Directory / File cap interface: the wasm-host looks up an optional per-instance Directory cap from the CapSet under the well-known name root and installs it as a single Preview 1 preopen at fd 3 named /preopen-0. capos-wasm/src/wasi/fs.rs implements path_open, fd_read, fd_write, fd_seek, fd_close, fd_filestat_get, fd_prestat_get, and fd_prestat_dir_name over DirectoryClient / FileClient; the resolver mirrors POSIX P1.4 Slice 4’s libcapos-posix/src/path.rs – intermediate segments walk Directory.sub, the leaf mints either an existing or freshly created File via `Directory.open(flags=CREATE
POSIX-shaped softwarePartial implementationCompatibility adapter over explicit file, directory, socket, stdio, timer, process, and namespace caps. See POSIX Adapter proposal and plan. P1.1, P1.2, and P1.3 are closed; the former direct DNS smoke is retired with the qemu-only kernel UdpSocket owner, while make run-posix-pipe-smoke, make run-posix-spawn-smoke, and make run-posix-stdio-smoke cover pipe/fork-for-exec, direct posix_spawn, and Console-backed stdio surfaces. P1.4 file/directory fd work closed at commit f97d9833 (2026-05-23 06:23 UTC): make run-posix-file proves open(), write(), lseek(), read(), opendir(), readdir(), and closedir() through a live C process over the RAM-backed root Directory cap. Closed P1.4 successors now include printf/string (make run-posix-printf), identity stubs (make run-posix-identity), and signal/time stubs (make run-posix-signal-time). Remaining P1.4 work is dash vendoring/patching, the multi-translation-unit C build, and make run-posix-shell-smoke; long-form decomposition lives in POSIX Adapter Dash Port.

Native Rust Today

The implemented path is Rust without the standard library. Programs use core and may use alloc types such as Vec, String, Box, and BTreeMap because capos-rt installs a userspace allocator. They do not get std::fs, std::net, std::thread, println!, environment variables, process arguments, or a libc syscall table.

capos-rt owns the repeated runtime machinery:

  • the _start entry point and capos_rt_main handoff;
  • fixed heap initialization;
  • panic output through an emergency Console path when available;
  • raw exit and cap_enter syscall wrappers;
  • CapSet lookup and interface-id checks;
  • a single-owner ring client;
  • typed clients for implemented kernel and service capabilities;
  • result-cap adoption and queued local release.

Native programs should keep ordinary Rust business logic in normal modules and push OS interaction to typed capOS clients. That keeps pure logic host-testable while making authority visible at capability lookup and child-spawn sites.

Why std Is Different

Rust std is not just “more Rust.” It is an operating-system binding. It expects an implementation of filesystem, networking, threads, time, standard I/O, process, environment, synchronization, and platform error APIs. On Linux those calls are ambient: a process can ask the kernel to open a path or create a socket and the kernel consults global process credentials.

capOS does not have that ambient authority model. A future Rust std path must choose how each std feature gets authority:

std areacapOS authority source
std::io::{stdin, stdout, stderr}StdIO, Console, or TerminalSession caps
std::fsscoped Directory, File, Store, or Namespace caps
std::netsocket or listener caps minted by a network service
std::threadThreadSpawner, ThreadControl, ThreadHandle, and ParkSpace support
std::timeTimer and future wall-clock caps
process spawn and waitProcessSpawner, RestrictedLauncher, and ProcessHandle caps
std::env and current directorysynthetic runtime state backed by manifest or namespace caps

That mapping can be implemented as a capOS std backend, a Rust compatibility crate, or a POSIX-style adapter. The project has not selected one shared ABI for all language runtimes.

Compatibility Terms

Use these terms instead of the vague phrase “compatibility layer”:

  • Native runtime adapter: language-specific runtime glue that talks to capOS capabilities directly. capos-rt is the implemented Rust example; GOOS=capos would be the Go example.
  • Capability-native bindings: generated or handwritten bindings that expose Cap’n Proto interfaces as language-level APIs without POSIX names.
  • POSIX compatibility adapter: a libc or library surface that translates open, read, write, socket, poll, clock_gettime, and similar APIs into operations on granted capabilities.
  • WASI host adapter: a WebAssembly host implementation whose imports are backed by granted capOS capabilities.

The adapter may make code look familiar, but it cannot create authority. A process without a namespace cap still cannot open a file. A process without a network cap still cannot create a socket. A process without a launcher or spawner cap still cannot create children.

Language Tracks

Rust

Rust is the only implemented userspace language. The current target is targets/x86_64-unknown-capos.json, which exposes target_os = "capos" while keeping the booted userspace baseline no_std, static, and panic = "abort". init, demos, shell, and the capos-rt smoke binary build through this custom target.

Open work before broader Rust support:

  • generated clients after the schema surface stabilizes;
  • runtime ParkSpace clients and multi-threaded ring demultiplexing;
  • a decision on Rust std over native capabilities versus a POSIX adapter;
  • package/build conventions for out-of-tree capOS Rust programs.

C and C++

C support is in tree as a Phase 0 substrate. The libcapos/ crate compiles to libcapos.a, a thin Rust staticlib that exposes the capos-rt syscall, ring CALL, CapSet lookup, typed Console.writeLine, Timer.now, EntropySource.fill, VirtualMemory wrappers, native ProcessSpawner.createPipe / Pipe wrappers, and the global allocator under an extern "C" ABI. C binaries link statically against the archive and run on the same userspace ELF layout as Rust demos; make run-c-hello boots a C main() that calls the baseline wrappers, and make run-c-pipe boots a C main() that creates a Pipe, round-trips a marker, closes the writer, and observes EOF. The substrate is intentionally narrow – no errno, no fd table, no POSIX surface – so the separate libcapos-posix layer can own those decisions without churning the substrate. The same archive is what later runtimes such as CPython, MicroPython, Lua, and QuickJS will link against.

libcapos-posix/ builds libcapos_posix.a on top of libcapos.a and ships the v0 POSIX surface: a 32-fd static table, __errno_location() TLS, UDP socket/sendto/recvfrom/close over the kernel UdpSocket cap, clock_gettime(CLOCK_MONOTONIC, ...) and gettimeofday(&tv, NULL) over the kernel Timer cap, pipe/read/write/dup/dup2 over the kernel Pipe cap, file/directory fd operations (open, lseek, opendir, readdir, closedir) over the RAM-backed root Directory cap, and a recording-shim fork/execve/waitpid/_exit/posix_inherit_stdio path plus direct posix_spawn with posix_spawn_file_actions support, all routed through ProcessSpawner.createPipe / ProcessSpawner.spawn when spawning is needed. The shipped smokes are make run-posix-pipe-smoke, make run-posix-spawn-smoke, make run-posix-stdio-smoke, and make run-posix-file. The former make run-posix-dns-smoke target is retired with the qemu-only kernel UdpSocket owner. The remaining v0 phase is the dash port (Phase P1.4) over the kernel RAM-backed File/Directory/Store/Namespace caps from Storage Phase 3 slices 1-3. See docs/backlog/posix-adapter-dash-port.md for the long-form decomposition.

C++ should wait until the C substrate exists and the project decides its C++ ABI policy: exceptions, RTTI, TLS, allocation, unwind behavior, and standard library scope. A freestanding container/arena subset is plausible earlier than full hosted C++.

Go

Go is a dedicated future design because its runtime is close to a userspace operating system. A native GOOS=capos port needs virtual memory reservation and commitment, TLS setup, OS-thread creation, park/wake, monotonic time, debug output, process exit, and eventually network polling.

The current kernel/runtime substrate already proves useful pieces: VirtualMemory, Timer, ThreadControl, ThreadSpawner, ThreadHandle, and private ParkSpace wait/wake exist at the capOS level. The missing work is the Go runtime port and the runtime-side integration contract, not a new ambient syscall namespace.

Go through WASI may be sufficient for CPU-bound tools such as CUE evaluation; that path is tracked as Phase W.8 in WASI Host Adapter (TinyGo or upstream Go GOOS=wasip1 against a future ScriptPackage cap). Native GOOS=capos remains the path for Go network services and full runtime behavior.

Python

Python is not currently supported on booted capOS. The practical paths are:

  • Native CPython through a POSIX compatibility adapter. This depends on the C/libc substrate plus file, stdio, timer, networking, and process adapters. It is the likely path for trusted system scripts, configuration tooling, and Python programs that need capOS networking or storage.
  • MicroPython through the same native C substrate. This is a smaller early scripting option with less runtime surface than CPython.
  • WASI or Emscripten-hosted Python. This is useful for sandboxed or compute-oriented Python. It still runs a Python interpreter; WebAssembly is the sandbox/host ABI, not a way to avoid Python runtime work.

Current upstream CPython support is relevant but not sufficient by itself: PEP 11 lists wasm32-unknown-wasip1 as a Tier 2 CPython platform and wasm32-unknown-emscripten as Tier 3, while PEP 776 records Emscripten support for Python 3.14. Those targets help the WASM path. They do not provide native capOS file, socket, thread, or capability bindings.

Lua

Lua is a capability-scoped scripting runner. The target is not a POSIX Lua shell. A capos-lua process should receive an exact CapSet, load curated standard libraries, expose capabilities as unforgeable host userdata, deny raw CapIds, and flush owned handles at script exit.

Phase 0 lives in demos/lua-smoke/ as a hand-written Lua-subset interpreter written entirely on top of capos-rt. It exists to validate the long-term capability-aware host API design (typed userdata, obj:method(args) dispatch through a host registry, no raw SQE or method-id leak into Lua, errors surfaced as Lua runtime errors) without committing capOS to a particular Lua dialect. The interpreter accepts a strict subset (local, if/elseif/ else, numeric for, while, integer/float arithmetic, string concat, comparison, obj:method(args) calls); tables, closures, coroutines, metatables, and the Lua standard library are not implemented.

Upstream PUC Lua is a small C implementation, so the dialect-compatible path waits on the C/libcapos substrate. The Phase 0 interpreter is not a promise of PUC Lua compatibility and the smoke binary is explicitly labelled runtime = "capos-lua-subset" rather than lua-5.x. When the C/libcapos port lands, the embedded interpreter is replaced or kept as a research-grade sandbox; the host binding shape stays.

JavaScript and TypeScript

JavaScript support means running an engine as an ordinary capOS process. A small QuickJS-style runner is the plausible early native path once C support exists. V8 or SpiderMonkey are much larger C++ runtime ports and should be treated as later experiments. TypeScript would normally compile before execution; capOS should not make a TypeScript compiler part of the kernel or base runtime.

WASI and Browser WebAssembly

WASI support is a host-runtime track: the host imports become capability calls. The full design is in the WASI Host Adapter proposal, and the implementation decomposition is in WASI Host Adapter. The proposal selects wasmi for the v0 phases (no_std + alloc userspace runtime, fuel metering, externref support) and frames wasmtime / WAMR as the W.7+ migration targets. Each WASI import is backed by a typed capOS capability the host adapter already holds; ungranted authority is refused, not synthesised. WASI is a good fit for code that is already designed around explicit imports and sandboxed execution. It is not a replacement for native runtime ports when the language expects OS threads, signals, sockets, memory mapping, or a large POSIX surface.

The browser/WebAssembly proposal is separate. It explores running capOS concepts in a browser using worker-per-process isolation and SharedArrayBuffer-backed rings. It is a teaching and demo target, not current native userspace language support.

Proposal Map

Validation

Current language-runtime validation is Rust-only:

  • tools/check-userspace-runtime-surface.sh verifies that capos-rt owns _start, panic handling, allocator setup, raw syscalls, and entry macros.
  • make capos-rt-check, make init-capos-build, make demos-capos-build, make shell-capos-build, and make capos-rt-capos-build build the booted userspace artifacts against the capOS custom target.
  • make run-smoke, make run-spawn, make run-shell, and make run-terminal exercise the runtime surface through QEMU.

No page should claim support for Python, Go, Lua, C, C++, JavaScript, WASI, or Rust std until there is a booted artifact and a validation target for that runtime.

Trust Boundaries

This page gives reviewers one place to find the hostile-input boundaries, trusted inputs, and current isolation assumptions that matter for capOS security review.

Current Boundaries

Ring 0 to Ring 3

  • Trust rule: the kernel trusts no userspace register, pointer, SQE, CapSet, or result-buffer field.
  • Implemented: syscall arguments, user-buffer ranges, page permissions, opcodes, and capability-table lookups are validated before privileged use in kernel/src/arch/x86_64/syscall.rs, kernel/src/mem/paging.rs, kernel/src/mem/validate.rs, and kernel/src/cap/ring.rs.
  • Validation: Panic Surface Inventory and REVIEW.md at the repository root.

Capability Table to Kernel Object

  • Trust rule: a process acts only through a live table-local CapId with matching generation and interface.
  • Implemented: capos-lib/src/cap_table.rs owns generation-tagged slots; kernel capability dispatch goes through CapObject::call.
  • Validation: cargo test-lib plus QEMU ring and IPC smokes recorded in done task records.

Capability Ring Shared Memory

  • Trust rule: userspace owns SQ writes, but the kernel owns validation, dispatch, completion, and failure semantics.
  • Implemented: SQ/CQ headers and entries live in capos-config/src/ring.rs; kernel dispatch bounds indexes, opcodes, transfer descriptors, and CQ posting, and copies CALL/RECV/RETURN buffers while holding the owning process AddressSpace lock.
  • Validation: cargo test-ring-loom plus QEMU ring corruption, reserved opcode, fairness, IPC, and transfer smokes.

Endpoint IPC and Transfer

  • Trust rule: IPC cannot create or destroy authority except through explicit copy, move, release, or spawn transactions. Delegating an imported client facet must preserve service-visible identity, and shared-service handlers should derive caller identity from live caller-session metadata.
  • Implemented: kernel/src/cap/endpoint.rs, kernel/src/cap/transfer.rs, and capos-lib/src/cap_table.rs implement queued calls, RECV/RETURN, copy/move transfer, legacy receiver metadata propagation, and rollback. Legacy badge metadata is a debug/test surface only. Normal chat, stdio bridge, and shared demo handlers use live caller-session metadata instead of caller-selected badge identity.
  • Validation: Authority Accounting and any open transfer review-finding task records.

Manifest and Boot Package

  • Trust rule: boot manifest bytes and embedded binaries are untrusted until parsed and validated. Only BootPackage holders can request chunked manifest bytes; ordinary services receive no default boot-package authority.
  • Implemented: tools/mkmanifest validates the embedded initConfig graph before serialization. Kernel bootstrap validates the kernel-owned manifest boundary before loading one init process; init BootPackage validation resolves service graph references before spawning children. kernel/src/cap/boot_package.rs, capos-lib/src/elf.rs, and load paths still enforce manifest-read, ELF, load-range, CapSet, and interface bounds.
  • Validation: cargo test-config, cargo test-mkmanifest, cargo test-lib, manifest and ELF fuzz targets, and make run.

Process Spawn Inputs

  • Trust rule: parent-supplied spawn params, ELF bytes, grants, legacy badges, and result-cap insertion must fail closed. Endpoint kernel grants must not share owner caps with the parent. Delegated client facets must not be relabeled into another service identity.
  • Implemented: ProcessSpawner validates ELF load, grants, frame exhaustion, parent cap-slot exhaustion, child-local endpoint creation, and parent-only client result facets. Delegated ClientEndpoint grants preserve source identity; explicit relabel encodings fail closed except for owner or trusted parent endpoint-result caps.
  • Validation: spawn QEMU smoke evidence and review-finding task records.

Console Authentication and Setup

  • Trust rule: console input, account selectors, password verifiers, setup tokens, and passkey challenge state are hostile or sensitive until a login/session component validates them.
  • Implemented: Console remains output-only. The first interactive boundary is session-scoped TerminalSession with bounded readLine, visible/hidden echo, structured cancellation, one move-only foreground holder, caller-session-checked output, and stale-input scrubbing on cancel or owner teardown. CredentialStore verifies one manifest-supplied Argon2id operator credential and one bounded volatile RAM-overlay password created by first-boot setup. capos-shell drives login and setup; there is no separate ConsoleLogin service.
  • Validation: Boot to Shell, boot-to-shell gates in ../../docs/tasks/README.md, make run-terminal, make run-login, and make run-login-setup.
  • Open/future: durable multi-account credential storage, multiple verifier records, rotate/disable state, broader anti-enumeration audit policy, and bounded single-use setup-token/challenge state.

Session Authority and Audit

  • Trust rule: authenticated sessions receive only broker-issued narrow caps. Audit output, service logs, terminal output, and failed-auth diagnostics must not disclose secrets or verifier material.
  • Implemented: SessionManager mints entropy-backed UserSession metadata for operator, explicitly seeded guest, and anonymous profiles. Endpoint caller-session references are HMAC-SHA256 values scoped by an entropy-backed boot key and non-reused endpoint service-scope id. AuthorityBroker validates session/profile matches before minting bundles. RestrictedLauncher returns shell-scoped launch paths instead of BootPackage or broad ProcessSpawner authority.
  • Validation: User Identity and Policy, Boot to Shell, make run-login, make run-login-setup, and future auth/session hostile input tests.
  • Open/future: audit records, stable service-audit identity across endpoint replacement, opaque record IDs, mutable session liveness cells, UserSession.logout, owner-shell/gateway close propagation, narrow renewal paths, broader policy evaluation, and web-terminal origin/RP-ID validation.

SSH Remote Shell Ingress

  • Trust rule: SSH network input, keys, usernames, channel requests, PTY state, environment requests, and disconnects are hostile until the gateway validates protocol state, authenticates the user, and receives a broker-issued shell bundle.
  • Implemented: current proofs cover schema stubs, one development-only host-key fixture, manifest-seeded AuthorizedKeyStore, public-key session minting, unsupported-feature policy, restricted shell launch, and bounded terminal-host wiring over host-local plain TCP. The proposal keeps SSH transport authority in SshGateway; the spawned shell receives only an SshTerminalFactory-produced TerminalSession plus the normal scoped session bundle through RestrictedShellLauncher. Fixture host-key signing, authorized-key mapping, public-key session minting, SSH policy rejection, and restricted launcher inputs fail closed for malformed or unsupported cases.
  • Validation: SSH Shell Gateway, Runtime, Networking, and Shell, make run-ssh-host-key, make run-ssh-authorized-key, make run-ssh-public-key-session, make run-ssh-public-key-auth, make run-ssh-feature-policy, and make run-restricted-shell-launcher (the plain-TCP run-ssh-gateway-terminal-host terminal-host proof is retired with the kernel socket owner).
  • Open/future: full SSH transport transcript and channel binding, password-auth verifier/backoff wiring, production host-key storage, broader account storage, and production remote-shell hardening.

Identity Metadata and Account Records

  • Trust rule: users, principals, accounts, sessions, roles, and profile names are policy metadata, not kernel subjects, ambient authority, or substitutes for held capabilities.
  • Implemented: sessions receive capabilities only after authentication and broker policy evaluation; a principal or account record does not run or call the kernel.
  • Validation: Local Users, Storage, and Policy and User Identity and Policy.
  • Open/future: durable local account-store behavior, profile persistence, rollback checks, and quota enforcement.

Host Tools and Filesystem

  • Trust rule: manifest/config input must not escape intended source directories or invoke unconstrained host commands.
  • Implemented: tools/mkmanifest validates references and path containment, rejects unpinned CUE compilers, and Makefile targets route CUE and Cap’n Proto through pinned tool paths.
  • Validation: Trusted Build Inputs, make generated-code-check, and make dependency-policy-check.

Generated Code and Schema

  • Trust rule: schema, generated bindings, and no_std patches are trusted build inputs.
  • Implemented: schema/capos.capnp, build scripts, tools/generated/capos_capnp.rs, and tools/check-generated-capnp.sh make generated-code drift review-visible.
  • Validation: Trusted Build Inputs and make generated-code-check.

Device DMA and MMIO

  • Trust rule: current userspace receives no raw DMA buffer, device physical address, virtqueue pointer, BAR mapping, or device interrupt handle.
  • Implemented: the QEMU virtio-net path is allowed only through kernel-owned bounce buffers and kernel-owned MSI-X source records routed through the kernel device interrupt dispatch table, bounded device MSI vector pool, and kernel-owned route lifecycle checks.
  • Validation: DMA Isolation and make run-net.
  • Open/future: typed DMAPool, DeviceMmio, and Interrupt capabilities for userspace-driver transition.

Panic and Emergency Paths

  • Trust rule: hostile input should produce controlled errors, not panic, allocate unexpectedly, or expose stale state.
  • Implemented: ring dispatch is mostly controlled-error; remaining panic surfaces are classified by reachability and tracked as hardening work.
  • Validation: Panic Surface Inventory and REVIEW.md.

Security Invariants

  • All authority is represented by capability-table hold edges; no syscall or host tool path should bypass the capability graph.
  • Identity metadata is not authority. Principals identify audit and policy subjects, accounts store durable policy inputs, profiles select bundle and quota templates, and sessions receive caps only through explicit broker minting.
  • The interface is the permission: method authority is expressed by the typed Cap’n Proto interface or by a narrower wrapper capability, not by ambient process identity.
  • Kernel operations at hostile boundaries validate structure, bounds, ownership, generation, interface ID, and resource availability before mutating privileged state.
  • Failed transfer, spawn, manifest, and DMA setup paths must leave ledgers, cap tables, frame ownership, and in-flight call state unchanged or explicitly rolled back.
  • Trusted build inputs must be pinned or drift-review-visible before their output becomes part of the boot image or generated source baseline.
  • Authentication/session code must treat credential records, setup tokens, passkey challenges, session IDs, and audit logs as security boundaries, not ordinary console text.

Open Work

  • Unify fragmented resource ledgers into the authority-accounting model so reviewers can audit quotas without following parallel counters.
  • Harden open panic-surface entries that become more exposed as spawn, lifecycle, SMP, or userspace drivers expand hostile input reachability.
  • Keep DMA in kernel-owned bounce-buffer mode until the DMAPool, DeviceMmio, and Interrupt transition gates have code and QEMU proof.
  • Do not expand production authentication or remote-shell surfaces without hostile-input tests for bounded terminal input, credential failure paths, challenge expiry/replay, audit redaction, and narrow shell cap bundles.

Verification Workflow

This page maps capOS claims to the commands, QEMU smokes, fuzz targets, proof tools, and review documents that currently support them.

Local Command Set

Use the repo aliases and Makefile targets instead of bare host commands. The workspace default Cargo target is x86_64-unknown-none, so host tests rely on aliases that set the host target explicitly.

ScopeCommandWhat it checks
Formattingmake fmt-checkRust formatting across kernel, shared crates, standalone userspace crates, and demos.
Config and manifest logiccargo test-configCap’n Proto manifest encode/decode, CUE value handling, CapSet layout, and config validation.
Ring concurrency modelcargo test-ring-loomBounded SQ/CQ producer-consumer invariants and corrupted-SQ recovery behavior.
Deferred-completion concurrency modelmake model-dma-deferred-completion-loomBounded Loom over the kernel DeferredCompletionQueue reservation budget and the multi-CPU TLB shootdown generation re-read (kernel/src/arch/x86_64/tlb.rs); budget never exceeded, no completion dropped/double-popped, no retire ahead of a covering flush.
DMA authority lifecycle modelmake model-dma-tlaPinned TLC bounded check of models/dma/dma_authority.tla: allocate->map->publish->complete->revoke->scrub->reuse ordering plus generation-keyed stale completion, record-before-PTE-install split, drive-pin/quarantine, and queue-enable epoch-fence interleavings. Fails closed on any invariant violation, deadlock, or analyzer error.
DMA assurance aggregatemake dma-assurance-model-checkLocal aggregate over the DMA Alloy, TLA+, Loom, and Kani gates. Requires installed cargo-kani; GitHub CI splits the same evidence across the DMA Assurance Models and Kani Proofs jobs.
Shared library logiccargo test-libELF parser, frame bitmap, frame ledger, capability table, and property-test coverage.
Manifest toolcargo test-mkmanifestHost-side manifest conversion and validation behavior.
Userspace runtimetools/check-userspace-runtime-surface.sh; make capos-rt-check; make init-capos-build demos-capos-build shell-capos-buildRuntime primitive ownership, custom-target boot build path, entry ABI, typed clients, ring helpers, and no_std constraints.
Kernel buildcargo build --features qemuKernel build with the QEMU exit feature enabled.
Generated codemake generated-code-checkCap’n Proto compiler path/version, schema binding output equality, no_std patch anchors, Adventure and Paperclips content freshness, locked generator dependencies, and checked-in generated-output drift.
Dependency policymake dependency-policy-checkcargo-deny and cargo-audit policy across root and standalone Cargo lockfiles, plus npm lockfile validation and audit checks for the docs toolchain.
Mandatory Kani gatemake kani-libBounded capos-lib harness set for frame allocation, stale-handle rejection, frame-grant and cap-slot fail-closed accounting, and transfer-origin fail-closed behavior.
DMA-authority core Kani gatemake kani-dma-authorityBounded Kani over the extracted pure DMA-authority core (capos_lib::dma_authority): a recycled slot’s generation strictly increases and never aliases a live handle, a stale-generation completion is rejected without mutating completion/free/reuse state, and a buffer cannot be re-exposed until its in-flight completion is observed. Faithful model of the kernel/src/device_dma.rs authority arithmetic; the kernel call-through is a tracked follow-up.
Full image buildmakeKernel, userspace demos, runtime smoke binaries, manifest, Limine artifacts, and ISO packaging.
Default interactive bootmake runOperator-facing default init-owned boot path from layered system.cue: standalone init starts the foreground shell, resident demo services, and the remote-session CapSet gateway, forwards only the remote CapSet endpoint on loopback, and keeps console/debug output logged separately.
Default QEMU smokemake run-smokeScripted focused shell-led boot from system-smoke.cue: kernel boot-launches capos-shell as init, grants the shell bootstrap cap bundle, then proves anonymous-session bootstrap, login failed-auth redaction, successful password auth, broker upgrade to operator bundle, terminal isolation, and clean halt.
Focused spawn QEMU smokemake run-spawnNarrower init-owned ProcessSpawner graph: kernel boot-launches only standalone init with Console, BootPackage, and ProcessSpawner; init validates BootPackage metadata, spawns endpoint/IPC/VirtualMemory/Timer/FrameAllocator children, waits for them, exercises hostile spawn checks, and halts cleanly.
Shell, terminal, and local-auth smokesmake run-shell; make run-terminal; make run-credential; make run-login; make run-login-setupAnonymous shell behavior, TerminalSession line input and cancellation, CredentialStore verifier behavior, username-aware password login, broker-issued operator bundle upgrade, volatile first-boot setup credential creation, terminal isolation, and stale-handle release.
Focused service smokesmake run-chat; make run-adventure; make run-paperclips; make run-revocable-read; make run-memoryobject-shared; make run-ringtap-failing-callResident-service demos, the clean-room Paperclips terminal demo, revocation behavior, MemoryObject sharing, and debug-tap viewer behavior.
Networking smokemake run-net; make qemu-net-harnessQEMU virtio-net attachment, kernel PCI/device-discovery path, descriptor-accounting guard evidence, ARP, and ICMP. TCP/UDP socket proof lives under the Phase C userspace network-stack gates.
SSH gateway proof smokesmake run-ssh-host-key; make run-ssh-authorized-key; make run-ssh-public-key-session; make run-ssh-public-key-auth; make run-ssh-feature-policy; make run-restricted-shell-launcherDevelopment host-key fixture validation, authorized-key mapping, public-key session minting, public-key authentication failure privacy, unsupported SSH feature policy, and restricted shell launch authority. The bounded host-local socket-to-TerminalSession wiring proof is retired with the kernel socket owner.

Do not claim full verification unless the relevant command actually ran in the current change. For doc-only changes, use an appropriately narrower check such as mdbook build.

Review Workflow

  1. Identify the changed trust boundary or state that the change is docs-only.
  2. Read REVIEW.md (at the repository root) for the applicable security, unsafe, memory, performance, capability, and emergency-path checklist.
  3. Read the relevant review-finding task records under docs/tasks/ before judging correctness so known open findings are not treated as solved behavior.
  4. For system-design work, list the concrete design and research files read; reviewers should reject vague grounding such as “docs” or “research”.
  5. Run the smallest command set that exercises the changed behavior, then add QEMU proof for user-visible kernel or runtime behavior.
  6. Record unresolved non-critical findings as task records under docs/tasks/ or docs/tasks/on-hold/ with concrete remediation context before treating the task as reviewed.

Evidence by Claim

Claim typeRequired evidence
Parser or manifest validationHost tests for valid and malformed input; fuzz target when arbitrary bytes can reach the parser.
Kernel/user pointer safetyQEMU hostile-pointer smoke plus code review of address, length, permissions, and validation-to-use windows.
Ring or IPC transport behaviorHost model/property tests where possible, plus QEMU process output proving success and failure paths.
Userspace runtime primitive ownershiptools/check-userspace-runtime-surface.sh plus review of capos-rt/src/entry.rs, alloc.rs, panic.rs, and syscall.rs.
Capability transfer or releaseRollback tests for copy/move/release failure, cap-slot exhaustion, stale caps, and process-exit cleanup. A release-only proof shows local cleanup only; any claim that peers, children, sessions, or delegated holders lose authority needs a separate explicit revoke, session-expiry, object-epoch, or service-specific invalidation proof.
Resource accountingTests that prove quota rejection, matched release on success and failure, and process-exit cleanup.
Generated code, schema, or generated content changesmake generated-code-check and a checked-in baseline diff generated by the pinned compiler or pinned CUE/generator path.
Dependency or toolchain changesDependency-class review plus make dependency-policy-check; update Trusted Build Inputs when trust assumptions change.
Device or DMA workmake run-net or a targeted QEMU smoke; no userspace-driver transition without the gates in DMA Isolation.
Panic-surface hardeningUpdated Panic Surface Inventory when reachability or classification changes.
Authentication and session workHost tests for TerminalSession line-input bounds, secret-mode echo suppression, cancellation behavior, exclusive terminal handoff, non-inheritance without an explicit grant, verifier encoding, entropy-unavailable fail-closed behavior, bootstrap-plus-RAM-overlay credential handling, volatile credential/disable-state disclosure, bounded single-use setup-token/challenge first-consume/expiry/replay semantics, generic failure/backoff policy, and audit redaction with opaque record IDs plus pre-auth serial-safe failure events; QEMU proof for setup/login, failed auth, successful capos-shell launch through TerminalSession/CredentialStore/SessionManager/AuthorityBroker, lack of terminal access for an ungranted child, absence of broad BootPackage/raw ProcessSpawner caps in the shell, and fail-closed behavior when the secure-randomness path is unavailable.

Fuzzing and Proof Tracks

The current fuzz corpus lives under fuzz/ and covers manifest Cap’n Proto input, exported JSON conversion for mkmanifest, arbitrary ELF parser input, Telnet IAC filtering, terminal line discipline, and ring SQE wire validation. Run fuzzers when a change alters those parsers, schema shape, terminal/network byte-stream handling, SQE validation, or related validation rules.

Kani coverage is intentionally narrow and lives in capos-lib, where pure logic can be bounded without hardware state. Add or refresh Kani harnesses for ledger, cap-table, bitmap, and parser invariants when those invariants become part of a security claim. The required local/CI gate is make kani-lib. The extracted DMA-authority core (capos_lib::dma_authority) has its own bounded gate, make kani-dma-authority, which proves ownership-generation bump on recycle, stale-handle rejection without mutation, and no-re-expose before completion — a faithful model of the kernel/src/device_dma.rs arithmetic whose kernel call-through is a tracked follow-up.

Loom coverage belongs in shared ring logic. Extend cargo test-ring-loom when SQ/CQ ownership, ordering, corruption recovery, or wake semantics change.

DMA assurance model files live under models/dma/ and are bounded checked evidence for device and cloud-backend claims. The Alloy relational authority graph (models/dma/dma_authority.als) is now an analyzer-checked gate: make model-dma-alloy runs the pinned Alloy Analyzer 6.2.0 headless at scope for 4 and fails on any counterexample (free-page reachability, same-domain IOVA uniqueness, and the ownership-generation stale-handle gate), with the checked verdict table recorded in models/dma/README.md. The focused Loom gate for the DeferredCompletionQueue reservation budget and the multi-CPU generation re-read is also checked (make model-dma-deferred-completion-loom, pinned loom 0.7.2). The TLA+ lifecycle model (models/dma/dma_authority.tla) is now a model-checked gate as well: make model-dma-tla runs the pinned TLC 2.19 (tla2tools 1.7.4) over the bounded configuration (2 devices / 2 domains / 2 pages / 2 iovas, generations 0..1) and fails closed on any invariant violation, deadlock, or analyzer error, with the checked result recorded in models/dma/README.md. It covers the lifecycle ordering plus the landed generation-keyed stale completion, record-before-PTE-install split, drive-pin/ quarantine, and queue-enable epoch-fence interleavings. The extracted pure DMA-authority core is checked by Kani as well (make kani-dma-authority, pinned kani-verifier 0.67.0): ownership-generation bump on recycle, stale-handle rejection without mutation, and no-re-expose before completion over capos_lib::dma_authority.

The DMA checked gates are wired into CI. The GitHub dma-assurance-models job runs make model-dma-alloy, make model-dma-tla, and make model-dma-deferred-completion-loom; the kani-proofs job runs make kani-dma-authority after the mandatory make kani-lib gate. The local make dma-assurance-model-check aggregate runs all four when cargo-kani is installed. Do not claim Verus evidence – or any Alloy/TLC/Loom/Kani DMA-authority result beyond what these targets actually check – unless the exact command, checker version, configuration, model bounds, and output are recorded in the task closeout.

For DMA work, map claims through DMA Assurance Model: TLA+ for lifecycle ordering and races, Alloy for authority topology, Kani for pure Rust validators/accounting, and Loom for atomic or queue interleavings such as DeferredCompletionQueue. The model supplements the required QEMU or cloud evidence; it does not replace hardware-facing smokes.

Documentation Sources

  • REVIEW.md (at the repository root): rules for security, unsafe code, capability invariants, resource accounting, and emergency paths.
  • docs/tasks/: open remediation backlog, review-finding task records, and latest verification task records.
  • Trusted Build Inputs: trusted compiler, generated-code, dependency, bootloader, manifest, and host-tool inputs.
  • Panic Surface Inventory: classified panic-like surfaces and commands used to generate the inventory.
  • Authority Accounting: authority graph, quota, transfer, rollback, and ProcessSpawner accounting invariants.

Security Verification Track Registry

The S.x labels used across this manual are registry identifiers for the Security Verification Track. They are not product stages. When a section mentions one of these labels, read it as shorthand for the track name below.

  • S.1 — CI bootstrap. Status: Landed.
  • S.2 — Miri and proptest on capos-lib. Status: Landed.
  • S.3 — Manifest and mkmanifest fuzzing. Status: Landed.
  • S.4 — Ring Loom harness. Status: Landed.
  • S.5 — Kani on capos-lib. Status: Initial bounded gate landed.
  • S.6 — Security review docs stay aligned. Status: Ongoing.
  • S.7 — Stage-6-aware security refresh. Status: Planned/ongoing.
  • S.8 — Untrusted-service hardening gate. Status: Planned.
  • S.9 — Authority graph and resource accounting. Status: Design landed.
  • S.10 — Supply-chain and generated-code trusted computing base. Status: Partially landed.
  • S.11 — Device and DMA isolation gate. Status: Design accepted; implementation gates open.
  • S.12 — Kani harness bounds refresh. Status: Planned.
  • S.13 — ELF parser arbitrary-input coverage. Status: Landed.
  • S.14 — Telnet IAC filter fuzz coverage. Status: Landed.
  • S.15 — Telnet differential round-trip and line-discipline extraction. Status: Landed.
  • S.16 — Ring SQE wire-validation extraction and fuzz target. Status: Landed.
  • S.17 — Sanitizers on host tests. Status: Planned.

Subtracks Used In This Manual

  • S.10.0 under S.10 — Trusted build input inventory.
  • S.10.2 under S.10 — Generated-code drift check.
  • S.10.3 under S.10 — Dependency policy and no_std review gate.
  • S.11.1 under S.11 — DMA capability invariants.
  • S.11.2 under S.11 — Userspace-driver ownership-transition gate.

The S.11.2.0 through S.11.2.9 labels in the DMA chapter are local checklist rows for the userspace-driver transition gate. They are acceptance criteria under S.11.2, not separate project tracks.

Trusted Build Inputs

This inventory covers the build inputs currently trusted by the capOS boot image, generated bindings, host tooling, and verification paths. It started as the Security Verification Track S.10.0 inventory, records the Security Verification Track S.10.2 generated-code drift check, and now also records the Security Verification Track S.10.3 dependency policy plus the shared no_std generated-code patch helper. The consolidated long-horizon supply-chain risk view – floating Rust nightly, repo-pinned qemu-system-x86_64 / xorriso digests (CI now apt-installs qemu-system-x86=1:8.2.2+ds-0ubuntu1.16, xorriso=1:1.5.6-1.1ubuntu3, and ovmf=2024.02-2ubuntu0.8 so package identity is captured; the OVMF firmware blob is now repo-pinned by SHA-256 (OVMF_CODE_SHA256, landed at commit f1c8c8fb, merged at ca5a1fea) and the ovmf-verify Makefile gate fails the build on drift, but download-and-verify of the qemu-system-x86_64 / xorriso tool blobs remains a future step), PR-blocking CI environment provenance comparison, and the remaining immutable-runner-image / repo-managed tool-digest gap – is tracked as R13 in docs/design-risks-register.md; the gap text below stays consistent with that entry.

Summary

InputCurrent sourcePinning statusDrift-review status
Limine bootloader binariesMakefile:5-10, Makefile:34-49Git commit and selected binary SHA-256 values are pinned.make limine-verify fails if the checked-out commit or copied bootloader artifacts drift.
Rust toolchainrust-toolchain.toml:1-4, .github/workflows/ci.ymlDate-pinned nightly-2026-04-20 channel with target triples and the rust-src component required by custom-target -Zbuild-std userspace builds. The CI host-baseline, dma-assurance-models, and qemu-smoke jobs explicitly request the same dated nightly. The Kani job remains pinned separately to nightly-2025-11-21 paired with the Kani-compatible bundle installed by cargo kani setup.The dated channel resolves to rustc 1.97.0-nightly (e22c616e4 2026-04-19) (the 2026-04-20 manifest carries the previous day’s rustc commit). Bumps are review-visible as rust-toolchain.toml and workflow diffs; the advance procedure is recorded in the Rust Toolchain section below.
Workspace cargo dependenciesCargo.toml, crate Cargo.toml files, Cargo.lockLockfile pins exact crate versions and checksums for the root workspace. Manifest requirements remain semver ranges.make dependency-policy-check runs cargo deny check plus cargo audit against the root workspace and lockfile in CI.
Standalone cargo dependencies (covered by make dependency-policy-check)init/Cargo.lock, demos/Cargo.lock, demos/wasi-hello-rust/Cargo.lock, demos/wasi-cli-args/Cargo.lock, demos/wasi-env/Cargo.lock, demos/wasi-fs/Cargo.lock, demos/wasi-random/Cargo.lock, demos/wasi-preview1-refusals/Cargo.lock, demos/wasi-stdio-fd/Cargo.lock, tools/adventure-content-gen/Cargo.lock, tools/paperclips-content-gen/Cargo.lock, tools/mkmanifest/Cargo.lock, tools/remote-session-client/Cargo.lock, tools/ringtap-viewer/Cargo.lock, capos-rt/Cargo.lock, capos-service/Cargo.lock, shell/Cargo.lock, libcapos/Cargo.lock, libcapos-posix/Cargo.lock, capos-wasm/Cargo.lock, fuzz/Cargo.lockEach standalone workspace has its own lockfile. The Makefile DEPENDENCY_POLICY_MANIFESTS / DEPENDENCY_POLICY_LOCKFILES lists drive the gate.make dependency-policy-check runs the shared deny/audit baseline against every standalone manifest and lockfile listed above (root workspace Cargo.lock plus the 21 standalone lockfiles in this row). Cross-workspace version drift remains review-visible and intentional where lockfiles differ.
Standalone cargo dependencies (not yet under policy gates)tools/remote-session-client/src-tauri/Cargo.lock, vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lockTwo checked-in lockfiles fall outside DEPENDENCY_POLICY_LOCKFILES. tools/remote-session-client/src-tauri/Cargo.lock is the Tauri scaffold lockfile; make remote-session-tauri only exposes deterministic policy and check modes and reviewed dev mode – distributable package and desktop automation modes are blocked. vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lock is part of the vendored upstream snapshot covered separately by the wasmi =1.0.9 path-dependency pin in capos-wasm/Cargo.toml.Both lockfiles are review-visible through ordinary diffs but are not run through cargo deny check / cargo audit today. Promoting either into DEPENDENCY_POLICY_LOCKFILES is gated on the matching authority decision (Tauri scaffold scope decision; wasmi refresh procedure in vendor/wasmi-no_std/VENDORED_FROM.md).
Cap’n Proto compilerMakefile:12-80, tools/capnp-build/src/lib.rs, capos-config/build.rs, tools/check-generated-capnp.sh, tools/mkmanifest/src/lib.rs, tools/mkmanifest/src/main.rsOfficial capnproto-c++-1.2.0.tar.gz source tarball URL, version, and SHA-256 are pinned in Makefile; make capnp-ensure builds $(CAPOS_TOOLS_ROOT)/capnp/1.2.0/bin/capnp under the per-user tool cache so linked worktrees reuse it. The build rule patches the distributed CLI version placeholder to the pinned version before compiling.The shared build helper defaults to the pinned path and rejects CAPOS_CAPNP when it points elsewhere. Make targets export the pinned path and CI persists it through $GITHUB_ENV. make generated-code-check verifies both the exact compiler path and Cap'n Proto version 1.2.0 before regenerating bindings through Cargo. mkmanifest cue-to-capnp also rejects missing or non-canonical CAPOS_CAPNP, checks Cap'n Proto version 1.2.0, and delegates schema-aware JSON-to-binary conversion to that pinned compiler.
Cap’n Proto Rust runtime/codegen cratescapos-config/Cargo.toml, kernel/Cargo.toml, tools/capnp-build/Cargo.toml, Cargo.lockCargo manifests use exact capnp = "=0.25.4" and capnpc = "=0.25.3" requirements where declared; lockfiles pin exact crate versions and checksums.Security Verification Track S.10.3 now requires dependency-class and no_std review before these changes are accepted.
Kani verifier toolchain.github/workflows/ci.yml, Makefile, tools/run-kani-proofs.sh, tools/cloudbuild-kani.yaml, .gcloudignoreGitHub CI pins kani-verifier 0.67.0; cargo kani setup installs the matching Kani bundle plus nightly-2025-11-21-x86_64-unknown-linux-gnu into the user-local Kani/rustup paths. Local make kani-lib and make kani-dma-authority expect a compatible cargo-kani install. The high-memory make kani-lib-full path uses Google Cloud Build image digest rust@sha256:adab7941580c74513aa3347f2d2a1f975498280743d29ec62978ba12e3540d3a on E2_HIGHCPU_32, installs rustup from https://sh.rustup.rs, sources /usr/local/cargo/env, initializes minimal git metadata for build tooling that expects a repository, then pins nightly-2025-11-21 plus cargo-kani 0.67.0.The CI kani-proofs job installs kani-verifier 0.67.0, runs cargo kani setup, and executes the bounded make kani-lib harness list plus the DMA-authority make kani-dma-authority harness group. The Cloud Build config installs the same Kani version and runs make kani-lib-full; it depends on explicit source staging and logs in maintainer-private GCS buckets configured in tools/cloudbuild-kani.yaml, .gcloudignore secret exclusions, and account/project IAM for Cloud Build submission and the selected runtime service account. Version, image, worker, bucket, IAM, rustup bootstrap, synthetic git metadata, or setup-path changes are review-visible in the workflow, Cloud Build config, runner script, and this inventory.
Alloy Analyzer (DMA assurance model checker)Makefile ALLOY_VERSION/ALLOY_TARBALL_URL/ALLOY_TARBALL_SHA256, tools/run-dma-alloy-model.sh, models/dma/dma_authority.alsSelf-contained linux/amd64 Alloy Analyzer 6.2.0 app image (bundled Temurin JRE + native SAT solvers) pinned by SHA-256; make alloy-ensure downloads and verifies it into $(CAPOS_TOOLS_ROOT)/alloy/6.2.0/ (the jar is not vendored). This slice owns the Alloy pin shared with the scheduler lease model track.make model-dma-alloy verifies the tarball SHA-256, checks the launcher reports version 6.2.0, and runs the relational authority-graph checks/witnesses headless at scope for 4, failing on any counterexample or analyzer error. GitHub CI runs it in the dma-assurance-models job.
TLC model checker (DMA assurance lifecycle model)Makefile TLA_TOOLS_VERSION/TLA_TOOLS_JAR_URL/TLA_TOOLS_JAR_SHA256/TLA_JRE_URL/TLA_JRE_SHA256, tools/run-dma-tla-model.sh, models/dma/dma_authority.tlatla2tools.jar 1.7.4 (TLC 2.19) pinned by SHA-256 plus a SHA-256-pinned Temurin JRE 17.0.19+10 (the bare jar needs a JVM, unlike the self-contained Alloy app image); make tla-ensure downloads and verifies both into $(CAPOS_TOOLS_ROOT)/tla/ (neither is vendored). This slice owns the TLC pin shared by the scheduler/IRQ TLA+ model tracks.make model-dma-tla re-verifies the jar SHA-256 and the pinned JRE version, then runs TLC over the bounded .cfg (2 devices / 2 domains / 2 pages / 2 iovas, generations 0..1), failing closed on any invariant violation, deadlock, or analyzer error (exit code and the “No error” marker are both asserted). GitHub CI runs it in the dma-assurance-models job.
Generated capnp bindingscapos-config/src/lib.rs:10-12, tools/generated/capos_capnp.rs, tools/check-generated-capnp.shGenerated into Cargo OUT_DIR; the expected patched output is checked in under tools/generated/.make generated-code-check regenerates the canonical capos-config output and fails if that output differs from the checked-in baseline or if kernel-generated output reappears.
no_std patching of generated bindingstools/capnp-build/src/lib.rs, capos-config/build.rs, tools/check-generated-capnp.shOne shared build-support crate asserts the patch anchor and injects the no_std imports after generation. capos-config/build.rs calls that helper as the single schema binding owner.make generated-code-check verifies the patched output contains the expected no_std imports and matches the checked-in baseline.
Generated adventure contentdemos/adventure-content/content/prototype.cue, tools/adventure-content-gen/, demos/adventure-content/src/generated.rs, tools/check-generated-adventure-content.shPrototype mission content is authored in checked-in CUE and generated by a standalone locked Cargo host tool into a checked-in no_std Rust content blob. The checker requires the pinned CUE path under $(CAPOS_TOOLS_ROOT) and cue version v0.16.0.make generated-code-check runs generated-adventure-content-check, which exports the CUE source as JSON, runs tools/adventure-content-gen with cargo run --locked, formats the generated output, and fails on drift from the checked-in baseline.
Generated Paperclips contentdemos/paperclips-content/content/paperclips.cue, schema/paperclips-content.capnp, tools/paperclips-content-gen/, demos/paperclips-content/src/generated.rs, tools/check-generated-paperclips-content.shPaperclips game content is authored in checked-in CUE, schema-validated through the typed PaperclipsContent Cap’n Proto root, and generated by a standalone locked Cargo host tool into checked-in typed Cap’n Proto bytes embedded by a no_std Rust wrapper. The checker requires the pinned CUE path under $(CAPOS_TOOLS_ROOT), cue version v0.16.0, and the pinned Cap’n Proto compiler path/version used for schema-aware conversion.make generated-code-check runs generated-paperclips-content-check, which exports the CUE source as JSON, converts it through mkmanifest cue-to-capnp against schema/paperclips-content.capnp, runs tools/paperclips-content-gen with cargo run --locked, formats the generated output, and fails on drift from the checked-in generated content.
Userspace custom targettargets/x86_64-unknown-capos.json, .cargo/config.toml, Makefile, system*.cueSource-controlled target specification plus Cargo aliases, Makefile build wrappers, and manifest paths for booted init, demos, shell, and capos-rt runtime builds. The target JSON uses Rust nightly custom-target support and builds core,alloc from rust-src.make init-capos-build demos-capos-build shell-capos-build capos-rt-capos-build verifies the userspace crates against target_os = "capos"; QEMU smokes embed target/x86_64-unknown-capos/release userspace artifacts.
Userspace runtime surface checktools/check-userspace-runtime-surface.shSource-controlled script that treats capos-rt as the only owner of _start, panic, allocator, raw syscall, and entry-point macro definitions.Run directly when runtime or userspace entry code changes; it is not a QEMU transcript assertion and does not live inline in Makefile.
Linker script build scriptskernel/build.rs, init/build.rs, demos/*/build.rs, capos-rt/build.rs, capos-wasm/build.rsSource-controlled scripts and linker scripts. capos-rt/build.rs emits the runtime linker script for both the legacy target_os = "none" userspace build path and the booted custom target_os = "capos" path. capos-wasm/build.rs mirrors the same pattern for the wasm-host bin (Phase W.2 onward) and uses cargo:rustc-link-arg-bins so the linker script applies only to the bin and not the lib.Build rerun boundaries are explicit; generated link args are not independently audited.
CUE manifest compilerMakefile CUE_TARBALL_URL/CUE_TARBALL_SHA256, tools/mkmanifest/src/main.rs, tools/mkmanifest/src/lib.rs, .github/workflows/ci.ymlmake cue-ensure downloads the official cue_v0.16.0_linux_amd64.tar.gz release binary, verifies its SHA-256, extracts cue into $(CAPOS_TOOLS_ROOT)/cue/0.16.0/bin/cue, and checks the reported version – the same download-and-verify pattern used for Typst and uv. CAPOS_TOOLS_ROOT defaults to $HOME/.capos-tools (per-user shared cache); operators may override it explicitly. This replaces the prior go install cuelang.org/go/cmd/cue, which compiled from source under a floating Go toolchain rather than verifying a pinned binary by hash.Make exports CAPOS_CUE and CAPOS_TOOLS_ROOT to tools/mkmanifest, and CI records that exact path through $GITHUB_ENV before both the host-baseline cargo test-mkmanifest gate and QEMU smoke. mkmanifest::expected_cue_path derives the same per-user path, rejects missing or non-canonical CAPOS_CUE, and checks cue version v0.16.0 before export. The same path and version checks now gate both boot-manifest compilation and mkmanifest cue-to-capnp data-message conversion.
Default boot manifest defaults packagecue/defaults/defaults.cue, cue.mod/module.cue, system.cue, tools/mkmanifest/src/lib.rscue/defaults/defaults.cue declares package defaults and exports #DefaultSystem, the shared scaffold for the default boot manifest. cue.mod/module.cue pins module: "capos.local" with language v0.16.0. system.cue imports the defaults via capos.local/cue/defaults, declares package capos, and mkmanifest --package capos system.cue manifest.bin exports the unified package.make invokes mkmanifest with --package capos only when MANIFEST_SOURCE is system.cue; focused-proof system-*.cue manifests stay in single-file mode. The defaults package is a manifest-rule prerequisite, so edits trigger rebuilds.
Operator overlay surfacesystem.local.cue.example, system.local.cue (gitignored), .gitignoreThe repo-root overlay file is system.local.cue (package capos); system.local.cue.example is the committed worked-example template. CUE’s package mode unifies it with system.cue automatically.Operators copy the example, edit, and rebuild — system.local.cue is a wildcard-resolved manifest-rule prerequisite. The overlay is gitignored explicitly to avoid accidental commits of host-specific keys or principals.
Host-user manifest tagMakefile, system.cue _user @tag(user) / _displayName @tag(displayName), tools/mkmanifest/src/lib.rs cue_export_args / cue_tags_from_env_values, target/.cue-tags.<manifest>make run sets CAPOS_CUE_USER=$(USER). mkmanifest reads that structured account variable, derives displayName from the same account’s first GECOS/comment field in /etc/passwd when CAPOS_CUE_DISPLAY_NAME is unset, and falls back to the account name when the passwd comment is unavailable. It also reads generic CAPOS_CUE_TAGS (and --tag key=value CLI repeats) and forwards each entry to cue export --inject; structured CAPOS_CUE_USER / CAPOS_CUE_DISPLAY_NAME override duplicate generic keys. The target/.cue-tags.<manifest-bin> sentinel records the active tag state via a FORCE-prereq rule that touches the file only when content differs, so a tag change invalidates the cached manifest.bin; the recipe reads exported environment values at shell runtime rather than splicing tag text into shell syntax.The injected user value reaches the manifest via system.cue’s _user: string | *"operator" @tag(user) and surfaces as the default local operator seed account name; displayName reaches the seed account display name. Untagged system.cue keeps the operator account-name/display-name defaults, while focused demo and smoke manifests pin their own demo fixtures.
mdBook documentation toolsMakefile, book.tomlGitHub release assets for mdBook v0.5.0 and mdbook-mermaid v0.17.0 are pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT), which defaults to $HOME/.capos-tools. mdbook-mermaid supplies the pinned mermaid.min.js browser bundle used by both mdBook HTML rendering and docs-PDF Mermaid rasterization.make docs and make cloudflare-pages-build verify the tarball checksums and executable versions, refresh the Mermaid assets, and build target/docs-site.
Typst typesetter (paper and docs PDF builds)Makefile TYPST_VERSION, papers/schema-as-abi/main.typ, docs/manual.typGitHub release asset for Typst v0.14.2 is pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT)/typst/0.14.2, mirroring the mdBook pinning pattern. typst-ensure verifies the tarball checksum and the binary’s reported version before paper and docs-PDF targets invoke it. Bundled New Computer Modern font keeps builds reproducible across hosts.make paper rebuilds target/papers/schema-as-abi/main.pdf using the pinned Typst binary; make cloudflare-pages-build additionally publishes the PDF as target/docs-site/papers/schema-as-abi.pdf. make docs also uses Typst to compile the generated system manual PDF from docs/manual.typ plus per-page converted Markdown body content. Generated PDFs are not checked in; source main.typ, references.bib, docs/manual.typ, and documentation inputs are checked in.
Documentation PDF converterMakefile UV_VERSION / MD2TYPST_VERSION, .node-version, package.json, package-lock.json, tools/md2typst-constraints.txt, tools/docs-bundle.js, tools/build-typst-manual.js, tools/mermaid-puppeteer-config.json, docs/manual.typ, docs/manual-overrides/*.typGitHub release asset for uv 0.11.8 (uv-x86_64-unknown-linux-gnu.tar.gz) is pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT)/uv/0.11.8. uv-ensure verifies the tarball checksum and the binary’s reported version before PDF generation. uv tool run --constraints tools/md2typst-constraints.txt --from md2typst==0.3.3 md2typst pins the Markdown-to-Typst converter and its Python dependency set. Node version 22.16.0 is declared by .node-version and package.json; the current Makefile invokes node and npm from PATH, so host Node selection remains an operator/CI environment responsibility. package-lock.json pins @mermaid-js/mermaid-cli and its Puppeteer dependency tree; make mermaid-cli-ensure runs npm ci --ignore-scripts with PUPPETEER_SKIP_DOWNLOAD=1, so Puppeteer’s install script cannot fetch a browser during dependency installation. Mermaid rasterization uses the explicit MERMAID_BROWSER_BIN Chromium/Chrome executable, passes it to Puppeteer as PUPPETEER_EXECUTABLE_PATH, and renders PDF diagrams at MERMAID_PDF_SCALE=3 by default; tools/mermaid-puppeteer-config.json disables the browser sandbox for local and gVisor build containers. tools/docs-bundle.js reads the explicit manual page list from docs/manual.typ, generates target/docs-bundle/manual.md plus one Markdown file per manual page, and docs/manual.typ owns the PDF title page, contents, page order, page styling, and override placeholders.make docs-pdf converts each generated manual Markdown page to Typst with md2typst, normalizes anchors and links with tools/build-typst-manual.js, uses any matching checked-in docs/manual-overrides/<page-id>.typ instead of the generated page, rasterizes Mermaid diagrams through the explicit browser executable, and compiles target/docs-bundle/manual.pdf with pinned Typst. make docs copies that generated PDF to target/docs-site/manual.pdf for Cloudflare Pages publication. Generated Markdown, Typst body pages, and PDF files are ignored build artifacts, not tracked source.
QEMU and firmwareMakefile:85-96, tools/build-provenance.sh, .github/workflows/ci.yml qemu-smokeThe qemu-smoke CI job installs qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04) and ovmf=2024.02-2ubuntu0.8 (amd64, noble-updates/main, Ubuntu 24.04). OVMF delivers /usr/share/ovmf/OVMF.fd – the first entry in the Makefile’s OVMF_CODE_CANDIDATES list, so the wildcard discovery resolves to that path on the pinned runner. The Makefile now also pins the selected OVMF firmware blob by SHA-256 (OVMF_CODE_SHA256) and gates the ISO and cloud-disk rules on ovmf-verify, which fails on hash drift and emits a NOTICE skip when no OVMF candidate is installed. Local boot verification still uses the host-installed qemu-system-x86_64.make build-provenance records the current QEMU version, selected executable path, package identity when discoverable, OVMF selected path or explicit absence, OVMF package identity when discoverable, and OVMF firmware hash when the configured firmware path exists. QEMU and OVMF are identified on the CI runner by package name, exact version, architecture, normalized apt source pocket, and selected path; the QEMU binary identity is captured via dpkg-query/apt-cache policy by make build-provenance per run and the OVMF firmware-blob SHA-256 is captured the same way. make ovmf-verify fails the build when the on-host OVMF firmware blob does not match the pinned OVMF_CODE_SHA256.
ISO and host filesystem toolsMakefile:317-341, tools/build-provenance.sh, .github/workflows/ci.yml qemu-smokeThe qemu-smoke CI job installs xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04), make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), and git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04). Local builds still use host-installed xorriso, sha256sum, git, make, and shell utilities.make build-provenance records selected executable paths and package identities when discoverable for xorriso, sha256sum, make, git, and related local build tools, plus final ISO hashes. xorriso, make, and git are identified on the CI runner by package name, exact version, architecture, normalized apt source pocket, and selected path; the per-run identity is captured via dpkg-query/apt-cache policy by make build-provenance. The remaining host tools, including sha256sum, shell, build-essential, and curl, remain host-provided or package-observed rather than repo-digest-pinned.
Boot manifest and embedded binariessystem.cue:1-144, tools/mkmanifest/src/lib.rs:339-379, tools/mkmanifest/src/main.rs, tools/build-provenance.sh, tools/compare-build-provenance.py, Makefile:168-169, Makefile:332-341Source manifest is checked in; embedded ELF payloads are build artifacts or inline manifest bytes.Manifest validation checks references and path containment. make build-provenance now writes a local provenance record with runner OS/kernel/architecture identity, Rust toolchain details, selected host-tool paths and package identities when discoverable, hashes for the selected manifest, ISO, kernel, OVMF firmware when present, and every embedded binary reported by mkmanifest --print-binaries, including file-backed and inline payloads. make build-provenance-compare compares two retained records for material drift while ignoring generated timestamp and allowed local target/ or .capos-tools/ path-root movement.
Vendored upstream snapshotsvendor/wasmi-no_std/, vendor/dns-c-wahern/, vendor/fatfs-no_std/, vendor/rustls-webpki/, vendor/webpki-roots/, vendor/embedded-tls/, each with a VENDORED_FROM.mdEach vendored tree is a static, pinned snapshot recorded by version/tag, commit SHA when available, commit date when available, vendoring date, and license. vendor/wasmi-no_std/wasmi-1.0.9/ pins wasmi v1.0.9 (commit 61ba65e6563d8b2f5b699b018349d3330b28b9f3, Apache-2.0 OR MIT) consumed by capos-wasm/; vendor/dns-c-wahern/src/ pins William Ahern’s dns.c rel-20160808 (commit 4ec718a77633c5a02fb77883387d1e7604750251, MIT). vendor/rustls-webpki/rustls-webpki-0.103.13/ pins rustls-webpki 0.103.13 (artifact SHA-256 61c429a8…f756e, commit 2879b2ce…728e86, ISC) and vendor/webpki-roots/webpki-roots-1.0.7/ pins webpki-roots 1.0.7 (artifact SHA-256 52f5ee44…2eb9d, commit be948464…221688, CDLA-Permissive-2.0); both are the certificates/TLS Phase-1 verifier deps consumed by the capos-tls/ Phase-1 verifier crate. vendor/embedded-tls/embedded-tls-0.19.0/ pins the embedded-tls 0.19.0 crates.io package (embedded VCS commit 865e1fd983c583228e3bbeb9f4996f1abc454ca3, Apache-2.0) consumed only by the local TLS client handshake smoke. No source patches (one integration-only empty-[workspace] marker per crate); each path dep carries an exact version = "=X.Y.Z" pin so cargo-deny’s wildcards gate stays happy.The wasmi snapshot is exercised by make capos-wasm-build and the WASI smokes plus make dependency-policy-check (cargo-deny + cargo-audit against capos-wasm/Cargo.lock). The rustls-webpki / webpki-roots snapshots are exercised by capos-tls/ under cargo build / cargo build --features qemu (bare-metal x86_64-unknown-none) and by make dependency-policy-check against the root Cargo.lock. The embedded-tls snapshot is exercised by make run-cloud-tls-client-handshake, the focused capOS demo build, and make dependency-policy-check against demos/Cargo.lock. The dns.c snapshot is not yet on the v0 build path; demos/posix-dns-resolver/ compiles only main.c with a commented-out dns.h include. Refreshes follow the procedure recorded in each VENDORED_FROM.md. No vendor/dash/ source-build is present.
Build downloadsMakefile, Cargo lockfiles, rust-toolchain.tomlLimine, CUE, and documentation tool tarballs are explicitly fetched and SHA-256-verified; Cargo and rustup downloads are implicit when caches/toolchains are absent. The build no longer uses a Go toolchain: CUE is now a hash-verified release binary rather than a go install compile, so the actions/setup-go CI step and the floating go-version pin were removed.Limine artifacts, the CUE release binary, and documentation tool tarballs are verified by SHA-256. Cargo downloads rely on upstream tooling and lockfiles, with no separate repo policy beyond the lockfile checksums. Rustup downloads are now gated by the dated nightly-2026-04-20 channel pin (see Rust Toolchain section); only the dist tarballs themselves are not yet mirrored.
GitHub Actions identities and runner OS.github/workflows/ci.ymlEvery third-party Action is pinned by 40-character commit SHA with a trailing # v<X.Y.Z> comment marker. The runner OS is pinned to ubuntu-24.04 rather than the floating ubuntu-latest label.Pin bumps are review-visible as workflow diffs and the trailing version comment makes the intended release auditable. See the GitHub Actions Runner and Workflow Pinning section below for the current pin table and the bump procedure.

Security Verification Track S.10.3 Dependency Policy

Dependency changes are accepted only if they satisfy this policy and are recorded in the owning task checklist.

Dependency classes

Use these classes when reviewing a dependency change:

  • Kernel-critical no_std: crates used directly by kernel, capos-lib, capos-config, and capos-abi.
  • Userspace-runtime no_std: crates used by init, demos, and capos-rt.
  • Host/build: crates used by tools/*, build.rs helpers, and generated output pipelines.
  • Test/fuzz/dev: crates gated by dev-dependencies or target-specific for fuzz/proptests/smoke support.

Required pre-merge criteria

For any added dependency (or bump in any class):

  1. Manifest and features are explicit. Dependency entries must include explicit feature choices; avoid default-features = true unless justified.
  2. No_std compatibility is proven for no_std classes. Kernel-critical and userspace-runtime dependencies must compile in a #![no_std] mode with alloc where expected. cargo build -p <crate> --target x86_64-unknown-none must succeed for every kernel/no_std crate affected.
  3. Security policy checks run and pass. CI-equivalent checks for the touched workspace are required through make dependency-policy-check, which runs cargo deny check on every Cargo manifest and cargo audit on every lockfile.
  4. Dependency class change is justified in review. PR text must include target class, ownership rationale, transitive graph impact, and why the crate is not a transitive replacement for an already-allowed dependency.
  5. Lockfile behavior is explicit. Update only intended lockfiles and record intentional cross-workspace drift in this document if workspace purpose differs.

No_std add/edit checklist

  • Reject crates that require std, OS I/O, or unsupported platform APIs in the dependency path intended for kernel classes.
  • Reject dependencies that re-export broad platform facades or large unsafe surface unless there is a replacement with smaller scope and better audit visibility.
  • Record a license and supply-chain review result (via policy checks) before merge.
  • Confirm no unsafe contract escapes are added without a review surface note in the relevant module.

Standing requirements

  • Add Security Verification Track S.10.3 checks to the target branch plan item for any kernel/no_std crate dependency change and document the exact pass command set.
  • Keep lockfile deltas review-visible in normal PR flow; lockfile pinning is the minimum bar, not the gate.
  • Keep transitive drift in sync with the trust class: class-wide divergence across lockfiles requires explicit justification.

Remaining gaps after Security Verification Track S.10.3 policy

  • Mirror the resolved dated nightly dist tarballs (and their SHA-256 checksums) into the per-user tool cache as a further hardening step, so bumping the pin does not depend on rustup retaining its historical manifests. The dated pin closes the floating-channel gap; tarball mirroring would close the historical-availability gap.
  • Decide whether the local make kani-lib workflow should grow a repo-managed installer/bootstrap helper or continue to rely on separately provisioned user-local cargo-kani plus the Kani bundle/toolchain setup path.
  • CI now publishes target/build-provenance.txt as a named artifact on every qemu-smoke run (see actions/upload-artifact step in .github/workflows/ci.yml) and, on pull_request events, downloads the most recent successful main-branch artifact via actions/download-artifact and runs make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment against it as a blocking PR gate. Missing base provenance is a CI failure, not a silent skip; artifact retention is therefore part of the gate.

Build Provenance Retention And Comparison Policy

Status 2026-06-07 06:35 UTC: this policy applies to local and CI proof artifacts produced by make build-provenance. The qemu-smoke CI job now publishes the candidate record as a named artifact on every run and, on pull_request events, runs make build-provenance-compare against the most recent successful main-branch artifact with BUILD_PROVENANCE_COMPARE_POLICY=ci-environment. That CI policy is PR-blocking for runner, tool, Rust, OVMF package, and OVMF hash drift while allowing expected base-vs-head source commit, ISO/kernel/manifest hash, and embedded-payload hash differences. This remains a reproducibility evidence policy, not a claim that production images are third-party reproducible before the unresolved pinning gates below are closed.

Package-pin bumps for qemu-system-x86, xorriso, make, git, or ovmf are the only planned baseline-refresh case where the PR comparison can fail on purpose: the candidate provenance records the new reviewed package identity, while the base-branch artifact still records the old one. That failure is not a green PR exception and is not a workflow bypass. The bump can land only through a reviewed local-main integration or maintainer push path after the branch’s qemu-smoke build, make run-smoke, and make build-provenance steps pass with the new pins, and the compare diff contains only the reviewed package-identity changes introduced by the same branch. After the bump lands, the next successful main-branch qemu-smoke push artifact becomes the refreshed base provenance; unrelated PRs must wait for that artifact before their blocking environment comparisons can pass.

For every externally cited QEMU proof, release candidate, paper artifact, or public performance/security claim, retain the following as one immutable evidence bundle:

  • target/build-provenance.txt from the exact checked commit, manifest, and recorded worktree state;
  • the kernel, manifest, ISO, OVMF firmware if used, and embedded-binary hashes recorded in that provenance file;
  • the exact command set and QEMU transcript or host-test log used as evidence;
  • the source commit hash, clean-tree assertion or retained git diff plus untracked-file inventory, and any non-default Make variables such as MANIFEST_SOURCE, CAPOS_CUE_TAGS, QEMU_NET, MERMAID_BROWSER_BIN, or CAPOS_TOOLS_ROOT;
  • the system.local.cue overlay state for default-manifest builds: record explicit absence, or retain the file content plus SHA-256 and size. Because the overlay is gitignored and unified into system.cue, a commit hash alone is not enough to reconstruct a default-manifest build that used it;
  • the runner identity: either a pinned CI/container image digest, or the host package identities for Rust, QEMU, xorriso, make, git, OVMF firmware package, and the operating-system image when the runner is not pinned.

Retention requirements:

  • Keep evidence bundles for any tagged release, published paper result, public benchmark, or public security claim for at least the lifetime of that claim.
  • Keep pre-merge task evidence until the reviewed branch has merged and the next full relevant verification has superseded it.
  • Keep failed evidence when it explains a known regression, review finding, or release blocker; otherwise failed local scratch logs may be discarded.
  • Do not rely on target/ as the retention store. target/ artifacts are local build output; retained evidence must be copied to the release, CI, or paper artifact store that owns the claim.

Comparison requirements:

  • Run local comparisons with make build-provenance-compare BASE_PROVENANCE=... CANDIDATE_PROVENANCE=... or tools/compare-build-provenance.py BASE CANDIDATE. The command exits zero only when records differ by generated timestamp and allowed local path roots such as worktree target/ or .capos-tools/, while all hashes, versions, package identities, and runner identities match.
  • Run PR base-vs-head environment comparisons with make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment BASE_PROVENANCE=... CANDIDATE_PROVENANCE=.... This policy compares the default manifest source, host target, runner identity, GitHub-hosted image identity when present, Rust toolchain, selected executable identities, tool versions, OVMF selection, and OVMF hash, but ignores expected source commit, kernel/manifest/ISO hash, and embedded-binary hash changes between the base branch and PR head.
  • For package-pin bump branches, treat a ci-environment comparison failure as acceptable review evidence only when every reported difference is an intended package-identity change for qemu-system-x86, xorriso, make, git, or ovmf from the same branch. Any runner image, Rust toolchain, OVMF firmware hash, tool-version, or unrelated package drift remains blocking.
  • Compare two provenance records by commit, clean or retained-diff state, system.local.cue absence/hash/content policy, manifest source, manifest binary hash, kernel hash, ISO hash, embedded-binary table, OVMF hash or explicit absence, host-tool versions, package identities, and operating-system image identity.
  • A byte-identical ISO requires all recorded hashes to match. Equal source commits with different Rust, QEMU, xorriso, OVMF, or host package identities are compatible proof reruns, not reproducible-production evidence.
  • If a comparison differs only in paths under .capos-tools or worktree-local target/ directories while all hashes and versions match, treat the result as the same proof environment.
  • If a comparison differs in worktree state, overlay state, package identity, operating-system image identity, host-tool version, OVMF hash, Rust compiler commit/date, embedded-binary hash, or ISO hash, record the difference in the owning review or release note before citing the result.

Minimum runner identity for production-hardening branches:

  • Rust must be a date-pinned nightly or stronger hash-pinned toolchain, not the floating nightly channel.
  • QEMU, xorriso, make, and git must come from a pinned runner image digest or a documented package set with package name, version, architecture, repository, and distribution release.
  • OVMF firmware must be either repo-pinned by digest or identified by package name, version, architecture, repository, distribution release, selected path, and SHA-256.
  • Any runner image used for production reproducibility claims must be cited by immutable digest. Mutable tags are acceptable only for local proof evidence.

Production hardening must treat the following as unresolved supply-chain gates, not as cosmetic reproducibility work:

immutable runner image digest or repo-managed tool digests for qemu/xorriso/make/git

The Rust nightly date pin (currently nightly-2026-04-20) closes the floating-channel gate; tarball mirroring is tracked as a further hardening step in the Remaining gaps section above. The qemu-smoke job now installs qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04), xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04), make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04), and ovmf=2024.02-2ubuntu0.8 (amd64, noble-updates/main, Ubuntu 24.04) so the QEMU, ISO writer, make, git, and OVMF firmware identities are all captured for UEFI smoke builds: package name, exact version, architecture, normalized apt source pocket, and the per-run identity captured via dpkg-query/apt-cache policy by make build-provenance. OVMF additionally records the selected path (/usr/share/ovmf/OVMF.fd) and per-run SHA-256, and the Makefile now pins the selected firmware blob by SHA-256 through the ovmf-verify gate wired into the ISO and cloud-disk rules. Repo-pinned digests (download-and-verify rather than apt-installed) for qemu-system-x86, xorriso, make, and git, or an immutable runner image digest that contains them, remain future hardening tracked in docs/design-risks-register.md (R13). xorriso has no version in noble-updates; the pin uses the only available noble/main version, which is what every Ubuntu 24.04 host resolves to.

Until those gates land, generated ISO/manifest/payload artifacts plus target/build-provenance.txt are suitable for local and CI proof evidence, but not for claims that a third party can reproduce an identical production boot image from source alone.

Bootloader and ISO Inputs

The Makefile now pins Limine at commit aad3edd370955449717a334f0289dee10e2c5f01 and verifies these copied artifacts:

ArtifactChecksum reference
$(LIMINE_DIR)/limine-bios.sysLIMINE_BIOS_SYS_SHA256 in Makefile
$(LIMINE_DIR)/limine-bios-cd.binLIMINE_BIOS_CD_SHA256 in Makefile
$(LIMINE_DIR)/limine-uefi-cd.binLIMINE_UEFI_CD_SHA256 in Makefile
$(LIMINE_DIR)/BOOTX64.EFILIMINE_BOOTX64_EFI_SHA256 in Makefile

$(LIMINE_DIR) resolves to $(CAPOS_TOOLS_ROOT)/limine/<LIMINE_COMMIT> (default $HOME/.capos-tools/limine/<commit> unless CAPOS_TOOLS_ROOT is overridden), shared with the rest of the per-user pinned tool cache.

make limine-ensure clones https://github.com/limine-bootloader/limine.git only when $(LIMINE_DIR)/.git is absent, fetches the pinned commit if needed, checks it out detached, and runs make inside the Limine tree (the limine-ensure recipe). make limine-verify then checks the repository HEAD and artifact checksums (the limine-verify recipe). The ISO copies the kernel, generated manifest.bin, Limine config, and verified Limine artifacts into iso_root/, runs xorriso, then runs limine bios-install (the $(ISO) recipe).

Remaining reproducibility gap: Limine source is pinned, but the Limine build host compiler and environment are not pinned or recorded.

Rust Toolchain

rust-toolchain.toml specifies:

  • channel = "nightly-2026-04-20"
  • targets = ["x86_64-unknown-none", "aarch64-unknown-none", "wasm32-wasip1"]
  • components = ["rust-src"]

The wasm32-wasip1 target is needed for the WASI Preview 1 demo payloads (demos/wasi-hello-rust/, demos/wasi-cli-args/, demos/wasi-random/) built by make wasi-hello-rust-build, make wasi-cli-args-build, and make wasi-random-build; the wasm-host binary itself is built for the booted x86_64-unknown-capos userspace target instead.

The pinned dated channel resolves to:

  • rustc 1.97.0-nightly (e22c616e4 2026-04-19)
  • host target x86_64-unknown-linux-gnu

The 2026-04-20 manifest packages the rustc commit cut on 2026-04-19; that is the upstream dist naming convention, not a drift. Rustup will continue to install the same dist tarball for nightly-2026-04-20 as long as upstream retains it.

The Makefile derives HOST_TARGET from rustc -vV (Makefile:12) and uses that for tools/mkmanifest (Makefile:28-29). Cargo aliases in .cargo/config.toml:4-48 hard-code x86_64-unknown-linux-gnu for host tests. The custom userspace target aliases in .cargo/config.toml use targets/x86_64-unknown-capos.json plus -Zjson-target-spec and -Zbuild-std=core,alloc, so rust-src is a required toolchain component. The CI host-baseline and qemu-smoke jobs install the same nightly-2026-04-20 toolchain so CI matches the local rust-toolchain.toml resolution. The kani-proofs job stays on nightly-2025-11-21 because Kani requires its own paired nightly bundle installed by cargo kani setup; advancing the Kani pin is tracked separately through that bundle’s compatibility matrix.

Rust Nightly Date Pin Policy

The pin is one of the supply-chain-trust controls listed in this proposal alongside the Limine commit, OVMF firmware SHA-256, capnp tarball SHA-256, CUE binary, mdBook/mdbook-mermaid release assets, Typst binary, uv binary, and pinned cargo-deny/cargo-audit/cargo-kani releases. All of these must be pinned at the same trust level – date- or hash-anchored, never a floating channel or moving tag. This subsection states the policy for the Rust nightly entry; the next subsection states the mechanical advance procedure.

Where the pin lives. Exactly one source: rust-toolchain.toml as a date-anchored nightly channel of the form nightly-YYYY-MM-DD. The CI workflow’s host-baseline and qemu-smoke toolchain: values must mirror that same dated channel. No other file may declare a nightly date; no float, no nightly shorthand, no commit-hash override.

Promotion criteria. A bump is accepted only when the candidate nightly satisfies all of the following against the worktree where the pin lands:

  • make builds the full workspace clean (kernel + standalone userspace + ISO) with no new warnings under cargo build --features qemu.
  • make fmt-check passes across the workspace and all standalone crates.
  • make workflow-check passes (CLAUDE.md token budget, mandatory-context budgets, slice trailers).
  • make check passes (the aggregate build/test gate that includes generated-code-check and the host-test aliases).
  • make run-smoke passes on the developer host when QEMU smoke is feasible there; if QEMU is unavailable locally, the bump branch’s CI qemu-smoke run is the authoritative gate.
  • Any new rustc warning, lint, or unrelated build failure introduced by the new nightly is treated as a real gate failure. Do not relax capOS code or silence the lint to land the bump.

Rollback. If a promotion exposes a regression in a downstream crate that capOS depends on (limine, x86_64, spin, smoltcp, wasmi, capnp/capnpc, or any cargo-deny/cargo-audit pinned tool), revert the pin to the prior dated channel on main, file a tracking note in docs/tasks/ with the failing date, the failing crate, and the upstream issue if one exists, and resume normal cadence only after the downstream regression is resolved or worked around.

Cadence. Bump the pin at least once per quarter even without a specific feature trigger so production-provenance evidence does not lag upstream. Bump out of cadence when (a) a security advisory affects the current pinned nightly’s rustc/cargo/std (consult rust-lang/rust, rustsec/advisory-db, and the cargo-audit output for the pinned dist), or (b) a compiler feature, fix, or lint that capOS depends on lands upstream. Unbounded float is not permitted: the dated channel must always resolve to a concrete YYYY-MM-DD.

Approvals. Maintainer-driven, single reviewed slice per bump. No automated promotion bot. The pin bump is its own contract change and must not be bundled with unrelated behavior changes; the reviewed diff must show only rust-toolchain.toml, the CI workflow, this proposal’s summary table and resolved-rustc line, and any minimal lint/code adjustments forced by the new nightly with an inline justification.

Trust-input dimension. The pin closes the floating-channel supply- chain gate listed in the Build Provenance Retention And Comparison Policy (“Minimum runner identity for production-hardening branches: Rust must be a date-pinned nightly or stronger hash-pinned toolchain, not the floating nightly channel”). Mirroring the resolved dist tarballs into the per-user tool cache (the same shape as Limine, capnp, CUE, mdBook, and Typst pins) remains a future hardening step tracked in the Remaining gaps section.

Advance procedure (bumping the dated nightly)

When to bump:

  • A compiler feature, fix, or lint that capOS depends on lands in upstream nightly after 2026-04-20. Example triggers: a Cargo or rustc fix that unblocks a build path; a core/alloc change that affects -Zbuild-std; a rustfmt change required for the project formatting baseline.
  • Toolchain drift hygiene: schedule a bump at least once per release window even without a specific feature trigger, so production-provenance evidence does not lag too far behind upstream.

How to bump:

  1. Choose a candidate nightly date and verify all required targets and the rust-src component are simultaneously available for that date:

    rustup toolchain add nightly-<YYYY-MM-DD> \
        --target x86_64-unknown-none \
        --target aarch64-unknown-none \
        --target wasm32-wasip1 \
        --component rust-src
    

    If any target or component is missing, try adjacent dates (rustup’s nightly dist manifests sometimes drop a target for a single day) until one is found that provides the full set.

  2. Update both files in the same commit:

    • rust-toolchain.toml channel value.
    • .github/workflows/ci.yml – both the host-baseline and qemu-smoke toolchain: values. Leave kani-proofs on its own pin.
  3. Run the full local gate set against the candidate before pushing: make fmt-check, cargo build --features qemu, make check, make workflow-check, make run-smoke. Treat any new warning or unrelated build failure as a real gate failure – do not patch around compiler drift by relaxing capOS code.

  4. Update the Rust toolchain row in this file’s summary table, the resolved rustc line above, and the last_reviewed front-matter timestamp. Cite the new dated channel.

  5. Land the pin bump as its own reviewed slice, not bundled with unrelated behavior changes. The pin is itself the provenance contract.

Remaining reproducibility gap: rustup retains nightly dist manifests for a finite window. A future hardening slice may mirror the resolved dist tarballs plus their SHA-256 checksums into the per-user tool cache the same way Limine and capnp are pinned today, so a bump-without-mirror does not become a silent loss of historical reproducibility.

CI Runner Package Pins

The qemu-smoke CI job installs qemu-system-x86, xorriso, make, git, and ovmf via apt on an ubuntu-24.04 runner. Those packages provide the QEMU emulator that executes every QEMU smoke, the ISO writer that builds the bootable image consumed by smokes, the build and repository tools used after checkout, and the UEFI firmware blob selected by make run-uefi and the cloud-disk path. A floating apt install (no =<version> specifier) would let upstream Ubuntu silently roll any of them on the next CI run, so this section names the version pins, the file that owns them, and the procedure for advancing them.

CI Package Pin Policy

The pin is one of the supply-chain-trust controls listed in this proposal alongside the Limine commit, OVMF firmware SHA-256, Rust nightly date pin, capnp tarball SHA-256, CUE binary, mdBook/mdbook-mermaid release assets, Typst binary, uv binary, and pinned cargo-deny/cargo-audit/cargo-kani releases. All of these must be pinned at the same trust level – date- or hash-anchored, never a floating channel or moving tag. This subsection states the policy for the QEMU, xorriso, make, git, and OVMF package entries; the next subsection states the mechanical advance procedure.

Where the pin lives. Exactly one source: the Install boot smoke dependencies step of the qemu-smoke job in .github/workflows/ci.yml. Each package must be invoked as <name>=<exact-version> (no * wildcard and no major-only floor) so the apt resolver fails closed rather than silently rolling forward. The currently pinned versions are qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04), xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04), make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04), and ovmf=2024.02-2ubuntu0.8 (amd64, noble-updates/main, Ubuntu 24.04). The summary table, the QEMU and firmware and ISO and host filesystem tools rows, the Host Tools section, and the Build Provenance Retention And Comparison Policy mirror these strings; the policy text is the single source of truth and the other locations track it.

Promotion criteria. A bump is accepted only when all of the following hold against the bump branch:

  • The Ubuntu base image rolls (noble/noble-updates/noble-security publishes a newer version of the package) or a security advisory affects the currently pinned version. Cosmetic version bumps without an upstream trigger are not accepted; the pin moves forward when there is a reason to move it.
  • apt-cache madison <package> on a current Ubuntu 24.04 host lists the candidate version, and the candidate is available from noble-updates/main (or noble/main when no noble-updates entry exists, as is the case for xorriso today). Versions sourced from third-party PPAs or *-proposed pockets are not accepted.
  • The bump branch’s qemu-smoke execution reaches and passes the new-pin build evidence steps: make build, make run-smoke, make build-provenance, and candidate provenance artifact upload. The pull-request make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment step is expected to fail only on reviewed package-identity fields that match the new pinned strings rather than the previous ones; every other comparison difference remains blocking.
  • No new QEMU, xorriso, make, git, or OVMF behavior is silently relied on: if the bump unlocks a smoke that previously failed, that smoke must be enabled and reviewed in the same bump branch rather than treated as incidental.
  • The Trusted Build Inputs summary table, QEMU and firmware row, ISO and host filesystem tools row, Host Tools section, and Build Provenance Retention And Comparison Policy text are updated to cite the new versions and the new resolved repository (noble/noble-updates) in the same commit.

Rollback. If a promotion exposes a regression in the QEMU smoke path, the ISO writer, build orchestration, repository operations, or UEFI boot, revert the .github/workflows/ci.yml change to the prior pinned version on main, file a tracking task under docs/tasks/ with the failing version, the failing smoke, and the upstream Ubuntu/QEMU/xorriso/OVMF issue if one exists, and resume normal cadence only after the regression is resolved or worked around. Reverting also requires reverting the summary-table and policy text mirrors so the recorded versions stay consistent with the workflow file.

Cadence. Bump the pins at least once per quarter even without a specific security trigger, so production-provenance evidence does not lag upstream Ubuntu point releases. Bump out of cadence when (a) a security advisory affects the current pinned version of any of the packages (consult the Ubuntu Security Notices, the QEMU security mailing list, the Git security advisories, GNU make release notes, and the ovmf/edk2 advisories), or (b) a fix that capOS depends on lands in a newer Ubuntu point release. Unbounded float is not permitted: each package must always resolve to a concrete <epoch>:<upstream>-<debian> version string.

Approvals. Maintainer-driven, single reviewed slice per bump. No automated promotion bot. The pin bump is its own contract change and must not be bundled with unrelated behavior changes; the reviewed diff must show only .github/workflows/ci.yml, this proposal’s summary table and resolved-version mirrors, the relevant production-provenance task record when sub-items move, and any minimal smoke adjustments forced by the new package versions with an inline justification.

Trust-input dimension. The pin closes the runner/OS/tool identity gate listed in the Build Provenance Retention And Comparison Policy (“Minimum runner identity for production-hardening branches: QEMU, xorriso, make, and git must come from a pinned runner image digest or a documented package set with package name, version, architecture, repository, and distribution release”) for the apt-installed package set it owns. A pinned runner image digest (replacing the ubuntu-24.04 mutable label with an immutable image SHA) or repo-managed tool digests for those packages remain future hardening tracked in docs/design-risks-register.md (R13).

Advance procedure (bumping the apt-pinned versions)

When to bump:

  • An Ubuntu Security Notice affects the currently pinned version of qemu-system-x86, xorriso, make, git, or ovmf.
  • A QEMU, xorriso, make, git, or OVMF point release lands in noble-updates/main that capOS needs (typically a virtio, MSI-X, ISO writer, build-tool, repository-tool, or UEFI fix).
  • Quarterly hygiene cadence with no specific feature trigger, so the pin does not lag too far behind upstream.

How to bump:

  1. On a current Ubuntu 24.04 host (or a ubuntu:24.04 container that has refreshed apt-get update), list available versions of each package:

    apt-cache madison qemu-system-x86
    apt-cache madison xorriso
    apt-cache madison make
    apt-cache madison git
    apt-cache madison ovmf
    

    Pick the highest stable version from noble-updates/main. If a package has no noble-updates entry (as is the case for xorriso today), pick from noble/main. Do not select from *-proposed, *-backports, or third-party PPAs.

  2. Update the single source in the Install boot smoke dependencies step of the qemu-smoke job in .github/workflows/ci.yml so each package line reads <name>=<exact-version>.

  3. Update the mirrors in this file in the same commit: the summary-table rows for QEMU and firmware and ISO and host filesystem tools, the Host Tools section, the Build Provenance Retention And Comparison Policy text, and the Remaining gaps for Security Verification Track S.10.2/S.10.3 block under Manifest, Embedded Binaries, and Downloaded Artifacts. Refresh the last_reviewed front-matter timestamp.

  4. If the OVMF package version moves, the OVMF firmware blob SHA-256 may change. Recompute OVMF_CODE_SHA256 in Makefile from the resolved firmware path (/usr/share/ovmf/OVMF.fd on Ubuntu 24.04) and verify make ovmf-verify passes against the new digest. Land the OVMF_CODE_SHA256 change in the same commit as the package bump.

  5. Push the bump branch and let qemu-smoke exercise the new pins through make, make run-smoke, make build-provenance, and candidate provenance artifact upload. The acceptance gate for the bump itself is those steps passing plus a reviewed PR make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment failure whose diff is limited to the intended package-identity strings replacing the previous ones. Land the bump through the reviewed local-main integration or maintainer push path; after it reaches main, the next successful main-branch qemu-smoke push artifact is the new base record for unrelated PR comparisons.

  6. Land the pin bump as its own reviewed slice, not bundled with unrelated behavior changes. The pin is itself the provenance contract.

Remaining reproducibility gap: the ubuntu-24.04 runner label is still managed by GitHub Actions, not by an immutable image digest, so the host package set underneath the apt-installed qemu-system-x86, xorriso, make, git, and ovmf pins can still roll between runs. A future hardening slice may move the qemu-smoke job to a self-built runner image referenced by digest, mirror the apt package files into the per-user tool cache the same way Limine and capnp are pinned today, or both, so a bump-without- mirror does not become a silent loss of historical reproducibility.

Cargo Dependencies

The root workspace members are capos-abi, capos-config, capos-lib, capos-tls, kernel, and the host-only tools/capnp-build build-support crate. Cargo.toml keeps default members to capos-config, capos-lib, capos-tls, and kernel so ordinary root bare-metal builds do not build the host helper as a target package but do build the capos-tls certificates/TLS verifier-dependency probe. The vendored rustls-webpki / webpki-roots path dependencies declare their own [workspace] and are listed in the root Cargo.toml exclude set (the same isolation as the vendored fatfs crate), so they are not workspace members. The vendored embedded-tls client-state machine snapshot follows the same workspace isolation and is consumed only by the standalone demos/ workspace. init/, demos/, tools/mkmanifest/, tools/ringtap-viewer/, capos-rt/, shell/, libcapos/, libcapos-posix/, capos-wasm/, and fuzz/ are standalone workspaces with their own lockfiles.

Important direct dependencies and current root-lock resolutions:

DependencyManifest referencesRoot lock resolution
capos-abicapos-config/Cargo.toml, capos-lib/Cargo.tomllocal path package in Cargo.lock
argon2capos-lib/Cargo.toml; optional capos-config/Cargo.toml credential-validation feature used by kernel/init/mkmanifest bootstrap validation0.5.3 in Cargo.lock
capnpcapos-config/Cargo.toml, capos-lib/Cargo.toml, kernel/Cargo.toml0.25.4 in Cargo.lock
capos-capnp-buildcapos-config/Cargo.tomllocal path package in Cargo.lock
capnpctools/capnp-build/Cargo.toml0.25.3 in Cargo.lock
limine cratekernel/Cargo.toml:8 ("0.6" range)0.6.3 in Cargo.lock
spinkernel/Cargo.toml:9 ("0.9" range)0.9.8 in Cargo.lock
x86_64kernel/Cargo.toml:10 ("0.15" range)0.15.4 in Cargo.lock
linked_list_allocatorkernel/Cargo.toml:11 ("0.10" range)0.10.6 in Cargo.lock
smoltcpkernel/Cargo.toml:16 ("0.13.0" caret range)0.13.0 in Cargo.lock
loomcapos-config/Cargo.toml:270.7.2 in Cargo.lock
proptestcapos-lib/Cargo.toml1.11.0 in Cargo.lock
rustls-webpki (vendored path)capos-tls/Cargo.toml (=0.103.13, default-features = false, alloc)local path package (vendor/rustls-webpki/rustls-webpki-0.103.13) in Cargo.lock
webpki-roots (vendored path)capos-tls/Cargo.toml (=1.0.7, default-features = false)local path package (vendor/webpki-roots/webpki-roots-1.0.7) in Cargo.lock
rustls-pki-typestransitive of the vendored rustls-webpki/webpki-roots (alloc)1.14.1 in Cargo.lock
untrustedtransitive of the vendored rustls-webpki0.9.0 in Cargo.lock
zeroizetransitive of rustls-pki-types (alloc)1.8.2 in Cargo.lock

The four kernel-critical crates limine, spin, x86_64, and smoltcp are declared with semver-range requirements ("0.6", "0.9", "0.15", and the caret "0.13.0"), not the exact =X.Y.Z requirements applied to capnp (=0.25.4) in kernel/Cargo.toml and sha2 (=0.10.9 in capos-lib/Cargo.toml). This requirement-level asymmetry is currently unintentional drift in manifest style rather than a deliberate policy: the exact crate version that ships is still pinned by the checked-in Cargo.lock checksums above and is review-visible through lockfile diffs, so a range requirement does not widen what actually compiles without a lockfile change. Tightening these four manifest requirements to =X.Y.Z to match capnp/sha2 is a separate build-risk change (a manifest edit plus lockfile regeneration and re-verification), tracked as a doc-accuracy gap here rather than changed in this inventory pass.

Standalone lockfile drift observed during this inventory:

The TLS client handshake smoke adds a userspace-runtime no_std dependency in the standalone demos/ workspace: embedded-tls = "=0.19.0" as a path dependency under vendor/embedded-tls/embedded-tls-0.19.0/, with default-features = false and only the rustpki feature enabled. demos/Cargo.lock pins the resulting RustCrypto TLS 1.3 closure. The capOS custom target forces the software AES and POLYVAL backends in .cargo/config.toml so those crypto dependencies do not select x86 accelerated backend code that is outside the custom-target build contract.

LockfileNotable direct/runtime resolution
init/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6
demos/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6
demos/wasi-hello-rust/Cargo.lockSingle-package leaf lockfile for the wasm32-wasip1 Rust hello payload; no third-party direct dependencies.
demos/wasi-cli-args/Cargo.lockSingle-package leaf lockfile for the Phase W.3 argv-grant wasm32-wasip1 Rust payload; no third-party direct dependencies.
demos/wasi-env/Cargo.lockSingle-package leaf lockfile for the WASI environment-grant wasm32-wasip1 Rust payload; no third-party direct dependencies.
demos/wasi-fs/Cargo.lockSingle-package leaf lockfile for the WASI filesystem wasm32-wasip1 Rust payload; no third-party direct dependencies.
demos/wasi-random/Cargo.lockSingle-package leaf lockfile for the Phase W.4 random_get wasm32-wasip1 Rust payload; no third-party direct dependencies.
demos/wasi-preview1-refusals/Cargo.lockSingle-package leaf lockfile for the WASI Preview 1 refusal-coverage wasm32-wasip1 Rust payload; no third-party direct dependencies.
demos/wasi-stdio-fd/Cargo.lockSingle-package leaf lockfile for the WASI stdio-fd wasm32-wasip1 Rust payload; no third-party direct dependencies.
tools/mkmanifest/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, serde_json 1.0.149
tools/adventure-content-gen/Cargo.lockHost generator for adventure content; locked dependencies include serde_json and the cue-export-to-JSON pipeline; no capnp runtime dependency.
tools/paperclips-content-gen/Cargo.lockHost generator for Paperclips content; locked dependencies include serde_json and capnp 0.25.4 for schema-aware JSON-to-binary conversion through mkmanifest cue-to-capnp.
tools/remote-session-client/Cargo.lockStandalone Linux host-side remote-session client; pins capnp 0.25.4 and serde 1.0.228; no transitive wasmi, Argon2, or smoltcp dependency. Covered by make dependency-policy-check.
tools/ringtap-viewer/Cargo.lockcapnp 0.25.4, capnpc 0.25.3; no Argon2 because it uses baseline capos-config
capos-rt/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6
capos-service/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 (the same allocator resolution as capos-rt/demos/libcapos; no cross-workspace drift).
libcapos/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 plus the local capos-rt path dependency.
libcapos-posix/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6, plus local capos-rt and libcapos path dependencies.
shell/Cargo.lockblake2 0.10.6, capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6; no Argon2 because it uses baseline capos-config
capos-wasm/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6, wasmi 1.0.9 (vendored static-pinned at vendor/wasmi-no_std/wasmi-1.0.9/); no Argon2.
fuzz/Cargo.lockcapnp 0.25.4, capnpc 0.25.3, libfuzzer-sys 0.4.12
tools/remote-session-client/src-tauri/Cargo.lock (not yet under policy gates)Tauri scaffold lockfile carrying ~435 transitive packages pinned through tauri = "=2.11.1"; only reachable through make remote-session-tauri policy / check / dev modes. Not covered by make dependency-policy-check today; promotion is gated on the Tauri authority decision.
vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lock (vendored snapshot lockfile)Upstream wasmi workspace lockfile preserved with the static-pinned snapshot; capos-wasm consumes wasmi only through its own =1.0.9 path dependency, which lands in capos-wasm/Cargo.lock and is covered there. The vendored lockfile is not separately gated; see vendor/wasmi-no_std/VENDORED_FROM.md for the refresh procedure and policy re-check.

Cargo lockfiles pin exact crate versions and crates.io checksums, so ordinary crate upgrades are review-visible through lockfile diffs. They do not, by themselves, define whether a dependency is acceptable for kernel/no_std use, whether multiple lockfiles must converge, or whether advisories/licenses block the build.

Security Verification Track S.10.3 policy gate:

  • deny.toml defines the shared license, advisory, ban, and source baseline.

  • The allowed license set is intentionally limited to permissive licenses used by current locked dependencies. BSD-3-Clause is accepted for the Argon2 credential-validation dependency closure (subtle through password-hash, digest, and blake2); it is OSI-approved, FSF-free, and carries only the standard non-endorsement clause beyond the already-allowed BSD-2-Clause. 0BSD is accepted for the smoltcp networking dependency closure (smoltcp and managed); it is OSI-approved and carries no attribution or non-endorsement condition beyond the existing permissive-license baseline.

  • make dependency-policy-check runs cargo deny check on the root workspace, init, demos, tools/mkmanifest, tools/ringtap-viewer, capos-rt, shell, and fuzz.

  • The same target runs cargo audit --deny warnings on every checked-in lockfile, with one explicit audit ignore: RUSTSEC-2026-0173 (proc-macro-error2 unmaintained warning). The ignored path is pulled into lockfiles through smoltcp’s optional defmt logging feature; capOS does not enable defmt for smoltcp, but cargo audit scans lockfiles rather than the target feature set. Remove the ignore when upstream smoltcp / defmt no longer resolves that crate.

  • The same target copies package.json and package-lock.json into a private temporary directory and runs a dry-run install there:

    PUPPETEER_SKIP_DOWNLOAD=1 npm ci --ignore-scripts --dry-run
    

    That preserves the npm ci package/lock synchronization check without modifying the worktree install. It also runs npm audit --package-lock-only --audit-level=high. Lifecycle scripts stay disabled for the docs dependency install path; the browser used by Mermaid PDF rendering is an explicit host executable selected by MERMAID_BROWSER_BIN.

  • capos-config keeps Argon2 behind the credential-validation feature. Bootstrap/config validation remains available in the baseline feature set, while validators that need to parse PHC credential strings enable the feature. Runtime clients and inspection tools that only need ring/schema/CapSet data use the baseline feature set.

  • Local packages are marked publish = false so cargo-deny treats them as private, and local path dependencies include version = "0.1.0" so registry wildcard requirements can remain denied.

  • CI installs pinned cargo-deny 0.19.4 and cargo-audit 0.22.1 and runs the target.

Remaining dependency-policy gap: decide whether standalone lockfiles may intentionally drift from the root lockfile, especially for capnp and allocator crates used by userspace.

Cap’n Proto Compiler, Runtime, and Generated Bindings

The trusted Cap’n Proto inputs are:

  • schema/capos.capnp, the source schema.
  • Repo-local pinned capnp, invoked through the capnpc Rust build dependency via CAPOS_CAPNP.
  • capnp runtime crate with default-features = false and alloc.
  • capnpc codegen crate.
  • Generated capos_capnp.rs written to Cargo OUT_DIR.
  • Local no_std patching applied after generation by tools/capnp-build.

capos-config/build.rs delegates schema generation to tools/capnp-build. That shared helper runs capnpc::CompilerCommand over schema/capos.capnp, reads the generated capos_capnp.rs, asserts that the expected #![allow(unused_variables)] anchor is present, and injects:

#![allow(unused)]
#![allow(unused_imports)]
fn main() {
use ::alloc::boxed::Box;
use ::alloc::string::ToString;
}

The generated code used by builds is included from OUT_DIR in capos-config/src/lib.rs:10-12. The expected patched output is checked in as tools/generated/capos_capnp.rs, so schema, compiler, capnpc crate, and patch-output changes must update that baseline and become review-visible as a source diff.

Security Verification Track S.10.2 generated-code drift check:

  • make generated-code-check first builds the checked-in init ELF required by kernel build-script validation, exports its absolute path as CAPOS_INIT_ELF, and runs tools/check-generated-capnp.sh, tools/check-generated-adventure-content.sh, and tools/check-generated-paperclips-content.sh.
  • The script invokes the actual Cargo build-script path for capos-config in an isolated target directory, so it checks the generated artifact that crate would include from OUT_DIR.
  • During that build, tools/capnp-build also copies the patched binding to a deterministic package-scoped path under the isolated target directory. The checker consumes those explicit paths rather than searching Cargo’s hashed build-script output directories.
  • The script verifies that the patched file still contains the capnpc anchor plus the local no_std patch imports, compares the output against tools/generated/capos_capnp.rs, and fails if a kernel-generated output path appears in the isolated target directory.
  • Any intentional schema/codegen/patch change must update the checked-in baseline in the same review, making generated output drift review-visible.
  • make check runs fmt-check plus generated-code-check for a single local or CI entry point.
  • Current pinned compiler source is capnproto-c++-1.2.0.tar.gz from https://capnproto.org/ with SHA-256 ed00e44ecbbda5186bc78a41ba64a8dc4a861b5f8d4e822959b0144ae6fd42ef. The checked-in tools/generated/capos_capnp.rs baseline must be regenerated with that compiler when schema or codegen behavior intentionally changes. The current pinned baseline SHA-256 is 5ab84731324fe9cc984d7aba7dd97963a773800cc52c4c1693fcb6bb448329a6.

Adventure content generation uses:

  • demos/adventure-content/content/prototype.cue as the checked-in source.
  • tools/adventure-content-gen, a standalone Cargo host tool with tools/adventure-content-gen/Cargo.lock.
  • demos/adventure-content/src/generated.rs as the checked-in generated no_std Rust baseline consumed by demos/adventure-content/src/lib.rs.
  • tools/check-generated-adventure-content.sh, which derives the same $(CAPOS_TOOLS_ROOT)/cue/0.16.0/bin/cue path as the Makefile, rejects a mismatched CAPOS_CUE, checks cue version v0.16.0, exports explicit JSON, runs the generator with cargo run --locked, formats the output with rustfmt --edition 2024, and fails if the result differs from demos/adventure-content/src/generated.rs.

Any intentional content-source or generator change must update the checked-in generated Rust baseline in the same review. The generator manifest and lockfile are included in make dependency-policy-check.

The no_std patch source is single-owned by tools/capnp-build; capos-config/build.rs emits its crate-specific rerun directives and calls the helper.

Alloy Analyzer (DMA Assurance Model)

The DMA assurance Alloy model (models/dma/dma_authority.als) is checked by a pinned Alloy Analyzer, the same trust level as the Limine, capnp, CUE, Typst, and uv pins.

  • Pinned artifact: alloy-6.2.0-linux-amd64.tar.gz from the official AlloyTools/org.alloytools.alloy GitHub release v6.2.0 (https://github.com/AlloyTools/org.alloytools.alloy/releases/download/v6.2.0/alloy-6.2.0-linux-amd64.tar.gz), SHA-256 5a5494a4bac6e243e471590bb44a91e25a35794a5af1ae1f332be30b9c54a9e7. This is the self-contained linux/amd64 app image: it bundles a Temurin JRE under lib/runtime/ and the native SAT solver libraries, so the gate needs no host JVM and pins the analyzer and its runtime by one hash. The org.alloytools.alloy.dist.jar (bare jar, host-JVM dependent) is deliberately not used.
  • Where the pin lives: Makefile ALLOY_VERSION / ALLOY_PLATFORM / ALLOY_TARBALL_URL / ALLOY_TARBALL_SHA256. make alloy-ensure downloads the tarball (curl with retry), verifies the SHA-256, extracts the app image into $(CAPOS_TOOLS_ROOT)/alloy/6.2.0/ (shared per-user cache, default $HOME/.capos-tools), and confirms the launcher reports version 6.2.0. The jar is not vendored into the repository.
  • Drift review: make model-dma-alloy re-verifies the tarball SHA-256 and the reported launcher version on every run before invoking the model. A bump is a Makefile ALLOY_VERSION + ALLOY_TARBALL_SHA256 diff plus a refreshed checked-result record in models/dma/README.md.
  • Why output is parsed, not exit-code-gated: the Alloy CLI exec subcommand always exits 0; a check that finds a counterexample, a run that finds no instance, and a syntax/resolution error all return success with the failure visible only in the printed verdict table. tools/run-dma-alloy-model.sh parses that table and fails closed on any check that is not UNSAT, any run that is not SAT, or any analyzer error marker.
  • Platform/CI scope: the pinned app image is linux/amd64 (the dev/CI host architecture). GitHub CI runs make model-dma-alloy in the dma-assurance-models job on ubuntu-24.04. Other architectures would need the matching Alloy app image (or the bare jar plus a host JVM). Ownership of the Alloy pin is shared with the scheduler lease model track (scheduler-cpu-isolation-lease-authority-model).

TLC Model Checker (DMA Assurance Lifecycle Model)

The DMA assurance TLA+ lifecycle model (models/dma/dma_authority.tla) is checked by a pinned TLC, the same trust level as the Limine, capnp, CUE, Typst, uv, and Alloy pins. Unlike the self-contained Alloy app image, tla2tools.jar is a bare Java jar, so a JVM is pinned alongside it.

  • Pinned artifacts: tla2tools.jar from the official tlaplus/tlaplus GitHub release v1.7.4 (TLC 2.19), https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar, SHA-256 936a262061c914694dfd669a543be24573c45d5aa0ff20a8b96b23d01e050e88; and a Temurin JRE 17.0.19+10 linux/x64 tarball (OpenJDK17U-jre_x64_linux_hotspot_17.0.19_10.tar.gz from the adoptium/temurin17-binaries release), SHA-256 adb5a2364baa51de1ef91bb9911f5a61d24b045fe1d6647cb8050272a3a8ee75. Pinning the JRE as well as the jar fixes both the checker and its runtime by hash.
  • Where the pin lives: Makefile TLA_TOOLS_VERSION / TLA_TOOLS_JAR_URL / TLA_TOOLS_JAR_SHA256 / TLA_JRE_URL / TLA_JRE_SHA256. make tla-ensure downloads both (curl with retry), verifies their SHA-256, extracts the JRE into $(CAPOS_TOOLS_ROOT)/tla/jre/ and places the jar at $(CAPOS_TOOLS_ROOT)/tla/1.7.4/tla2tools.jar (shared per-user cache, default $HOME/.capos-tools), and confirms the launcher reports 17.0.19. Neither is vendored into the repository.
  • Drift review: make model-dma-tla re-verifies the jar SHA-256 and the JRE launcher version on every run before invoking the model. A bump is a Makefile pin diff plus a refreshed checked-result record in models/dma/README.md.
  • Why output is parsed and exit-code-gated: TLC returns a non-zero exit code (12) on an invariant violation, a deadlock, or a parse/semantic error, but tools/run-dma-tla-model.sh additionally asserts the Model checking completed. No error has been found. marker and rejects any violation/error marker, so a future TLC behaviour change cannot turn a violation into a green gate. The model is checked with deadlock detection enabled; the spec provides an explicit terminating self-loop for the all-pages-parked state, so any other stuck state is a genuine modelling gap.
  • Platform/CI scope: the pinned JRE tarball is linux/x64 (the dev/CI host architecture). GitHub CI runs make model-dma-tla in the dma-assurance-models job on ubuntu-24.04; other architectures would need the matching Temurin JRE. Ownership of the TLC pin is shared by the scheduler/IRQ TLA+ model tracks (scheduler-nohz-activation-model, irq-msix-waiter-determinism-model).

Cargo Build Scripts

Build scripts currently do these trusted operations:

ScriptBehavior
kernel/build.rsWatches kernel/linker-x86_64.ld and itself.
capos-config/build.rsCalls tools/capnp-build to watch schema/capos.capnp, generate bindings, and apply the shared no_std patch. Checked by make generated-code-check.
tools/capnp-build/src/lib.rsHost build-support helper for pinned capnp path validation, schema generation, and no_std generated-binding patching. Unit tests cover patch injection and missing-anchor rejection.
tools/adventure-content-gen/src/main.rsHost generator for the prototype adventure CUE source. Checked by make generated-code-check through tools/check-generated-adventure-content.sh, which uses pinned CUE and locked Cargo dependencies.
init/build.rsEmits a linker script argument for init/linker.ld.
demos/*/build.rsEmits a linker script argument for demos/linker.ld.
capos-rt/build.rsEmits a linker script argument for capos-rt/linker.ld when building current target_os = "none" userspace or custom-target target_os = "capos" probes.
capos-wasm/build.rsEmits a linker script argument for capos-wasm/linker.ld (Phase W.2 onward; uses cargo:rustc-link-arg-bins so the script applies only to the wasm-host bin and not the lib).

The linker build scripts derive CARGO_MANIFEST_DIR from Cargo and only emit link arguments plus rerun directives. The capnp build scripts read and rewrite generated code under OUT_DIR. None of these scripts fetch network resources.

Security Verification Track S.10.2 coverage: make generated-code-check exercises the canonical capos-config capnp build script through Cargo, validates the patched generated file, fails if kernel-generated output reappears, and fails if the canonical output no longer matches the checked-in generated baseline.

Manifest, Embedded Binaries, and Downloaded Artifacts

system.cue declares named binaries and services. Makefile builds manifest.bin by running tools/mkmanifest on the host. mkmanifest runs:

  1. Resolve the pinned CUE compiler from $(CAPOS_TOOLS_ROOT), reject missing or mismatched CAPOS_CUE, check cue version v0.16.0, then run cue export system.cue --out json or package-mode equivalent.
  2. JSON-to-CueValue conversion and manifest validation (tools/mkmanifest/src/lib.rs).
  3. Binary embedding from relative paths (tools/mkmanifest/src/lib.rs).
  4. Binary-reference validation and Cap’n Proto serialization (tools/mkmanifest/src/main.rs).

The adjacent mkmanifest cue-to-capnp subcommand uses the same pinned CUE export path but does not parse the result as SystemManifest. Instead, it resolves and validates CAPOS_CAPNP, checks Cap'n Proto version 1.2.0, and passes the exported JSON to Cap’n Proto:

capnp convert json:binary <schema.capnp> <RootType>

It is the supported schema-aware path for CUE-authored data messages rooted at arbitrary specified Cap’n Proto structs; live capabilities and interface objects are outside that data-file contract.

Path handling rejects absolute paths, parent traversal, non-normal components, and canonicalized paths that escape the manifest directory (tools/mkmanifest/src/lib.rs). The generated manifest.bin is copied into the ISO as /boot/manifest.bin and loaded by Limine via limine.conf:5.

Downloaded or generated artifacts in the current build:

ArtifactProducerPinning/drift status
$(LIMINE_DIR) checkout ($(CAPOS_TOOLS_ROOT)/limine/<commit>)git clone/git fetch in the limine-ensure recipeCommit-pinned and artifact-verified.
Cargo registry cratescargo build, cargo run, tests, fuzzLockfile-pinned checksums plus CI-enforced deny/audit checks through make dependency-policy-check.
Node registry packagesnpm ci --ignore-scripts for docs Mermaid renderingpackage-lock.json pins package tarball integrity. Lifecycle scripts are disabled, Puppeteer’s browser download path is skipped, and make dependency-policy-check enforces the npm ci package/lock synchronization invariant plus high-severity npm audit state.
Chromium/Chrome for Mermaid PDF renderingHost executable selected by MERMAID_BROWSER_BIN or auto-detected from chromium-browser, chromium, google-chrome-stable, or google-chromeHost-provided browser, not repo-pinned. The docs PDF target fails closed if no executable is available and passes the selected path to Puppeteer as PUPPETEER_EXECUTABLE_PATH, rather than allowing Puppeteer’s npm install script to download an implicit browser artifact.
Rust toolchain, targets, and rust-srcrustup from rust-toolchain.toml when absentDate-pinned nightly-2026-04-20 channel; rust-src is declared for custom-target -Zbuild-std userspace builds. The advance procedure for bumping the pin lives in the Rust Toolchain section above.
target/ kernel and host artifactsCargoGenerated, not checked in.
init/target/, demos/target/, capos-rt/target/, capos-wasm/target/ ELFsCargo standalone buildsGenerated, embedded into manifest.bin where referenced; make build-provenance records hashes for embedded file-backed and inline payloads.
target/x86_64-unknown-capos/, init/target/x86_64-unknown-capos/, demos/target/x86_64-unknown-capos/, shell/target/x86_64-unknown-capos/, capos-rt/target/x86_64-unknown-capos/, libcapos/target/x86_64-unknown-capos/, libcapos-posix/target/x86_64-unknown-capos/, and capos-wasm/target/x86_64-unknown-capos/ userspace artifactsCargo aliases using targets/x86_64-unknown-capos.jsonGenerated artifacts for booted userspace manifests, the capos-rt smoke binary, the wasm-host Phase W.2 binary, and the libcapos / libcapos-posix C-substrate staticlibs.
manifest.bintools/mkmanifestGenerated from system.cue plus ELF payloads; not checked in. Hash is recorded by make build-provenance.
iso_root/ and capos.isoMakefile, xorriso, Limine installerGenerated and gitignored; Limine inputs verified. Final ISO hash is recorded by make build-provenance.
target/build-provenance.txttools/build-provenance.sh via make build-provenanceGenerated and gitignored; records runner OS/kernel/architecture identity, GitHub Actions image identity when present, Rust toolchain details, selected executable paths, package identities when discoverable, OVMF selected path/package/absence state, tool versions, git commit, manifest/ISO/kernel/OVMF hashes, and embedded payload origin plus hashes. CI publishes the artifact as build-provenance-<sha> on every qemu-smoke run (30-day retention). On pull_request events the qemu-smoke job locates the most recent successful main-branch build-provenance-<sha> artifact, downloads it via actions/download-artifact, and runs make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment against the candidate record as a blocking PR gate.

Remaining gaps for Security Verification Track S.10.2/S.10.3:

  • CI now publishes target/build-provenance.txt as a named artifact on every qemu-smoke run (30-day retention) and, on pull_request events, downloads the most recent successful main-branch build-provenance-<sha> artifact and runs make build-provenance-compare against the candidate record with BUILD_PROVENANCE_COMPARE_POLICY=ci-environment. The compare step is PR-blocking for runner/tool/Rust/OVMF environment drift and fails when the base artifact cannot be found.
  • qemu-smoke apt-pins qemu-system-x86, xorriso, make, git, and ovmf, and make build-provenance records normalized package identity where the runner exposes it. Repo-pinned digests (download-and-verify rather than apt-installed packages) for qemu-system-x86, xorriso, make, and git, or an immutable runner image digest containing that package set, remain future production-reproducibility hardening tracked in docs/design-risks-register.md (R13).
  • Decide whether CI should record the pinned cue export JSON or final manifest.bin bytes if manifest reproducibility becomes release-critical.

Vendored Upstream Snapshots

The repository carries static, pinned snapshots of selected upstream sources under vendor/. Each snapshot has its own VENDORED_FROM.md recording the upstream URL, tag/version, commit SHA, commit date, vendoring date, license, vendoring posture, and refresh procedure. Snapshots are kept byte-identical to their pinned upstream artifact (git commit, or the crates.io published crate as noted below); the only non-upstream changes permitted without a patches/ unified diff are the documented integration-only empty-[workspace] marker and build-inert files restored from the same upstream commit when the publish include omitted them (for rustls-webpki, src/test_utils.rs and rustfmt.toml, both recorded in its VENDORED_FROM.md). Any future functional patch must be recorded as a unified diff under the snapshot’s patches/ directory plus a Patches entry per the procedure in the snapshot’s VENDORED_FROM.md.

SnapshotUpstreamTag/VersionCommit SHALicenseConsumer
vendor/wasmi-no_std/wasmi-1.0.9/https://github.com/wasmi-labs/wasmiv1.0.961ba65e6563d8b2f5b699b018349d3330b28b9f3Apache-2.0 OR MIT (dual)capos-wasm/ (WASI host adapter wasm-host bin and Preview 1 import surface)
vendor/dns-c-wahern/src/https://github.com/wahern/dnsrel-201608084ec718a77633c5a02fb77883387d1e7604750251MITPOSIX adapter Phase P1.2 Phase B DNS smoke; not yet on the v0 build path (the smoke compiles only demos/posix-dns-resolver/main.c with a commented-out dns.h include)
vendor/rustls-webpki/rustls-webpki-0.103.13/https://github.com/rustls/webpki0.103.13 (crates.io crate)2879b2ce7a476181ac3050f73fe0835f04728e86ISCcapos-tls/ Phase-1 verifier (WebPKI X.509 path building + signature verification, no_std + alloc, no crypto provider in the default build)
vendor/webpki-roots/webpki-roots-1.0.7/https://github.com/rustls/webpki-roots1.0.7 (crates.io crate)be948464fd5907af6227213a066743a161221688CDLA-Permissive-2.0capos-tls/ Phase-1 trust-anchor bootstrap (compiled-in Mozilla NSS root bundle, no_std)
vendor/embedded-tls/embedded-tls-0.19.0/https://github.com/drogue-iot/embedded-tls0.19.0 (crates.io crate)865e1fd983c583228e3bbeb9f4996f1abc454ca3Apache-2.0demos/cloud-tls-client-handshake-smoke/ TLS 1.3 client state machine (no_std + alloc, default std/tokio features disabled, rustpki enabled)

The rustls-webpki and webpki-roots snapshots use a published-crate posture distinct from the git-clone snapshots above: each is the crates.io published .crate artifact, SHA-256-verified against the crates.io index (61c429a8…f756e and 52f5ee44…2eb9d respectively), with the upstream commit recorded from the artifact’s embedded .cargo_vcs_info.json. For rustls-webpki, two build-inert files the publish include omitted (src/test_utils.rs, a #[cfg(test)] module, and rustfmt.toml) are restored from the same upstream commit so cargo fmt resolves the module tree and formats the snapshot under upstream’s own style config; see its VENDORED_FROM.md. capos-tls/ (a root-workspace member and the Phase-1 host verifier crate) depends on both as exact-pinned path dependencies (version = "=0.103.13" / version = "=1.0.7") and forces them to link for x86_64-unknown-none under cargo build and cargo build --features qemu. rustls-webpki is selected with default-features = false, features = ["alloc"], so neither std nor a ring / aws-lc-rs crypto provider is compiled; the active compiled closure is rustls-pki-types (alloc, pulling zeroize) and untrusted (ISC), all pinned in the root Cargo.lock and covered by make dependency-policy-check. The ring optional dependency appears in Cargo.lock as an unselected optional entry; aws-lc-rs is a feature-gated optional / dev-only dependency and does not resolve into the root lockfile at all. Neither is ever feature-activated, so no crypto provider is compiled and cargo deny does not evaluate ring (cargo tree -p capos-tls -e features activates only rustls-pki-types, zeroize, untrusted, webpki-roots, and rustls-webpki[alloc]). The webpki-ring feature is host-test-only and supplies the signature algorithms for cargo test-tls; the default bare-metal build remains provider-free.

The embedded-tls snapshot uses the same published-crate posture: the vendored tree is the crates.io 0.19.0 package with .cargo_vcs_info.json recording upstream commit 865e1fd983c583228e3bbeb9f4996f1abc454ca3. The local handshake smoke depends on it with an exact path pin, disables default std and tokio, and enables only rustpki so the TLS 1.3 client path can run under target_os = "capos" over a TcpSocket cap. The empty [workspace] marker is the only local integration change.

capos-wasm/Cargo.toml pins the wasmi path dependency to version = "=1.0.9" so cargo-deny’s wildcards gate continues to pass; the snapshot is exercised by make capos-wasm-build, every make run-wasi-* smoke, and make dependency-policy-check (cargo-deny + cargo-audit on capos-wasm/Cargo.lock). Refreshing wasmi to a newer tag requires the rsync pattern, manifest pin bump, lockfile regeneration, and policy re-check recorded in vendor/wasmi-no_std/VENDORED_FROM.md.

The dns.c snapshot is intentionally a strict subset (only src/dns.c, src/dns.h, LICENSE, and README.md); ancillary upstream files (cache, mem, spf, zone, regress) are excluded because the v0 build path does not need them. Future POSIX-adapter phases that widen libcapos-posix enough to compile dns.c whole will start consuming the snapshot in the build instead of carrying it as a documentation-only reference.

vendor/dash/ is not present at this revision. If a future POSIX-adapter phase imports dash, add a new row above plus a vendor/dash/VENDORED_FROM.md recording the same provenance fields.

Host Tools

Current local host versions observed during this inventory:

ToolObserved versionBuild role
capnp1.2.0Repo-selected schema compiler built by make capnp-ensure from a SHA-256-pinned official source tarball into $(CAPOS_TOOLS_ROOT).
cuev0.16.0Repo-selected manifest compiler installed by make cue-ensure into $(CAPOS_TOOLS_ROOT) from the SHA-256-verified official release binary.
qemu-system-x86_6410.2.2Boot verification via make run and make run-uefi.
xorriso1.5.8ISO generation.
make4.4.1Build orchestration.
git2.53.0Limine checkout/fetch and review workflow.

These are local environment observations, not repository pins. On the qemu-smoke CI runner, qemu-system-x86, xorriso, make, and git are apt-pinned to qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04), xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04), make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), and git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04) by the “Install boot smoke dependencies” step; the per-run identity for each is captured via dpkg-query and normalized apt source pockets by tools/build-provenance.sh. The bump procedure mirrors OVMF: run apt-cache madison <tool> on a current Ubuntu 24.04 host, pick the highest stable version from noble-updates/main (or noble/main when no noble-updates entry exists, as is the case for xorriso and make today), and update the pinned version string in .github/workflows/ci.yml plus this row. make run-uefi selects an OVMF firmware blob from OVMF_CODE_CANDIDATES in Makefile:96-97; the Makefile pins the expected blob via OVMF_CODE_SHA256 and the ovmf-verify target enforces the match before ISO and cloud-disk construction. On the qemu-smoke CI runner the ovmf=2024.02-2ubuntu0.8 apt-install resolves the first candidate (/usr/share/ovmf/OVMF.fd), ovmf-verify succeeds against the pinned digest, and make build-provenance records the resulting firmware-blob SHA-256 per run. Build hosts without any OVMF candidate installed see an ovmf-verify NOTICE skip rather than a failure, so research workflows that never invoke make run-uefi continue to build the ISO unchanged.

Remaining gap for Security Verification Track S.10.3: decide whether full production reproducibility uses an immutable runner image digest, repo-managed download-and-verify tool digests for the apt-pinned build/boot tools, or both. build-essential, curl, sha256sum, the shell, and the checkout-time git used by actions/checkout remain runner-provided; the PR-blocking provenance gate records and compares the post-checkout build environment, but it does not turn the mutable ubuntu-24.04 runner label into an immutable production image.

GitHub Actions Runner and Workflow Pinning

The CI harness in .github/workflows/ci.yml is itself a supply-chain input: its identities determine which third-party code runs against every push and pull request, and the chosen runner image determines the host package set underneath every host-baseline, Kani, and optional QEMU job. Mutable @v<N> or @master references on third-party Actions would allow upstream owners to swap out the executed code at any time without a repository diff, and ubuntu-latest would silently roll the runner OS when GitHub re-points it.

The current policy is to pin every third-party Action to a 40-character commit SHA and to pin the runner OS to a specific release rather than the floating label. Each pinned uses: line carries a trailing # v<X.Y.Z> comment so reviewers and bump PRs can read the intended release without following the SHA through the GitHub UI.

IdentityPinned referenceNotes
runs-on: runner imageubuntu-24.04Replaces ubuntu-latest; applied to host-baseline, kani-proofs, dma-assurance-models, and qemu-smoke. GitHub-hosted ImageOS and ImageVersion are recorded in target/build-provenance.txt when present and are compared by the PR-blocking CI environment policy. Bump only when the next LTS is needed and the full make check plus QEMU smokes are reverified against the new image.
actions/checkout34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1Resolved from the actions/checkout v4 major-version tag.
swatinem/rust-cachec19371144df3bb44fab255c43d04cbc2ab54d1c4 # v2.9.1Canonical Swatinem/rust-cache v2.9.1 release commit. The v2 major-tracking tag carries the same 2.9.1 message but points at a distinct republication commit; always dereference the exact release tag rather than the major tag.
dtolnay/rust-toolchain3c5f7ea28cd621ae0bf5283f0e981fb97b8a7af9 # master @ 2026-03-27The upstream action does not publish numbered releases; its documented usage is @master. The pin is a snapshot of master at the dated commit.
actions/upload-artifactea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2Resolved from the actions/upload-artifact v4.6.2 lightweight tag (same SHA as the moving v4 major tag at resolution time). Used in qemu-smoke to publish target/build-provenance.txt.
actions/download-artifactd3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0Resolved from the actions/download-artifact v4.3.0 release tag. Paired with actions/[email protected] (both v4 series) and used in qemu-smoke on pull_request events to fetch the most recent successful main-branch build-provenance-<sha> artifact for the blocking make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment step.

Bump procedure for any of the entries above:

  1. Resolve the candidate release to its commit SHA via the upstream release tag, e.g. gh api repos/<owner>/<repo>/git/ref/tags/v<X.Y.Z> (dereference any annotated tag through gh api repos/<owner>/<repo>/git/tags/<sha>), or gh api repos/<owner>/<repo>/commits/<branch> for branch-tracked actions like dtolnay/rust-toolchain. Always dereference the exact release tag (vX.Y.Z) rather than the moving major-version tag (vX): major-version tags can be re-cut at a republication commit whose tag message still names the same release (as observed for swatinem/rust-cache@v2), so following vX can pin a different commit than the canonical vX.Y.Z release.
  2. Update both the SHA and the trailing # v<X.Y.Z> comment in .github/workflows/ci.yml so the reviewer sees the intended release.
  3. Run make fmt-check and make workflow-check locally for the bump branch. Workflow hygiene plus YAML well-formedness must pass before review. The acceptance gate for the bump itself is a green CI run on the bump branch – make check plus the existing QEMU smokes – which exercises the new Action versions end-to-end.
  4. Treat any master-branch SHA pin (currently dtolnay/rust-toolchain) as a manual-bump dependency: the upstream action does not publish release tags, so bumping its SHA is the only way to absorb upstream fixes. Schedule those bumps explicitly rather than relying on a floating reference.

This pinning closes the mutable-tag supply-chain gap for the CI harness itself. It does not by itself satisfy the “pinned runner image digest” line of the Build Provenance Retention And Comparison Policy: ubuntu-24.04 is still a label managed by GitHub Actions, not an immutable image digest. The current documented equivalent for PR gating is to retain the GitHub-hosted ImageOS/ImageVersion fields in target/build-provenance.txt and compare them against the latest successful main-branch record. A future production-hardening slice may move to a self-built runner image referenced by digest, mirror the build-tool packages, or both.

Inventory Method

This inventory is based on source inspection, Cargo metadata, lockfile checks, and local host-tool version queries. Local host-tool versions are observations, not repository pins; the tables above distinguish enforced pins from observed environment state.

Useful commands for refreshing the inventory:

  • git status --short --branch
  • rg -n "S\\.10|trusted|supply|Limine|limine|capnp|capnpc|QEMU|qemu|download|curl|git clone|wget|build\\.rs|rust-toolchain|Cargo\\.lock" ...
  • rg --files
  • cargo metadata --locked --format-version 1 --no-deps
  • rg -n '^name = |^version = |^checksum = ' Cargo.lock init/Cargo.lock demos/Cargo.lock tools/mkmanifest/Cargo.lock tools/ringtap-viewer/Cargo.lock capos-rt/Cargo.lock shell/Cargo.lock libcapos/Cargo.lock libcapos-posix/Cargo.lock capos-wasm/Cargo.lock fuzz/Cargo.lock
  • command -v rustc cargo capnp cue qemu-system-x86_64 xorriso sha256sum git make
  • rustc -Vv, cargo -V, capnp --version, cue version, qemu-system-x86_64 --version, xorriso -version, make --version, git --version

Panic-Surface Inventory

Scope: panic!, assert!, debug_assert!, .unwrap(), .expect(), todo!, and unreachable! surfaces relevant to boot manifest loading, ELF loading, SQE handling, params/result buffers, IPC, and future spawn inputs.

Classification terms:

  • trusted-internal: depends on kernel/shared-code invariants, static ABI layout, or host build/test code; not directly controlled by a service.
  • boot-fatal: reached during boot/package setup before mutually untrusted services run. Bad platform/package state can halt the system.
  • untrusted-input reachable: reachable from userspace-controlled SQEs, Cap’n Proto params/result buffers, IPC state, manifest/package data, or future spawn-controlled service/binary data.

Summary

No current panic!/assert!/unwrap()/expect() site found in the kernel ring dispatch path directly consumes raw SQE fields or user params/result-buffer pointers. Those paths mostly return CQE errors through kernel/src/cap/ring.rs.

The remaining relevant surfaces are boot-fatal setup assumptions, scheduler internal invariants that would become more exposed once untrusted spawn/lifecycle inputs can create or destroy processes dynamically, and IPC rollback queue capacity assumptions.

Locations use path::function anchors rather than line numbers; line numbers drift on every refactor. Grep the path plus the quoted surface text to re-locate a site.

Manifest And Future Spawn Inputs

LocationSurfaceReachabilityClassificationNotes
kernel/src/main.rs run_initMODULES.response().expect("no modules from bootloader")Boot package/module tableboot-fatalMissing Limine modules abort before manifest validation.
kernel/src/main.rs run_initelf_cache.get(service.binary.as_str()).ok_or_else(...)Manifest service binary referenceuntrusted-input reachable, controlled errorNot a panic surface. Included because it is the future spawn shape to preserve: unknown or unparsed binaries return an error.
kernel/src/spawn.rs spawn_serviceProcess::new(...).map_err(...)Manifest-spawned process creationuntrusted-input reachable, controlled errorCurrent boot path converts allocation/mapping failures into boot errors. Future ProcessSpawner should keep this shape instead of adding unwraps.

ELF Inputs

LocationSurfaceReachabilityClassificationNotes
kernel/src/spawn.rs load_elfdebug_assert!(stack_top % 16 == 0, ...)ELF load pathtrusted-internalConstant stack layout invariant, not ELF-controlled.
kernel/src/spawn.rs align_updebug_assert!(align.is_power_of_two())TLS mapping from parsed ELFtrusted-internalelf::parse rejects non-power-of-two TLS alignment; load_tls also caps the size before calling align_up.
capos-lib/src/elf.rs parserno runtime panic surfaces outside tests/KaniBoot manifest ELF bytes; future spawn ELF bytesuntrusted-input reachable, controlled errorParser uses checked offsets/ranges and returns Err(&'static str). Test-only assertions/unwraps are excluded from runtime classification.
kernel/src/spawn.rs load_elfslice init_data[src_offset..]Parsed ELF PT_LOAD file rangeuntrusted-input reachable, guardedNot matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks segment file ranges before load_elf.
kernel/src/spawn.rs load_tlsslice &init_data[init_start..init_end]Parsed ELF TLS file rangeuntrusted-input reachable, guardedNot matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks TLS file bounds before load_tls.

SQE And Params/Result Buffers

LocationSurfaceReachabilityClassificationNotes
kernel/src/cap/ring.rs process_ring / dispatch_call / dispatch_recv / dispatch_returnno matched panic-like surfacesUserspace SQEs, params, result buffersuntrusted-input reachable, controlled errorSQ corruption, unsupported fields/opcodes, oversized buffers, invalid user buffers, and CQ pressure return transport errors or defer consumption.
capos-config/src/ring.rs const _: () = assert!(...) ABI size checksconst assert! layout checksShared ring ABItrusted-internalCompile-time ABI guard; not runtime input reachable.
capos-config/src/capset.rs const _: () = assert!(...) ABI size checksconst assert! layout checksShared CapSet ABItrusted-internalCompile-time ABI/page-fit guard; not runtime input reachable.
capos-lib/src/frame_bitmap.rs (alloc_frame and alloc_contiguous).try_into().unwrap() on 8-byte bitmap windowsFrame allocation, including work triggered by manifest/process creation and capability methodstrusted-internalGuarded by frame + 64 <= total or i + 64 <= to, assuming the caller-provided bitmap covers total_frames. Kernel constructs that bitmap at boot.

IPC

LocationSurfaceReachabilityClassificationNotes
kernel/src/cap/endpoint.rs Endpoint::endpoint_callpending receive pop on CALL deliveryCross-process CALL delivered to pending RECVuntrusted-input reachable, controlled errorThe former guarded pending_recvs.pop_front().unwrap() now returns a failed capnp error if the queue is inconsistent. Endpoint pending-RECV exhaustion has QEMU coverage in endpoint-roundtrip.
kernel/src/cap/endpoint.rs endpoint_restore_recv_frontrollback push_front growthIPC rollback pathuntrusted-input reachable, controlled errorCALL delivery reserves the popped pending-RECV slot until rollback restores the RECV or receiver completion releases the reservation, so concurrent receives cannot consume rollback capacity. Recovery helpers resolve the original endpoint object through revoked cap epochs and wrapper recovery methods bypass liveness checks, without reopening ordinary CALL/RECV/RETURN authority. If restore still fails after reaching the endpoint, the ring path posts or defers an explicit receiver cancellation instead of silently dropping the popped RECV. endpoint-roundtrip includes QEMU coverage for same-process CQ-pressure rollback with both available and saturated pending-RECV capacity, then consuming the restored undersized RECV through the controlled receiver-error path; capos-lib host coverage checks revoked-cap recovery lookup.

Scheduler And Process Lifecycle

LocationSurfaceReachabilityClassificationNotes
kernel/src/sched.rs register_idle_process_lockedProcess::new_idle().expect("failed to create idle process")Boot scheduler init (sched_init, slot 0) and lazy per-CPU registration (current_cpu_idle_thread_locked)boot-fatal at slot 0; per-CPU-fatal on first AP idleSynthetic idle Process creation OOM panics. There is no fallback idle path after the user-mode idle process removal, so this panic is the deliberate unrecoverable-OOM behavior.
kernel/src/sched.rs sched_initCPL0 idle kernel stack .expect, idle-context registry try_reserve_exact().expect, per-CPU CpuContext Box::try_new panic!Boot scheduler initboot-fatalCPL0 idle-context infrastructure OOM panics before services run. Same rationale as the synthetic idle records: no fallback idle path exists, so the failure is deliberately unrecoverable.
kernel/src/sched.rs block_current_on_cap_entercurrent.expect, idle assert!, process-table expectcap_enter(min_complete > 0) pathuntrusted-input reachable, internal invariantUserspace can request blocking, but these unwraps assert scheduler state, not user values. Future process lifecycle/spawn changes increase this exposure.
kernel/src/sched.rs capos_block_current_syscallcurrent.expect, idle assert!, table expect, panic! if not blockedBlocking syscall continuationuntrusted-input reachable, internal invariantTriggered after cap_enter chooses to block. User controls the request, but panic requires kernel state inconsistency.
kernel/src/sched.rs run_queue references missing process expect (context-switch + start paths)run-queue/process-table consistencyScheduling after queue selectiontrusted-internal now; future spawn/lifecycle sensitiveA stale run-queue PID panics. Dynamic spawn/exit must preserve run-queue/process-table invariants.
kernel/src/sched.rs exit_currentcurrent.expect, idle assert!, processes.remove(...).unwrap(), next-process unwrap()Ambient exit syscall and future process exituntrusted-input reachable, internal invariantAny service can exit itself. Panic requires scheduler corruption or idle misuse, but future spawn/process APIs should harden this boundary.
kernel/src/sched.rs current_ring_and_capscurrent.expect, process-table expectcap_enter flush pathuntrusted-input reachable, internal invariantUser can call cap_enter; panic requires no current process or missing table entry.
kernel/src/sched.rs startinitial run-queue expect, process-table unwrap, CR3 expectBoot service startboot-fatalManifest with zero services is rejected earlier, and process creation errors out; panics indicate scheduler/CR3 invariant breakage.
kernel/src/arch/x86_64/context.rs timer context restoreCR3 expect("invalid CR3 from scheduler")Timer interrupt schedulingtrusted-internal; future lifecycle sensitiveScheduler should only return page-aligned CR3s from AddressSpace.

Boot Platform And Memory Setup

LocationSurfaceReachabilityClassificationNotes
kernel/src/main.rs kmainassert!(BASE_REVISION.is_supported())Limine boot protocolboot-fatalPlatform/bootloader contract check.
kernel/src/main.rs kmainmemory-map and HHDM expectLimine boot protocolboot-fatalMissing bootloader responses halt before untrusted services.
kernel/src/main.rs kmaincap::init().expect("failed to initialize kernel capabilities")Kernel cap table bootstrapboot-fatalFails on kernel-internal cap-table exhaustion.
kernel/src/mem/frame.rs initframe-bitmap region expect("no region large enough for frame bitmap")Boot memory mapboot-fatalBad or too-small memory map halts.
kernel/src/mem/frame.rs free_frametry_free_frame(...).expect("free_frame failed")Kernel-owned frame teardowntrusted-internalCapability handlers use try_free_frame; this panic surface is for kernel-owned frames and rollback/Drop paths.
kernel/src/mem/frame.rs HHDM cache helperassert!(offset != 0, "frame allocator not initialized")HHDM cache use before frame inittrusted-internalInitialization-order invariant.
kernel/src/mem/heap.rs initalloc_contiguous(HEAP_FRAMES).expect("out of memory for heap")Boot heap initboot-fatalFails if the frame allocator cannot provide the fixed kernel heap.
kernel/src/mem/paging.rs alloc_page_table_frame / kernel_pml4_frame / assert!(addr != 0, "paging not initialized")page-alignment .unwrap() / paging initialized assert!Kernel frame/page-table internalstrusted-internalframe::alloc_frame returns page-aligned addresses.
kernel/src/mem/paging.rs init_kernel_page_tableskernel PML4 expect("failed to allocate kernel PML4"), page-lookup and map expectsKernel page-table setupboot-fatalAssumes kernel image is mapped in bootloader tables and enough frames exist.
kernel/src/arch/x86_64/syscall.rs initSTAR selector expect("invalid STAR segment configuration")Syscall initboot-fatalGDT selector layout invariant.
kernel/src/sched.rs context-switch / exit_current / startCR3 expect("invalid CR3")Context switch/exit/starttrusted-internal; future lifecycle sensitiveScheduler should only carry page-aligned address-space roots.

Audit Method

Candidate sites come from panic-token searches over runtime source plus manual review of nearby indexing and allocation paths on untrusted-input boundaries. The table excludes test-only assertions unless they enforce runtime ABI or layout contracts. Re-run the searches after code changes and classify new sites by reachability, not by token alone.

Search commands:

rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel capos-lib capos-config init demos tools schema system.cue Makefile docs -g '*.rs' -g '*.cue' -g '*.md' -g 'Makefile'
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel/src capos-lib/src capos-config/src init/src demos/capos-demo-support/src demos/*/src tools/mkmanifest/src -g '*.rs'

DMA Isolation Design

Security Verification Track S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.

Short-Term Decision

Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.

The first virtio-net smoke stays on this conservative path:

  • kernel-owned DMA pages
  • kernel-owned virtqueue descriptor tables
  • kernel-owned packet buffers
  • kernel-programmed physical addresses
  • copied packet bytes delivered to the network stack
  • no DMA buffer capability exposed to userspace
  • no physical address exposed to userspace
  • no virtqueue pointer exposed to userspace
  • no BAR mapping exposed to userspace

The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack.

This is deliberately conservative:

  • It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
  • It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
  • It does not make the current FrameAllocator or MemoryObject capability part of the DMA path. FrameAllocator no longer exposes raw physical addresses, but DMA still needs device-owned buffer objects with IOVA and reset/revoke semantics rather than repurposed general memory caps.
  • It gives the smoke a disposable implementation path. When NIC or block drivers move to userspace, bounce-buffer authority becomes a typed DMAPool object instead of an ad hoc physical-address grant.

An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.

IOMMU Staging

IOMMU support is a deferred-with-known-dependency prerequisite for production hardware claims and for moving direct DMA-capable NIC or block drivers into userspace. capOS now discovers bounded ACPI IOMMU table summaries for Intel DMAR and AMD-Vi/IVRS and records static DMAR DRHD include-all or single-hop PCI endpoint device-scope coverage for retained DMA-capable PCI diagnostics functions. Bridge and multi-hop scopes are retained for diagnostics but do not prove endpoint attachment until PCI topology traversal exists, and include-all fallback fails closed when retained DRHD units or scopes are capped.

The selected QEMU Intel remapping path now programs VT-d root/context and second-level tables for manager-owned DMAPool pages, reports bounded fault state, exports only domain-scoped IOVAs, and proves two claimed DMA-capable functions receive distinct per-device domains and second-level roots. It also asserts the production-path S.11.2 hostile-smoke matrix over the active DMAPool / DMABuffer ledger. The decomposed integration umbrella for this path closed 2026-05-23 23:35 UTC (ddf-iommu-remapping-production-closeout). This is still QEMU-only evidence for the selected path, not a general production hardware-isolation claim: trusted sharing groups, AMD-Vi programming, and production NIC/block userspace driver authority remain future work, and VM shapes without usable remapping hardware remain on the explicit bounce-buffer fallback.

The discovery parser is intentionally shallow and follows the static-table formats documented by the Intel VT-d architecture specification, the AMD IOMMU specification, and QEMU’s q35-only -device intel-iommu emulation:

Future real remapping work is grounded by the primary-source IOMMU remapping research note, which records Intel VT-d, AMD-Vi, and QEMU sections relevant to table programming, invalidation, fault/status diagnostics, and QEMU-only smoke tests. That note is source grounding only; it does not make the current diagnostics path a real remapping implementation.

The staged implementation order is:

  1. Discover firmware IOMMU topology from ACPI static tables and fail closed if the tables are malformed, unsupported, or inconsistent with the PCI root complex being used. This first bounded table-discovery step is implemented for DMAR/IVRS summaries only; domain attachment is still planned.
  2. Record each DMA-capable PCI function’s attachment to an IOMMU unit, or explicitly keep the function on the prototype bounce-buffer-required policy when no trusted IOMMU domain can be created. This reporting step is implemented for retained PCI diagnostics functions when DMAR DRHD include-all or single-hop PCI endpoint device-scope metadata proves PCI segment/BDF coverage. Bridge and multi-hop scopes are not treated as attachment proof until PCI topology traversal exists, and include-all fallback fails closed when retained DMAR coverage metadata is capped; trusted domain creation is still planned.
  3. Define and prove the claimed-device domain policy: one device-manager-owned DMA domain per claimed device or trusted sharing group, with all exported device addresses represented as IOVAs scoped to that domain rather than host physical addresses. The selected QEMU Intel path now implements the per-device form for two claimed DMA-capable functions; trusted sharing groups remain disabled and out of scope.
  4. Attach DMAPool allocation, descriptor validation, MMIO ownership, interrupt ownership, and revocation state to the same device-manager ledger before any doorbell write can make a descriptor visible to hardware.
  5. On revoke, reset, or driver death, stop new submissions, remove or invalidate IOMMU mappings before page reuse, and flush the relevant IOTLB state where the hardware model requires it.

Until those gates exist, direct DMA and userspace driver handoff remain blocked. Devices that cannot be placed in a trusted IOMMU domain must stay on kernel-owned bounce buffers or remain unsupported for production claims. This also affects the hostile-smoke gate: S.11.2 smokes must prove that stale DMA handles, stale completions, reset races, and teardown ordering fail closed for IOMMU-backed IOVA mappings, while the process-exit / exit-under-DMA rows remain covered by the selected backend evidence before a cloud or hardware driver can be treated as isolated from the rest of memory.

Fallback Policy For No Usable IOMMU Exposure

Some providers or VM shapes may not expose remapping hardware that capOS can trust. That includes absent, malformed, unsupported, capped, or incomplete DMAR/IVRS metadata; scopes that require PCI topology traversal capOS has not implemented yet; and platforms where remapping hardware is unavailable or cannot be programmed safely. Those shapes use a fail-closed fallback policy:

  • Direct device DMA remains blocked. direct_dma_trusted_domains stays zero and remapping_tables stays not-programmed.
  • Prototype devices that remain enabled use kernel-owned bounce buffers only. The kernel or device manager owns the pages, descriptor validation, physical-address programming, and packet or block-data copies between device-visible memory and non-device memory. General FrameAllocator and MemoryObject capabilities are not DMA authorities.
  • capOS does not expose direct hardware authority for userspace DMAPool, DMABuffer, DeviceMmio, or Interrupt in the fallback shape. Result-only .info skeletons and bounded manifest grants may report conservative status. The current DMAPool manifest grant may allocate and free eight fixed manager-attached, kernel-owned, single-page bounce-buffer DMABuffer result caps, with backing pages scrubbed before frame release and no host physical address or IOVA exposed. That narrow fixed-slot allocation/free authority does not map DMA, program device-visible addresses, publish arbitrary CQ entries, program IOMMU/remapping tables, access arbitrary BAR registers or doorbells, or own hardware interrupt acknowledgement, mask, or unmask. The selected provider-TX proof is the current bounded exception: after the same manager-owned DMABuffer authority and bounce-scrub gates, queue 1 may publish the full selected TX queue-depth descriptor/avail window into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor through the live no-write notify_mmio policy, and hand those bounded completions back through descriptor/generation-matched DMABuffer.completeDescriptor plus live tx_interrupt.wait completion events. The same selected path can also use tx_interrupt.mask/unmask to toggle only the selected TX MSI-X table vector-control bit and matching route state after live issue-id and route validation, and can retire one deferred LAPIC EOI for each delivered selected TX used-ring completion event, with Interrupt.acknowledge returning ABI-visible provider CQ/ack ledger fields plus hardware dispatch ack count, delta, token, and mutation flag for that bounded pairing. Full-queue QEMU bursts that coalesce selected TX MSI-X delivery use a bounded INT $vector proof hook only while the virtio TX completion path has an active full-window coalescing budget, so the selected IDT handler and deferred-EOI path remain observable without claiming full production IRQ ownership. Successful selected queue 1 DMABuffer.completeDescriptor, tx_interrupt.wait, and tx_interrupt.acknowledge results also carry bounded CQ event identity: sequence, queue, descriptor id, slot, slot generation, software descriptor generation, completion length, provider issue id, source id/generation, and route generation. Pre-event, duplicate ack, masked-route ack, wrong-order completion, teardown-drain, stale issue after release/regrant, reset, and stale-after-release paths keep that identity empty and do not mutate the bounded identity queue. Provider TX release also retires delivered but unacknowledged bounded CQ events for the live issue before clearing that issue: the stale post-release ack path is revoked, and the release proof records seven pending provider completion acks and their deferred EOIs as release-retired. The same selected path also has a bounded teardown-only drain for seven incomplete provider-published TX descriptors while one completed descriptor remains live: release may explicitly drain only the incomplete matching used-ring entries, retire those allocation-backed device-DMA TX queue ledgers, and free only after manager in-flight state is drained, without publishing provider CQ/IRQ events or issuing DMABuffer.completeDescriptor results. The paired provider RX bootstrap grant can now validate the live RX issue and selected virtio-net RX route before toggling only the selected RX MSI-X table vector-control bit and route state, and it can complete one selected-route RX Interrupt.wait after a delivered RX MSI-X/LAPIC dispatch. The paired Interrupt.acknowledge accounts exactly one RX dispatch token and retires one deferred LAPIC EOI for that delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release paths fail closed without mutating delivery or acknowledgement state. RX descriptor accounting and RX CQ ownership remain bounded to the synthetic proof path, and full hardware IRQ ownership remains blocked. These exceptions do not transfer full virtio-net ownership, direct DMA, IOMMU authority, arbitrary doorbells, production NIC/storage authority, or cloud readiness.
  • capOS does not claim hostile-hardware isolation for those shapes. A malicious or compromised bus-mastering device without a trusted remapping domain can still write arbitrary RAM at the hardware level. The fallback is acceptable only for prototype devices and trusted emulator or provider shapes where that hardware threat is outside the claim; otherwise the device remains unsupported.
  • Before any userspace driver path can rely on DMA or IRQ authority, S.11.2 hostile smokes must pass for the selected backend. That includes stale DMA handles, stale completions, descriptor abuse, revoke/reset races, stale IRQs, teardown-under-DMA for IOMMU-backed IOVA mappings, and exit-under-DMA for the fallback bounce-buffer path when the fallback is used.

This fallback policy is separate from current diagnostics-only IOMMU metadata coverage and from future real remapping-domain integration. Diagnostics can report static firmware-table coverage for a PCI function, but unless capOS creates a device-manager-owned remapping domain and programs mappings, the active direct-DMA policy remains blocked. Future real integration must attach DMAPool, DeviceMmio, Interrupt, ledger teardown, mapping removal or invalidation, and required IOTLB flushes to the same ownership transaction before a direct-DMA trusted-domain count can become nonzero.

DMA Assurance Model Checked Evidence And Cloud Backend Inputs

The DMA assurance model records the claim boundary and checked bounded evidence for DMA authority; the cloud backend contract it feeds is authoritative and lives in the “Cloud DMA Backend” section below: DMA Assurance Model and models/dma/. It is a design/evidence scaffold, not a new production hardware gate by itself. The checked gates are make model-dma-tla, make model-dma-alloy, make kani-dma-authority, and make model-dma-deferred-completion-loom; make dma-assurance-model-check aggregates them locally, while GitHub CI runs the Alloy/TLA+/Loom gates in dma-assurance-models and the Kani gate in kani-proofs. The operationalization track that reconciled the skeletons against landed DMA code is tracked in Security and Verification (“DMA Assurance Model Operationalization”).

Cloud NIC/storage work must use the model as the checklist for backend selection. Backend selection is a runtime, fail-closed decision the kernel makes on each boot, with an optional operator override declared in the system manifest; it is not a per-VM-shape safety assertion that a person signs off. The authoritative selection rule and the manifest override contract are defined in the “Cloud DMA Backend” section below.

Cloud backend evidence must separate provider-side DMA isolation from guest-controlled remapping authority. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support can identify a DMA-capable surface, but it is not enough to claim direct-DMA isolation. A direct-remapping backend needs guest-visible IOMMU or equivalent translation authority that capOS can discover and program. The cloud evidence matrix must record provider API or documentation sources, retrieval date, region or zone, instance type, image and kernel, live guest PCI/device probes, IOMMU table/group observations, and maintenance or device revocation behavior as the support-policy record for advertised targets. The runtime probe, not this matrix, makes the binding per-boot selection.

The matrix does not replace runtime selection. capOS must choose the safest backend on each boot from what it can actually observe and validate. Direct remapping is enabled only when guest-programmable remapping authority is present and passes the selected self-tests. A provider-remapped or bounce path is selected only when direct DMA remains blocked and device-visible memory can stay manager-owned. Ambiguous, contradictory, or unvalidated observations select Unsupported.

The backend candidates are:

  • Direct remapping domain. The provider shape must expose guest-programmable remapping hardware; capOS must discover and program a device-manager-owned domain for the target device; descriptor publication must be ordered after mapping; and teardown must remove mappings, observe required invalidation completion, and scrub before page reuse. The selected path must carry stale-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, no-host-physical-exposure, and cross-domain alias evidence.
  • Labeled bounce-buffer fallback. Direct DMA stays blocked, device-visible memory remains manager-owned bounce pages, host physical addresses and generic MemoryObject authority stay hidden from the driver, and stale handle/completion/teardown evidence covers the selected fallback. This path must keep hostile_hardware_isolation=not-claimed unless separate per-domain remapping evidence justifies a stronger provider-specific claim.
  • Unsupported. Devices whose DMA behavior cannot satisfy either candidate stay unbound or disabled. A serial boot result or PCI enumeration line is not enough to claim cloud NIC/storage readiness.

Downstream cloud driver preflights must declare the candidate backend and map their evidence to the assurance model’s invariants: no host-physical exposure, mapping before publication, no page reuse before teardown, stale-handle and stale-completion fail-closed behavior, domain-scoped aliasing only, bounded fail-closed holds, and explicit backend evidence. The evidence matrix is a support-policy record of advertised targets; the runtime probe, not the matrix, selects the backend on each boot.

Cloud DMA Backend

This section is the authoritative contract for how capOS selects a DMA backend for cloud NIC/storage devices. Selection is a runtime, fail-closed decision the kernel makes on each boot from what it can actually probe and validate, with an optional declarative override in the system manifest. There is no human sign-off in the selection path: the runtime probe decides by default, and the manifest override is config that an operator sets for a deployment, not a doc-signing ritual gated on any specific person. Downstream cloud NIC/storage driver slices consume this contract directly as their DMA-backend authority.

The preceding “DMA Assurance Model Checked Evidence And Cloud Backend Inputs” section defines the three backend candidates; this section adds the per-candidate trade-off analysis, the runtime selection rule, the manifest override field, and the downstream-contract scaffolding that a cloud NIC/storage driver declares. The research substrate is the provider evidence inventory Cloud DMA Provider Evidence Inventory, and the invariants and tool mapping are in DMA Assurance Model.

Provider-Written Addresses And No-IOMMU Brokered Bounce

Two DMA-address ownership models can be valid, but they do not apply to the same backend.

  • Provider-written, kernel-validated addresses (the NVMe Model B validator) are valid only when the provider’s device-visible address is not a host physical address: a verified direct-remapping/vIOMMU domain-scoped IOVA, or a future synthetic software address namespace that the manager translates before hardware sees it.
  • Brokered address publication is the no-IOMMU bounce-buffer model. The provider may own protocol state and buffer capabilities, but the kernel or device manager writes device-visible queue-base, PRP/SGL, or virtqueue address fields because those values are host physical or bus addresses on current no-IOMMU hardware.

Correction recorded 2026-05-27: the earlier reconciliation that treated a no-IOMMU bounce window as a provider-visible, non-host-physical device address space is not valid for the current implementation. On the run-pci-nvme no-IOMMU shape, DeviceDmaAllocation carries host physical pages and the reviewed IOVA export discipline keeps userspace IOVA/host-physical export disabled. Therefore a provider-written NVMe queue base or PRP on that gate would export a host physical address, violating the no-host-physical-exposure invariant. A bounce buffer protects data ownership and copy discipline; it does not create an untrusted-driver-safe IOVA namespace by itself.

The kernel on-notify DMA validator (kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan) remains useful evidence for the provider-written model. On a queue-arm/CC.EN write and on an SQ tail doorbell it scans the device-visible addresses the provider published (queue bases; PRP1/PRP2 and one level of PRP-list indirection) and fails closed before the doorbell takes effect on any address that is not wholly within a window granted to the doorbell claim’s owner at the live generation: out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deeper-than-one-level PRP chain, or stale generation (ScanReject). Owner identity and live generation come from the grant ledger, never from provider-supplied metadata. A completion whose submission scan was never validated, or was validated under a now-retired generation, does not wake a waiter (completion_wakes_waiter), matching the stale-completion gate on the virtio-net path. That mechanism is the right fit for the QEMU direct-remapping lane and any future cloud shape that exposes guest-programmable remapping.

For the current GCP/no-IOMMU target, the storage path must use brokered bounce: userspace supplies typed commands, queue ownership intent, and live DMABuffer/buffer-cap handles; the manager materializes the actual device-visible queue-base and PRP/SGL fields and orders publication, teardown, copy, and scrub. That still leaves protocol-specific NVMe logic in userspace, but it does not let userspace author raw device addresses.

The brokered admin-queue enable landed 2026-05-27 (nvme-no-iommu-brokered-controller-enable, device_manager::nvme_brokered_admin_queue_enable). The provider allocates the admin submission/completion queue pages through its DMAPool cap and requests enable through the CC selected-write claim (CC.EN set); the manager resolves those pages from the live ledger (proof_buffers slots 0/1), validates the authored bases through validate_doorbell_scan (ScanKind::QueueArm), and authors AQA/ASQ/ACQ plus the CC.EN write itself. The provider never receives the host physical / device-visible queue-base address; CSTS.RDY=1 is observed only through brokered reads. This is the brokered model applied to the admin queue-arm; the steady-state SQ-tail doorbell over provider-written PRPs still needs the direct-remapping/synthetic-address lane above. Proof make run-pci-nvme; provenance docs/devices/nvme.md §6.

Provider-Side Isolation Versus Guest-Programmable Remapping

The decisive distinction for backend selection is between a DMA-capable surface and guest-programmable remapping authority:

  • SR-IOV (AWS ENA, Azure Accelerated Networking VF), a virtual NIC (gVNIC, virtio-net), a GPU/accelerator, or local NVMe identifies a device that does or could bus-master. This is a DMA-capable surface, not a safety property.
  • A direct-remapping classification requires a usable Intel VT-d, AMD-Vi, or Arm SMMU unit that the guest can discover, program, and validate, with translation/fault/invalidation behavior matching IOMMU Remapping Grounding. A DMA-capable surface alone never implies this.
  • Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data-path bypass, hypervisor-side IOMMU) are support-policy assumptions a guest cannot prove from inside, not evidence that capOS can safely program direct DMA.

Runtime probing is authoritative for selecting the safe backend on a particular boot: capOS chooses from the device inventory, the remapping authority it can actually program, driver self-tests, and fail-closed probe results, and unknown or contradictory observations select the labeled bounce-buffer path or Unsupported. The cloud VM evidence matrix is the separate support-policy record for advertised targets and provider assumptions a guest cannot fully prove by itself; it does not override the boot-time runtime selection.

Candidate Trade-Off Analysis

DimensionDirect remapping domainLabeled bounce-buffer fallbackUnsupported
IOMMU coverage requirementRequires a guest-programmable VT-d/AMD-Vi/SMMU unit capOS can program and validate per device.None: used precisely when no usable guest IOMMU is exposed.N/A: device stays unbound.
Cloud VM shape coverage (per inventory)No probed GCE shape exposes a guest-programmable IOMMU; AWS/Azure shapes not yet probed. So no probed shape currently qualifies.Indicated for shapes with a DMA-capable surface but no guest IOMMU (the probed GCE rows); fail-closed default for unproven shapes with a manager-ownable surface.Ambiguous, contradictory, or unvalidated observations.
Per-operation costTranslation only; no data copy. IOTLB/context-cache invalidation on teardown.Copy between device-visible bounce pages and non-device memory on every transfer; Confidential VMs force this in hardware regardless.None.
Hostile-smoke coverage todayBounded QEMU Intel path only (make run-iommu-remapping, ddf-iommu-remapping-production-closeout); no cloud guest-IOMMU evidence.S.11.2.7/8/9 rows enforced by the make run-net gate (tools/qemu-net-smoke.sh), with bounce-buffer virtio-net provider evidence in ddf-provider-virtio-net-driver-closeout; bounce-buffer DMAPool lifecycle by make run-dmapool-grant. The GCP-shape local binding precursor (cloud-gcp-virtio-net-local-qemu-binding) asserts, in both make run-net and make run-ddf-provider-consumer, that the enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net device surface (standard virtio-net, vendor 0x1af4) and that the resolved backend is this labeled bounce-buffer path; it does not claim live GCP enumeration.N/A.
Hostile-hardware isolation claimClaimable only with per-domain remapping evidence and the IOMMU hostile smokes; not yet established for any cloud shape.not-claimed: a malicious bus-mastering device without a trusted remapping domain can still write arbitrary RAM.N/A.

The GCE live-probe rows in the evidence inventory record that every probed GCE shape (1st-gen n1, 2nd-gen e2, 3rd-gen Intel c3, and AMD-SEV Confidential n2d) boots with intel_iommu=off, DMAR: IOMMU disabled, SWIOTLB software bounce buffering, empty /sys/kernel/iommu_groups, and no DMAR/IVRS/IORT table; the Confidential shape forces bounce buffering as a memory-encryption invariant. These rows are support-policy expectations, not a hardcoded selection table: they describe what capOS’s runtime probe should expect to find on those shapes today, and they confirm that the fail-closed default lands on the labeled bounce-buffer path there. The boot-time probe, not this matrix, makes the binding selection on each boot, so a shape whose IOMMU exposure changes is handled by the probe re-evaluating rather than by editing this text. AWS and Azure shapes carry no live-probe evidence yet; the probe treats them the same as any other unproven platform and defaults to the bounce-buffer path.

Runtime Selection Rule (Fail-Closed Default)

On each boot, capOS probes the platform for guest-programmable remapping authority — IOMMU presence, programmability, and coverage for the DMA-capable functions it intends to bind — and selects the backend fail-closed:

  1. Probe the platform. Discover DMA-capable functions, then test whether a usable Intel VT-d, AMD-Vi, or Arm SMMU unit is present, discoverable, and programmable, and whether its translation/fault/invalidation behavior passes the self-tests in IOMMU Remapping Grounding.
  2. Select fail-closed. Select the direct remapping domain backend for a device only when the probe positively verifies a usable+safe IOMMU for that device. If the probe cannot verify it — IOMMU absent, not programmable, coverage unproven, self-test failed, or observations ambiguous — select the labeled bounce-buffer fallback for any DMA-capable surface the manager can keep manager-owned, or Unsupported when even that cannot hold. This probe-gated rule is the default: on an unproven platform the probe cannot verify, so direct DMA is not used and the bounce-buffer path is chosen.
  3. There is no human in this loop. The machine decides per boot; the only external authority is the optional manifest override below.

This is the boot-time authority. The cloud VM evidence matrix above is the support-policy expectation of what the probe should find, not the decision itself.

Manifest Override Field (Operator Authority Lever)

An operator can override the runtime default for a deployment through one declarative, auditable enum field in the system manifest’s kernel parameters: the dmaBackendPolicy field of the SystemConfig struct in schema/capos.capnp. It is config, not a doc-signing ritual, and is not gated on any specific person. The field is absent by default, and the absent default applies the probe-gated runtime selection rule above. The enum values and their interaction with the probe result are:

ValueProbe verifies usable+safe IOMMUProbe cannot verifyNotes
(field absent)direct remapping domainbounce-buffer fallbackDefault: the probe-gated runtime selection rule. Direct-DMA when the probe verifies a usable+safe IOMMU, bounce-buffer fallback otherwise (fail-closed). Identical to enable-if-verified.
enable-if-verifieddirect remapping domainbounce-buffer fallbackThe explicit, auditable form of the default. Probe-gated direct-DMA with fail-closed bounce-buffer fallback. Redundant with the absent default but kept for explicit configuration.
enable-unsafedirect remapping domaindirect remapping domainForce direct-DMA even when the probe cannot verify it. The operator takes responsibility for the platform’s DMA isolation; the value name carries the warning. Use only on a platform whose isolation is known-good out of band.
bounce-bufferbounce-buffer fallbackbounce-buffer fallbackPin the labeled bounce-buffer path and disable direct-DMA entirely, even where the probe would verify a usable IOMMU. The most conservative value.

Selection rules that hold for every value:

  • The absent default and enable-if-verified select direct DMA only when the probe verifies a usable+safe IOMMU, and otherwise fall back to bounce-buffer.
  • enable-unsafe is the sole value that can pick direct DMA without probe verification. The value name is the acknowledgement; there is no separate per-shape “I-accept-unverified” ceremony.
  • bounce-buffer never selects direct DMA, even where the probe would verify a usable IOMMU.
  • When the selected backend is Unsupported for a device (no manager-ownable DMA-capable surface at all), the device stays unbound regardless of the override value. The override governs direct-vs-bounce, not whether an unbindable device is forced online.

This selection mechanism is implemented. The dmaBackendPolicy capnp enum encodes an absent field as ordinal 0 (unspecified), which decodes identically to enable-if-verified; an unrecognized ordinal decodes fail-closed to the bounce-buffer path (never direct DMA) rather than failing the manifest parse or honoring the probe-gated default. The kernel resolves the backend on each boot from the IOMMU probe verdict and this override and emits a boot proof line of the form dma: backend selection dma_backend=<direct-remapping|bounce-buffer> dma_backend_override=<absent|enable-if-verified|enable-unsafe|bounce-buffer> probe_verified_usable_iommu=<bool>. The bounded QEMU shapes prove the probe-gated default end-to-end: make run-iommu-remapping (verifiable Intel VT-d shape) records dma_backend=direct-remapping dma_backend_override=absent, and make run-dmapool-grant (no usable IOMMU) records dma_backend=bounce-buffer dma_backend_override=absent. The override values and the unknown-ordinal fail-closed decode are covered by cargo test-config over the shared selection rule, capnp round-trip, and CUE decode.

The production cloud (non-qemu) build emits the same backend selection on the cloudboot harness’s serial-port-1 path as cloudboot-evidence: dma-backend bounce_buffer (in the harness token namespace), and on the bounce-buffer verdict drives the production-path bounce-buffer DMAPool + DMABuffer grant proof (kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs, called from kernel::run_init): stage a parked manager-attached DMAPool over one DMA-capable PCI function from the inventory through device_manager::stage_bounce_buffer_dmapool_record (kernel/src/device_manager/stub.rs), allocate one bounded bounce-buffer DMABuffer through device_manager::issue_manager_attached_dmabuffer_handle_with_request (which calls device_dma::allocate_manager_attached_dmapool_bounce_buffer_page), assert the cap-info labels (userspace_dmapool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, real_dma=not-attempted, direct_dma=blocked, host_physical_user_visible=0, iova_export=disabled-future-only), quiesce-before-release (release_dmapool_record_for_cap_release returns pending-buffer-release while the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place between scrub and frame-free), and stale-handle-after-detach, then emit cloudboot-evidence: dma-pool-grant <token> with shape <seg>.<bus>.<dev>.<fn>-pool.<slot>.gen.<gen>-phys.<hex> (every character inside the harness grammar [A-Za-z0-9._-]+). The proof stages a pool, allocates one bounce-buffer page, asserts the invariants, and emits a marker: it does not program real DMA, attach a queue, program interrupts, claim a device for sustained ownership beyond the grant, or emit provider-nic-bound / storage-bound. A future direct-remapping verdict skips the proof rather than aliasing direct-DMA onto the bounce-buffer assertion shape.

Implementation note, 2026-05-29 11:50 UTC: the production-cloud bounce-buffer stub now implements the cap-side DMABuffer.map(R+W) / DMABuffer.unmap admission and state-machine entry points (validate_dmabuffer_map_admission, record_dmabuffer_user_mapping, begin_dmabuffer_user_mapping_unmap, restore_dmabuffer_user_mapping, clear_dmabuffer_user_mapping in kernel/src/device_manager/stub.rs) so a userspace holder of a manager-issued DMABuffer cap can map the single bounce-buffer page read/write, write payload bytes through that mapping, unmap it, and observe DMAPool.info.mapped_vmas reflect the live mapping (make run-cloud-dmapool-grant). The same Absent -> Mapped -> Unmapping -> Absent state machine the QEMU side enforces governs the parked slot: a duplicate map fails closed with the DmaPoolLive shape, a clear before the cap-side aspace unmap returns DmaPoolTeardownEvidenceInvalid, an identity-mismatched begin/clear fails closed, and a post-freeBuffer map fails closed with the DmaBufferStaleHandle shape. The manager remains the single owner of the bounce-buffer page’s device-visible host-physical address and IOVA: the mapping does not expose either to userspace, real DMA stays not-attempted, direct DMA stays blocked, and IOVA export stays disabled-future-only. This is a local-QEMU proof of the userspace mapping path only; it does not unlock a live cloud NIC bind, IOMMU programming, or production direct-DMA authority.

Implementation note, 2026-06-03: Phase C slice 2 (cloud_virtio_net_userspace_ownable_vring_proof, make run-cloud-prod-nic-driver-userspace-ownable-vring) wires this landed bounce-buffer authority to a userspace virtio-net driver’s own vring without adding a new isolation backend. The driver allocates its descriptor / available / used ring pages through a granted DMAPool, and the kernel programs the virtio queue-address registers (queue_desc / queue_driver / queue_device) with the manager-owned bounce host-physical address. The brokered-address- publication model (above) holds: the driver never authors a device address. It writes an opaque per-buffer device-usable handle (exported through DMABuffer.info.deviceIova with scope bounce-handle, a deterministic non-address encoding of the buffer’s manager identity), and the kernel resolves that handle against the live grant ledger (device_manager::stub::resolve_virtio_vring_device_address) to the real host-physical address before any MMIO write. The no-host-physical-exposure invariant is preserved end-to-end: host_physical_user_visible=0, iova_export=disabled-future-only, and reads of the queue-address base registers are refused so the resolved address is never read back into userspace. Out-of-grant, host-physical-looking, and stale-generation handle writes fail closed, mirroring the NVMe doorbell-scan reject classes. queue_enable / DRIVER_OK stay fail-closed (slice 3); this is a local-QEMU proof of the userspace-ownable-vring path only, not a live cloud NIC bind.

Downstream-Contract Scaffolding

A cloud NIC/storage driver declares its chosen backend through the device-manager policy fields that already label the local paths in this design. The contract below scaffolds the values each candidate requires; it is the shape a driver preflight declares, not an authorization to enable any value.

Policy fieldDirect remapping domainLabeled bounce-buffer fallbackUnsupported
direct_dmaenabled for the programmed per-device domainblockedblocked
trusted_domainmanager-owned domain id for the devicenonenone
bounce_buffernot required (mapped IOVA path)requiredN/A (device unbound)
remapping_tablesprogrammednot-programmednot-programmed
hostile_hardware_isolationclaimed only with per-domain remapping evidence + IOMMU hostile smokesnot-claimedN/A
exported_device_addressesiova-only, domain-labelednone (no host physical or IOVA exposed)none

Each candidate must satisfy the following gates, mapped to the assurance-model invariants in DMA Assurance Model:

  • Direct remapping domain — mapping before publication; no page reuse before teardown, with mapping removal and observed invalidation completion ordered strictly before scrub/reuse; stale-handle and stale-completion fail-closed; domain-scoped aliasing only (per-device context entries where a peer domain cannot resolve another domain’s IOVA); no host-physical exposure (export only the domain-scoped IOVA, labeled as meaningless outside the domain); backend evidence explicit. The only landed remapping-domain evidence is the bounded QEMU Intel path in ddf-iommu-remapping-production-closeout, exercised by make run-iommu-remapping. A cloud shape needs its own guest-programmable remapping evidence before this candidate applies to it.
  • Labeled bounce-buffer fallbackdirect_dma=blocked; all device-visible memory is manager-owned bounce pages; no host physical address and no generic MemoryObject/FrameAllocator authority is exposed to the driver; stale-handle, stale-completion, and exit-under-DMA teardown fail closed; hostile_hardware_isolation=not-claimed stays explicit. The landed evidence is the S.11.2.7/8/9 hostile-smoke rows enforced by the make run-net gate (tools/qemu-net-smoke.sh), with the bounce-buffer virtio-net provider evidence in ddf-provider-virtio-net-driver-closeout, plus the bounce-buffer DMAPool grant lifecycle in make run-dmapool-grant. As of ddf-real-dma-s112-hostile-smokes, the S.11.2.7 stale-IRQ-after-reset and S.11.2.8 stale-DMA-completion-after-reset closure summaries are asserted on the dedicated make run-dmapool-grant gate as well, so the DMA-grant gate fails closed on a hostile-row regression without depending on make run-net; exit-under-DMA (in-flight drain, scrub, page free) is enforced by make run-dmapool-grant-exit. See the “Fallback Policy For No Usable IOMMU Exposure” and “Hostile-Smoke Acceptance Matrix” sections in this document for the full gate text.
  • Unsupported — the device stays unbound or disabled; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made. A serial boot line or PCI enumeration line is not readiness evidence.

The models/dma/ TLA+ and Alloy files, the extracted Kani core, and the focused Loom model are checked bounded evidence for these invariants. They supplement the QEMU/host evidence above; they do not satisfy a candidate hardware or cloud backend gate by the mere presence of model files. Each checked result records its tool version, command, bounds, and output in ../models/dma/README.md, and the CI placement is tracked in Security and Verification.

First QEMU Intel Remapping Smoke Acceptance Gate

Decision recorded 2026-05-14 09:07 UTC. The first slice that programs real Intel VT-d remapping state under QEMU has an explicit, bounded acceptance gate. This unblocks the ddf-iommu-qemu-intel-remapping-smoke task, whose ## Acceptance section carries the full gate; the summary here is the design-level decision.

The first slice programs the minimum Intel VT-d path for exactly one selected DMA-capable function under a pinned QEMU q35 -device intel-iommu,aw-bits=39 shape: one root entry, one context entry bound to a device-manager-owned domain ID and a 39-bit second-level page-table root, and a single second-level mapping from one device-visible IOVA page to one kernel-owned DMAPool page, plus the root-table-address register write and the global-command/global-status translation-enable handshake. Acceptance requires observable proof that the mapped IOVA was translated and that an out-of-domain IOVA faults closed in the fault-status/fault-recording registers.

Invalidation is part of the gate, not a follow-up. On revoke, device reset, driver death, and DMAPool page release, the slice must remove the second-level mapping and invalidate the relevant context-cache and IOTLB state through the selected invalidation interface, observe invalidation completion, and order page scrub/reuse strictly after that completion. This sits at the existing “remove IOMMU mappings” and “scrub and free DMA pages” steps of the DeviceOwnerState revocation order and must not be reachable before QueuesQuiesced or a completed Resetting transition.

IOVA export discipline: host physical addresses stay hidden from userspace in every result cap, diagnostic line, and audit record. The selected QEMU Intel IOMMU-backed DMABuffer.info path may export only the domain-scoped IOVA for the live mapped buffer generation, explicitly labeled as meaningless outside that domain; fallback bounce-buffer paths keep IOVA export disabled.

Per-device domain granularity: the selected QEMU Intel path programs two distinct per-device context entries and second-level roots for two claimed DMA-capable functions under the same DRHD. Both domains may use the same IOVA, but the peer domain’s second-level walk is proven not to resolve that IOVA to the primary page; stale and wrong-owner domain assignment fail closed, and trusted multi-device sharing groups remain disabled.

The kernel-owned bounce-buffer fallback stays the path for VM shapes without usable remapping hardware and must remain explicitly labeled (fallback_policy=kernel-owned-bounce-buffer-only, remapping_tables=not-programmed, hostile_hardware_isolation=not-claimed); it must never be silently reinterpreted as direct-DMA or hostile-hardware isolation. The IOMMU-backed path adds stale-DMA-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, cross-domain stale-handle, and fail-closed teardown branch hostile smokes while the existing bounce-buffer exit-under-DMA and stale-DMA evidence (the device-dma stale-handle and stale-completion proofs and the S.11.2.7/S.11.2.8 closure summaries) is preserved unchanged.

Explicitly out of scope for this first slice: AMD-Vi table programming; trusted multi-device sharing groups; scalable-mode context entries; interrupt remapping and device-IOTLB options; 48-bit IOVA space / 4-level tables; production NIC or storage driver ownership; userspace DMAPool direct-DMA authority; and moving the live virtio-net path off bounce buffers. The acceptance evidence is QEMU-only emulator evidence, not a hardware-isolation claim. The smoke adds or selects a focused make run-iommu-remapping gate asserted by tools/qemu-iommu-remapping-smoke.sh.

Implementation note, 2026-05-02 04:58 UTC: the ACPI discovery path recognizes DMAR and IVRS in the root table walk, reports absent/valid/malformed/unsupported state, records bounded table length/header facts, DMAR host address width and flags, IVRS IVinfo/flags, and bounded remapping-structure type counts. Malformed DMAR or IVRS structure lengths stop parsing, and unsupported shapes such as parser scan-cap overflow leave direct_dma=blocked with bounce_buffer=required.

Implementation note, 2026-05-02 05:31 UTC: the attachment-policy slice also retains DMAR DRHD include-all and bounded PCI endpoint device-scope metadata, including segment, single-hop BDF, and remapping-hardware register base, and reports each retained DMA-capable PCI function as IOMMU-attached/covered when that static table metadata covers its segment/BDF. Bridge and multi-hop scopes are diagnostic-only until PCI topology traversal can resolve them, and include-all fallback fails closed when retained DRHD units or scopes are capped. Functions without trusted static coverage are reported as uncovered; covered functions are reported as attached/covered, but both paths keep dma_policy=prototype-bounce-buffer-only, bounce_buffer_required, and blocked_direct_dma_devices because remapping domains are unsupported. The direct-DMA trusted-domain count remains zero, and userspace DMAPool, DeviceMmio, and Interrupt authority remain unavailable.

Implementation note, 2026-05-02 07:27 UTC: the domain-policy staging slice adds a pci: dma-domain policy proof line and a diagnostics mirror. The proof reports the future domain owner as the device manager, the domain granularity as per-device or trusted-sharing-group, exported device addresses as IOVA-only, host_physical_user_visible=0, direct_dma_trusted_domains=0, claimed_device_domains_ready=0, remapping_tables=not-programmed, remapping_domains=not-started, userspace DMAPool/DeviceMmio/Interrupt as not-started, and prototype devices as kernel-owned bounce-buffer-only. Malformed, unsupported, absent, or retained-capped metadata leaves direct DMA blocked; proof_result=ok is only evidence for that conservative blocked-direct-DMA policy.

Implementation note, 2026-05-09 18:47 UTC: the blocked-direct-DMA admission decision now lives in the pure capos-lib::device_authority validator next to the DMA/MMIO/IRQ handle validators. Host tests cover the current all-prototype bounce-buffer shape, fail-closed results if any direct trusted domain is claimed before the policy is ready, fail-closed results if the prototype bounce-buffer count does not cover every DMA-capable function, and the absent, malformed, unsupported, and retained-capped metadata labels. The kernel PCI proof line and diagnostics mirrors consume that pure decision while preserving the existing direct_dma=blocked, remapping_tables=not-programmed, domain_activation=not-started, and policy=blocked-direct-dma labels. This is IOMMU/remapping groundwork only; it does not program remapping tables, create trusted domains, expose host physical addresses, or enable production userspace DMA authority.

Implementation note, 2026-05-02 15:29 UTC: the COM1 devices diagnostics command now prints the same bounded DMA-domain policy facts without naming an owner identity. The line explains that all current DMA-capable prototype functions remain on direct_dma=blocked, bounce_buffer=required, direct_dma_trusted_domains=0, claimed_device_domains_ready=0, remapping_tables=not-programmed, exported_device_addresses=iova-only, host_physical_user_visible=0, and prototype_devices=kernel-owned-bounce-buffer-only. This is a diagnostics mirror for the current conservative policy, not evidence that IOMMU remapping domains or userspace DMA authority exist.

Implementation note, 2026-05-02 15:45 UTC: attached device-manager DMAPool records now store the current explicit bounce-buffer policy and the QEMU device-manager proofs read it back through the active device record plus the matching DmaPoolHandle. The logged policy scope is device-manager-attached-dmapool-bounce-buffer-policy, with direct_dma=blocked, bounce_buffer=required, trusted_domain=none, remapping_tables=not-programmed, remapping_domain=not-started, userspace_dmapool=not-started, host_physical_user_visible=0, and policy_bound_to_manager=true. This binds the conservative policy to current manager state; it still does not program remapping domains, expose userspace DMAPool, or perform real DMA mapping teardown.

Implementation note, 2026-05-11 00:00 UTC: attached device-manager DMAPool policy records now also carry an explicit manager-owned remapping-domain ledger staging record. The lifecycle and imported-live proofs report remapping_domain_ledger_scope=device-manager-attached-dmapool-remapping-domain-ledger, static_iommu_coverage=acpi-pci-diagnostic-only, remapping_domain_owner=device-manager, remapping_domain_granularity=per-device-or-trusted-sharing-group, remapping_domain_ledger=manager-owned-staging-record, remapping_domain_ready=false, and iova_export=disabled-future-only, while preserving direct_dma=blocked, remapping_tables=not-programmed, and host_physical_user_visible=0. This is a software ledger/readiness record only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, expose host physical addresses or IOVAs, or claim production hostile-hardware DMA isolation.

Implementation note, 2026-05-11 17:07 UTC: the same manager-owned remapping-domain staging record is now an explicit activation gate tied to the active DMAPool record and matching handle. The device manager validates that gate before current DMAPool policy/accounting and buffer issue paths proceed. The gate reports domain_ownership=manager-owned-active-dmapool, but keeps direct_dma=blocked because remapping_table_programming=not-programmed, iova_export=disabled-future-only, remapping_invalidation_policy=not-installed, remapping_iotlb_flush_policy=not-installed, and remapping_stale_mapping_cleanup=not-installed; the selected fallback remains remapping_fallback_policy=kernel-owned-bounce-buffer-only. The activation result is blocked-remapping-prerequisites-missing with remapping_activation_gate=fail-closed, remapping_activation_blocker=remapping-tables-not-programmed, and remapping_activation_side_effect=side-effect-blocked. This is a software policy gate and proof surface only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, export IOVAs or host physical addresses, remove real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile stale-DMA behavior.

Implementation note, 2026-05-12 18:49 UTC: the manager-owned staging record now includes a concrete per-device remapping-domain identity for the active DMAPool handle: claimed-device domain identity, staged single-device sharing group, BDF-derived device id, pool slot, pool generation, and owner generation. The activation preflight treats that identity binding as a prerequisite before direct DMA could be considered, and the QEMU lifecycle/imported-live proofs emit a dmapool remapping domain identity proof line. The direct-DMA blocker remains unchanged: remapping tables are still not-programmed, IOVA export is still disabled-future-only, invalidation, IOTLB flush, and stale mapping cleanup are still not-installed, direct DMA remains blocked, and the fallback remains kernel-owned-bounce-buffer-only.

Implementation note, 2026-05-12 21:19 UTC: the same manager-owned remapping-domain ledger now carries a separate mapping-lifecycle preflight record. The record is bound to the active DMAPool handle and claimed-device domain identity, and the existing device-manager policy gate validates it before accepting the current bounce-buffer attach/accounting/buffer-issue paths. Its direct-DMA result remains fail-closed with explicit blockers: IOVA space, mapping install, removal before page reuse, invalidation policy, IOTLB flush policy, and stale mapping cleanup are all not-installed. This is still an in-repo software preflight only; it does not program remapping tables, expose IOVAs or host physical addresses, enable direct DMA, remove real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile stale-DMA behavior.

Implementation note, 2026-05-12 22:26 UTC: capOS now has an Intel/QEMU remapping table scaffold that can represent a DRHD identity field, PCI segment/BDF/source ID, domain ID, QEMU Intel address-width choice, disabled root/context entries, and a second-level page-table-root placeholder. PCI diagnostics can bind that scaffold to discovered DRHD/segment metadata when ACPI/PCI discovery provides it. The disabled backend registry’s only accepted active state is disabled. The proof labels distinguish representability from programming: root-table pointer, context-entry programming, invalidation registers, fault registers, protected-memory registers, and invalidation queue remain not-written; remapping tables remain not-programmed; hardware programming remains not-attempted; direct DMA remains blocked; IOVA export remains disabled-future-only; and host physical addresses remain hidden from userspace. This is still not Intel VT-d, AMD-Vi, or QEMU IOMMU programming.

Implementation note, 2026-05-23 18:06 UTC: the first production DMAPool ledger integration for the QEMU Intel remapping path now maps the selected virtio-rng request-buffer IOVA to an active manager-owned DMABuffer page. Mapping install is admitted through the matching active DmaPoolHandle and DmaBufferHandle generations, stale pool and buffer generations fail closed, and wrong-owner mapping attempts are side-effect blocked. On teardown, the target second-level leaf is removed, the context-cache and IOTLB completion polls finish, and only then is the DMABuffer released through the production device-manager ledger. The proof keeps the IOVA internal, keeps host_physical_user_visible=0, keeps userspace IOVA export disabled for this slice, and leaves the no-remapping fallback policy as kernel-owned-bounce-buffer-only.

Implementation note, 2026-05-23 19:18 UTC: QEMU Intel remapping fault reporting now decodes VT-d FSTS plus FRCD[0] into a bounded kernel record for the faulting IOVA, reason, requester source ID, and DMA read/write type. The unmapped-IOVA, stale-handle, and stale-completion proofs record the fault, clear it with write-1-to-clear semantics, verify the clear-after-record state, and report source/IOVA match status without exposing host physical addresses. The COM1 devices diagnostics path now prints an IOMMU fault summary that reserves fault_summary=clean for a successful clear fault-status read and labels unavailable fault-status reads as unavailable/fail-closed. Owner identity, DRHD register bases, and host physical addresses stay hidden. The optional audit route is explicitly not wired for this slice (volatile_audit=not-routed).

Implementation note, 2026-05-13 15:29 UTC: active manager-owned DMAPool remapping preflight records now consume the same retained DMAR DRHD/requester metadata when PCI coverage is complete. The active disabled Intel/QEMU scaffold records the retained DRHD identity and requester segment/BDF/source ID for the bound pool handle, but absent, malformed, capped, unsupported, or uncovered metadata still leaves the scaffold not-bound and disabled. This is metadata-only binding: capOS still does not program root/context tables, install or remove mappings, invalidate remapping caches, flush IOTLB state, export IOVAs, enable direct DMA, or expose host physical addresses.

Implementation note, 2026-05-12 23:07 UTC: PCI diagnostics now include a separate Intel/QEMU remapping MMIO-status proof for the selected DMA-capable function. When complete retained DMAR DRHD metadata covers that function and the register base is page-aligned, capOS maps only the selected remapping-register page for bounded volatile diagnostic reads of the version, capability, global-status, root-table-address, and fault-status registers. The mapped label describes the diagnostic access pattern, not page-table write protection. When the default diagnostics shape has no DRHD, metadata is capped, or the retained DRHD base is invalid, the same proof reports mmio_window=not-mapped, mmio_read=not-attempted, unavailable capability/status/fault reads, and a fail-closed reason. The labels preserve remapping_tables=not-programmed, direct_dma=blocked, fallback_policy=kernel-owned-bounce-buffer-only, and hostile_hardware_isolation=not-claimed. This is not remapping-domain activation: capOS still does not write VT-d, AMD-Vi, or QEMU remapping registers, install root/context tables or invalidation queues, export IOVAs, or claim hostile-hardware DMA isolation.

Implementation note, 2026-05-13 01:20 UTC: active manager-owned DMAPool records now also carry a generic disabled IOVA ledger under the remapping-domain record. The ledger binds a domain-scoped reservation identity, hidden internal range metadata, and reservation generation to the active owner and pool generation. The same device-manager accounting path validates that ledger before current bounce-buffer allocation, descriptor submission, completion accounting, buffer free/page release, and pool release checks proceed. Pure tests and QEMU proof labels reject stale reservation generations and wrong owner generations as disabled-iova-ledger-stale with side effects blocked. The active state remains disabled: proof output keeps the internal range hidden and reports iova_base_user_visible=0, host_physical_user_visible=0, iova_export=disabled-future-only, direct_dma=blocked, mapping_install=not-installed, mapping_remove_before_page_reuse=not-installed, invalidation_policy=not-installed, iotlb_flush_policy=not-installed, and stale_mapping_cleanup=not-installed. This still does not program remapping tables, export IOVAs, expose host physical addresses, install or remove real IOMMU mappings, flush IOTLB state, or claim hostile-hardware isolation.

Implementation note, 2026-05-11 18:44 UTC: the selected userspace virtio-net TX provider smoke now grants a runtime-visible DeviceMmio notify BAR cap named notify_mmio, but keeps the active DMA posture unchanged. The provider still uses manager-owned bounce buffers, direct_dma=blocked, and host_physical_user_visible=false; the notify cap is a no-write MMIO admission boundary over the selected virtio-net TX notify offset, not a direct DMA or descriptor-ring ownership transfer. The selected submit path validates descriptor authority and scrubs the bounce page before consuming the grant-derived notify policy, and it proves wrong value, wrong offset, stale handle, and stale generation block before any doorbell. This does not program IOMMU/remapping tables, export IOVA or host physical addresses, mutate the real virtio-net descriptor ring, or claim production NIC isolation.

Implementation note, 2026-05-11 19:05 UTC: the notify_mmio grant remains runtime-visible but is now explicitly no-direct-MMIO as well as no-write. DeviceMmio.map and DeviceMmio.read32 for that provider notify cap return typed blocked results before any user VMA or register read, DeviceMmio.info validates the live mapping generation before accepting the provider-notify record, and notify_mmio detach clears the submit-path notify policy so a later selected submit reports stale-handle blocking with no doorbell write. The submit path also invalidates the cached notify policy on owner-generation transitions and stale/missing cap-release detach boundaries before accepted no-write authority can be reported.

Implementation note, 2026-05-11 19:56 UTC: the same provider smoke now grants a runtime-visible Interrupt cap named tx_interrupt for the selected virtio-net TX MSI-X route snapshot. This extends the grantable authority boundary without changing the DMA posture: the provider still uses manager-owned bounce buffers, direct DMA and IOMMU remapping remain blocked, and the interrupt cap only proves generation-checked admission for info/ack/unmask/wait/mask plus waiter cancellation and stale-after-release blocking. Later bounded follow-ups add selected TX MSI-X vector-control mask/unmask; this first grant slice did not deliver provider IRQs to userspace, acknowledge or mask/unmask hardware, ring a doorbell, or mutate real virtio-net descriptor rings.

Implementation note, 2026-05-11 20:09 UTC: provider TX waiters are now separate no-delivery waiter-table entries rather than generic delivery waiters. A pending tx_interrupt.wait remains pending across TX route delivery-count advancement and only completes through the explicit mask/release cancellation path. The staged provider TX grant source also tracks a live-issued cap and refuses another live tx_interrupt alias for the same route snapshot.

Implementation note, 2026-05-12 09:13 UTC: the selected userspace virtio-net TX provider path now performs one bounded real descriptor/avail publication on queue 1 while keeping the DMA posture conservative. The descriptor points at the manager-owned bounce buffer already governed by the DMABuffer record; the submit path validates live buffer identity, scrubs before publication, requires the live no-write notify_mmio policy, asserts that the descriptor page is ledgered to the virtio-net TX queue, and blocks wrong queue, stale notify policy, and a real stale DMABuffer.submitDescriptor(queue=1) attempt at the stale capability/liveness boundary before touching the real ring. The proof logs descriptor/avail/used ring physical addresses for kernel evidence, but those addresses are not returned to userspace. Because this slice does not claim real used-ring/CQ completion, the published page remains pinned in the manager in-flight record; userspace completeDescriptor, freeBuffer, post-publication remap, and cap-release drain do not retire, remap, or release it. A follow-up rings one selected TX virtqueue notify doorbell after the same descriptor authority, submit-scrub, live notify_mmio policy, submit-effect write, and publication gates, while wrong-queue and stale-notify or stale-DMABuffer paths remain not-written. Readback-mismatch publication failures do not write the immediate doorbell and are treated as possibly-published ring state that quiesces later TX notification and keeps the manager buffer pinned rather than claiming rollback. Pre-publication bounce-page metadata remains doorbell_write=not-written. Any immediate used-ring or IRQ effect from that doorbell is recorded only as an out-of-scope hardware side effect. Later 2026-05-12 follow-ups advanced the selected path to a bounded used-ring completion handoff: DMABuffer.completeDescriptor validates the live manager-attached buffer and in-flight descriptor id, observes the real TX used ring for the stored software descriptor generation, consumes that entry through the existing descriptor tracker and DMA ledger, and only then clears the manager in-flight record. As of commit e248d42b (2026-05-23 13:36 UTC), kernel TX helpers stay quiesced after provider ownership starts while the provider path can publish the full selected TX queue-depth window of eight descriptors before the first completion; the smoke records live_inflight_after_submits=1/2/3/4/5/6/7/8 (the ninth allocation rejected dmapool-budget-exceeded), blocks map/free/reuse while any buffer is in flight, and proves wrong-order descriptor 7 used-ring handling preserves the observed descriptor 0 completion for its matching generation. The provider-facing tx_interrupt waiter is a runtime-visible completion-event consumer for the same selected route; delivery validates the expected TX source id, source generation, route generation, owner, driver-unmasked state, and live issue id before completing each bounded event. A 2026-05-13 follow-up adds the bounded incomplete-descriptor teardown drain: when one descriptor has completed and seven remain incomplete, release retires only the incomplete descriptors’ allocation-backed TX DMA ledgers and clears only the selected virtqueue descriptor/used-ring tracking needed for those releases, while CQ publication and provider IRQ delivery stay blocked and the pending waiter remains undelivered until the smoke explicitly cancels it through the existing mask/cancel path. Commit e248d42b (2026-05-23 13:36 UTC) extends that drain to the full selected TX queue-depth window and keeps the completed descriptor’s buffer retained until it is explicitly freed. A later 2026-05-13 remediation binds each provider TX in-flight descriptor to the submission-time provider issue/source/route generation. If that old descriptor completes after tx_interrupt release/regrant, DMABuffer.completeDescriptor fails closed as dmabuffer-provider-tx-stale-issue before consuming the used ring, publishing provider CQ/IRQ state, or advancing provider acknowledgements; later cap release may still drain the descriptor as teardown-only. tx_interrupt.wait posting is serialized with provider release, mask, and delivery, and stale issue ids fail closed at admission and insertion. A later 2026-05-13 follow-up lets tx_interrupt.acknowledge account exactly one already observed selected TX dispatch token paired with one delivered provider CQ event; the smoke proves pre-event, duplicate, teardown-drain, masked-route, reset/regrant, stale-after-release, and stale issue acknowledgements fail closed before delivery-count, route-state, CQ, ack-ledger, or hardware-dispatch-ack mutation. This is still bounded selected-route evidence: provider IRQ ownership, deferred EOI, LAPIC/MSI-X acknowledgement, direct DMA, IOMMU mapping, full virtio-net ownership, production NIC/storage migration, and cloud readiness remain open. Commit e248d42b (2026-05-23 13:36 UTC) adds release-time retirement for delivered but unacknowledged bounded provider TX CQ events: the release proof now records seven pending provider completion acks retired from the ledger in one live issue while preserving the separate stale-bound in-flight descriptor proof, with stale post-release ack revoked and no hardware ack claimed. A 2026-05-13 follow-up adds bounded selected TX MSI-X mask/unmask only: live provider tx_interrupt.mask and unmask toggle the selected TX vector-control bit plus route state, preserve generations and delivery counts, and block stale issues before side effects.

Implementation note, 2026-05-02 19:43 UTC: the bounded zero-live device-manager DMAPool lifecycle proof now treats its manager-attached DMA buffer record as teardown-blocking metadata. The pool detach path still checks authoritative live accounting first, then rejects zero-live detach while the proof buffer is attached as dmapool-buffer-attached. Before the active free path, the proof validates stale same-slot and wrong-identity FreeBuffer operations through capos-lib::device_authority::validate_dma_buffer_operation. The wrong identity cases cover wrong owner generation, wrong pool slot, wrong pool generation, and wrong buffer slot; each records dmabuffer-stale-handle, the exact validator reason (stale-owner-generation, wrong-pool, stale-pool-generation, or wrong-slot), side-effect-blocked, and a preserved manager-owned buffer record. The stale same-slot case continues to record stale-slot-generation and buffer_stale_free_preserved=true, then the proof observes that pool detach still fails as dmapool-buffer-attached. The proof clears the gate only after validating a proof-scoped active FreeBuffer operation, scrubbing and freeing the kernel-owned proof frame, and detaching the manager-owned buffer record. This remains lifecycle evidence only: no userspace DMAPool or DMA-buffer authority is exposed, no physical address or IOVA is exposed, and S.11.2 hostile stale-DMA smokes remain open.

Implementation note, 2026-05-03 02:31 UTC: the same zero-live device-manager DMAPool lifecycle proof now validates manager-record CompleteDescriptor authority for the attached DmaBufferHandle. Active completion validation records buffer_active_complete_result=ok; freed-buffer, reused-slot generation, and stale-after-revoke completion attempts fail closed as dmabuffer-stale-handle with exact pure-validator reasons and side-effect-blocked. This is manager-record validation evidence only: it does not complete a real descriptor, publish a completion queue entry, grant a userspace DMABuffer, run real DMA, or clean up or reuse production userspace DMA pages.

Implementation note, 2026-05-14 14:05 UTC (DDF IOMMU remapping Slice A1): the first slice that programs real Intel VT-d remapping state under QEMU has landed. The ## First QEMU Intel Remapping Smoke Acceptance Gate above defines the full bounded gate; that gate is being delivered as a sequenced A1/A2/B/C split (the slice was correctly scoped as bigger than one reviewable unit). This note records Slice A1.

Pinned QEMU shape: qemu-system-x86_64 8.2.2, -machine q35, -device intel-iommu,aw-bits=39 (3-level second-level page tables, 39-bit IOVA space). The kernel iommu module (kernel/src/iommu.rs, cfg(qemu)-only) selects one DMA-capable function that is not the live virtio-net bounce-buffer path (virtio-rng under the default smoke shape), allocates a root table, one context table, a 3-level second-level page table, and one mapped DMA page; encodes and writes one root entry, one context entry (binding the requester source id to a domain id, aw-bits=39, and the second-level root), and the second-level table-pointer / leaf entries through the HHDM; writes the Root Table Address Register; and runs the global-command/global-status SRTP-then-TE handshake, polling the status register for each step. The capability-register extended-capability IRO field (IOTLB register offset) is decoded and reported for Slice B’s benefit. MMIO ordering invariants are enforced with SeqCst fences: between the last in-memory table-entry write and the RTAR write, between the RTAR write and GCMD.SRTP, and between the latched root pointer and GCMD.TE.

Slice A1 proves translation with kernel-side structural evidence only, which the gate’s IOVA-export-discipline clause explicitly permits. Hardware confirms translation-enabled (GSTS.TES + GSTS.RTPS polled set), the written entry words are read-back-verified through the HHDM, the pure capos_lib::device_authority validator accepts the layout, and the unmapped IOVA’s 3-level walk structurally terminates at a non-present entry. The proof labels are scrupulously honest: mapped_iova_translated=structural (not hardware-dma), unmapped_iova_fault=structural-not-present (not observed), proof_evidence=kernel-side-structural. A real hardware-DMA translation and a real fault-status fault require driving a device virtqueue through the IOMMU; that is deferred to follow-on task A2 (a virtio-rng virtqueue driver as the DMA proof vehicle). Invalidation + IOTLB flush with completion polling (Slice B) and the IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C) are also follow-on slices; at A1 their proof lines emit proof_result=deferred-next-slice. A2, B, and C have since landed — see the implementation notes below.

The table pages are recorded in a bounded ledger modeled on the device-manager DMAPool page-accounting discipline (allocate-record, scrub-before-free on the fail-closed path); mapping removal with IOTLB-flush-ordered scrub/free is Slice B. IOVA export stays disabled (iova_export=disabled-this-slice), no host physical address is user-visible (host_physical_user_visible=0), and no hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed). The kernel-owned bounce-buffer fallback is unchanged for QEMU shapes without usable intel-iommu hardware and is emitted with the explicit fallback_policy=kernel-owned-bounce-buffer-only / remapping_tables=not-programmed labels. The new make run-iommu-remapping gate is asserted by tools/qemu-iommu-remapping-smoke.sh; make run-net, make run-dmapool-grant, and make run-diagnostics continue to prove the fallback path unchanged.

Implementation note, 2026-05-14 15:19 UTC (DDF IOMMU remapping Slice A2): the device-DMA proof vehicle has landed, upgrading the A1 structural proof to a real hardware-DMA proof and closing the literal hardware-DMA text of gate part

  1. After the VT-d tables are programmed and GCMD.TE is set, a minimal virtio-rng virtqueue driver — split into a mapped-DMA phase (crate::virtio::prove_iommu_rng_mapped_dma) and an unmapped-DMA phase (crate::virtio::prove_iommu_rng_unmapped_dma) — drives the device QEMU exposes on the intel-iommu shape. The second-level table now installs four leaf entries inside one shared L1 page: the request buffer plus the three virtqueue ring pages (descriptor table, available ring, used ring). The driver programs the device’s QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE registers and the request descriptor’s addr field with the programmed IOVAs, never the host-physical page addresses. Because VT-d translation is global per requester once GCMD.TE is set, every DMA the device issues — every ring access and the entropy write — must walk the second-level table. A used-ring completion plus a non-zero buffer reading therefore proves a real hardware DMA reached the kernel page through the programmed IOVA translation: mapped_iova_translated=hardware-dma, proof_evidence=virtio-rng-hardware-dma.

The driver then publishes a second descriptor whose addr is the deliberately-unmapped IOVA and kicks the device. The device’s DMA to that IOVA raises a real VT-d translation fault; the kernel reads it back out of the Fault Status Register (FSTS.PPF), the first Fault Recording Register’s fault bit (FRCD[0].F at the decoded CAP.FRO offset), and the faulting page address recorded in FRCD[0] — which must equal the unmapped IOVA the device was pointed at — and reports unmapped_iova_fault=observed with the fault_recording_reason code. The fault gate is purely the VT-d register surface: whether QEMU’s virtio-rng still pushes the faulting descriptor onto the used ring afterward (it does, with the entropy write dropped) is QEMU device behavior, reported as the unmapped_descriptor_uncompleted diagnostic field but deliberately not a gate condition. The fault registers are cleared (write-1-to-clear) before the device DMA and again after the observed-fault read so no stale fault is mistaken for the proof and no fault is left for a later VT-d consumer. The MMIO discipline reuses A1’s NO_CACHE mapping and the descriptor/available-ring writes are SeqCst-fenced before the notify doorbell. The two phases are deliberately split so the kernel reads the fault registers strictly between them — the unmapped-IOVA descriptor is never in flight while the mapped-DMA result is judged. The virtio-rng device negotiates VIRTIO_F_ACCESS_PLATFORM, which is what makes QEMU route its DMA through the platform IOMMU rather than treating the ring registers as host-physical addresses; the run-iommu-remapping make target therefore creates the virtio-rng device with iommu_platform=on (a target-scoped override of the shared QEMU_SECOND_DEVICE, which no other run target needs because none of them drives virtio-rng DMA). The IOMMU-backed hostile smokes (Slice C) were a follow-on at A2 (proof_result=deferred-next-slice) and have since landed — see the Slice C implementation note below; IOVA export stays disabled and no host-physical address is user-visible.

Implementation note, 2026-05-14 17:19 UTC (DDF IOMMU remapping Slice B): the invalidation + IOTLB flush + invalidation-ordered scrub/free has landed, closing gate part 2. After the A2 hardware-DMA proof, kernel/src/iommu.rs runs a revocation cycle (run_invalidation_revocation_cycle) that models the device-manager DeviceOwnerState revocation FSM at the QueuesQuiesced -> Resetting -> DmaMappingsRemoved -> Dead steps. The cycle removes the four second-level leaf entries the A2 layout installed (request buffer + the three virtqueue ring pages), SeqCst-fences so the in-memory removal is visible to the IOMMU before the flush, then issues two register-based invalidations: a context-cache invalidation through the Context Command register (CCMD_REG at offset 0x28, CCMD.ICC set with CCMD.CIRG global granularity, polling CCMD.ICC clear for completion) and a domain-selective IOTLB invalidation through the IOTLB register at the A1-decoded CAP.IRO offset + 8 (IOTLB.IVT set with IOTLB.IIRG domain-selective granularity and the domain id in IOTLB.DID, polling IOTLB.IVT clear and reading IOTLB.IAIG back non-zero to confirm the request was serviced). Both completion polls are bounded by the same VTD_STATUS_POLL_BUDGET the A1 status handshakes use.

The hard ordering invariant — the whole point of the slice — is that the eight VT-d ledger-owned table/ring/used pages and the separate production DMAPool-owned request-buffer page are scrubbed and returned to their ledgers strictly after both completion polls return. A SeqCst fence sits between the completion reads and the scrub/free so the ordering is explicit in program order. A poll that exhausts its bounded budget fails closed: invalidation_completed is false, the pages are deliberately not freed (a page reused while hardware may still hold a cached translation through it is a stale-DMA hole), the ledgers keep them accounted rather than leaked-and-forgotten, and the proof line reports proof_result=fail-closed. Slice B uses register-based invalidation only: no GCMD.QIE queued-invalidation bit is set, so the A1 single-bit-GCMD discipline (correct only by minimalism — no other persistent GCMD bit set) still holds and the GCMD-reconstruct boundary is not crossed. The production DMAPool programming-abort path follows the same rule: if VT-d programming fails before root/translation state can expose the DMAPool page to hardware, the prepared DMABuffer/DMAPool records are detached; if the mapping may already be hardware-visible, the partial VT-d ledger is carried through the same leaf-removal and invalidation teardown before any VT-d table/ring page or production DMAPool page can be reused. The make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now assert the invalidation proof line as proof_result=ok with mapping_removed=true context_cache_invalidated=true iotlb_flushed=true iotlb_actual_granularity_nonzero=true invalidation_completed=true page_reuse_ordered_after_invalidation=true table_pages_live_after=0 invalidation_interface=register-based-ccmd+iotlb, and forbid both a regression to the Slice-A deferred label and any invalidation_interface=queued value. The IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C) were the deferred follow-on; they have since landed (see the note below), and as part of that work the single-phase run_invalidation_revocation_cycle was refactored into a two-phase run_target_revocation_phase + complete_revocation_teardown so the hostile re-drive can sit between the phases — the Slice B contract (every page freed strictly after its mappings are invalidated) is unchanged, and the combined invalidation proof line still asserts proof_result=ok for the complete teardown. IOVA export stays disabled and no host-physical address is user-visible.

Implementation note, 2026-05-14 19:13 UTC (DDF IOMMU remapping Slice C): the IOMMU-backed hostile stale-DMA smokes have landed, closing gate part 5 and the parent IOMMU remapping task. Closing the slice required refactoring the Slice B revocation into two phases so the hostile re-drive can run against a partially revoked remapping — the original single-phase cycle freed every page (including the virtio-rng descriptor table and available ring) before any hostile re-drive could observe a fault, so the re-driven device read an all-zero descriptor and issued no DMA at all. The kernel/src/iommu.rs ledger now classes each page by revocation-phase role: Target (request buffer + used ring — what the device’s DMA lands on), RingInfra (descriptor table + available ring — what the device reads to issue a DMA), and Table (root/context/second-level tables). run_target_revocation_phase removes the Target second-level leaves, invalidates the context-cache + IOTLB, and frees the Target pages — while deliberately keeping the RingInfra + Table pages mapped and live. run_hostile_stale_dma_cycle then re-drives the same still-live old-generation virtio-rng device through the new crate::virtio::prove_iommu_rng_stale_dma (each re-drive uses a fresh available-ring index past the A2 phases so the device sees a genuinely new descriptor): because the ring-infra pages are still mapped, the device reads a valid descriptor whose addr is a revoked target IOVA, so the DMA faults in the IOMMU (FSTS.PPF + FRCD[0].F, recorded faulting address is the stale IOVA) instead of reaching memory. A stale mapping-install attempt is refused (attempt_stale_mapping_install — the RevokedRemapping token is a dead-domain receipt with no live table handle, not install authority), and the freed Target page reads back as the scrubbed zeros. A second re-drive at the revoked used-ring IOVA faults too, publishes no device-written used-ring CQ entry into the freed page, exposes no memory to a would-be new owner, and makes no freed page eligible for reuse. complete_revocation_teardown then finishes the Slice B teardown by revoking + freeing the RingInfra + Table pages; the combined gate-part-2 invalidation proof line (invalidation_phases=target-then-ringinfra) still asserts proof_result=ok for the full two-phase teardown, with the same hard ordering invariant in each phase (pages freed strictly after that phase’s invalidation completion polls return; a poll that exhausts its budget fails closed and the phase’s pages are not freed). The load-bearing observation is the revoked translation state — not device cooperation, not a software ledger drop — blocking the stale DMA, confirmed by the VT-d fault registers plus a freed-page-stays-scrubbed read-back through the HHDM. Existing bounce-buffer stale-DMA evidence (the device-dma S.11.2.7/S.11.2.8 proofs) is preserved unchanged; the IOMMU-backed hostile smokes are strictly additive. The make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now assert both hostile proof lines as proof_result=ok and forbid regression to the deferred, not-reached, or fail-closed labels. IOVA export stays disabled (iova_export=disabled-this-slice), no host-physical address is user-visible (host_physical_user_visible=0), and no hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed).

Implementation note, 2026-05-23 (domain-scoped IOVA export): the selected QEMU Intel production DMAPool path now exposes the mapped request-buffer IOVA through DMABuffer.info only while the matching active DmaBufferHandle generation is live. The schema fields are deviceIova, deviceIovaScope=domain-scoped-iova, deviceIovaMeaning=meaningless-outside-domain, and iovaExport=domain-scoped-only; the production remapping proof asserts that deviceIova=0x200000 matches the installed second-level mapping. After the buffer is freed and the pool is released, an export attempt on the same handle fails closed with side-effect-blocked. The bounce-buffer grant path still reports deviceIova=0, deviceIovaScope=none, deviceIovaMeaning=not-exported, and iovaExport=disabled-future-only.

Implementation note, 2026-05-23 21:34 UTC (production-path hostile smokes): the selected QEMU Intel path now emits and asserts iommu-remapping: production dmapool hostile proof over the active manager-owned DMAPool / DMABuffer ledger. The proof ties the raw VT-d stale-handle and stale-completion faults to the production mapped IOVA, synthetic stale pool/buffer generation mismatch candidates, post-teardown stale-handle export failure, and per-device cross-domain boundary. It covers stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, and cross-domain stale-handle attempts; no second-level entry is installed for stale authority, no CQ entry is published, no new-owner memory is exposed, and page reuse stays ordered after invalidation completion. It does not claim a process-exit trigger for the IOMMU path; the existing make run-net bounce-buffer evidence remains the exit-under-DMA source. The same smoke asserts the complete_iommu_dmapool_mapping_teardown prerequisite-false return and the hold_iommu_dmapool_mapping_ledger_after_abort path as fail-closed branch evidence. Existing bounce-buffer S.11.2 evidence from make run-net and make run-dmapool-grant is preserved unchanged.

Implementation note, 2026-05-26 05:55 UTC (direct-DMA posture transition for the selected QEMU Intel path): the closeout slices above landed the full mechanism — real hardware DMA over a manager-owned DMAPool DMABuffer page mapped through the per-device IOMMU domain, domain-scoped IOVA export, per-device domains, and the production hostile matrix — but deliberately deferred the headline direct_dma=enabled claim behind iova_export=disabled-this-slice. The selected QEMU Intel path now emits iommu-remapping: direct-dma posture real_dma=attempted direct_dma=enabled remapping_tables=programmed trusted_domain=<domain-id> descriptors_reference=domain-scoped-iova mapped_page_source=manager-owned-dmabuffer mapping_installed_before_doorbell=true invalidated_before_page_reuse=true bounce_buffer=not-required exported_device_addresses=iova-only host_physical_user_visible=0 hostile_hardware_isolation=not-claimed proof_result=ok. Every field is computed from the real proof facts, not asserted as a constant: remapping_tables=programmed requires the root/context/second-level entries written plus the SRTP/TES handshakes; real_dma=attempted requires the virtio-rng device’s mapped DMA to have completed through the programmed IOVA (hardware_dma_translation_proven); direct_dma=enabled additionally requires the manager-owned DMAPool mapping to have been installed before the device doorbell; and invalidated_before_page_reuse folds in the two-phase revocation’s page_reuse_ordered_after_invalidation and invalidation_completed results. This is bounded QEMU-emulator evidence, so hostile_hardware_isolation stays not-claimed (real hostile-hardware isolation needs real hardware, not QEMU). The no-IOMMU run-net / run-dmapool-grant bounce-buffer fallback is untouched: it keeps direct_dma=blocked with no IOVA export, and make run-iommu-remapping now forbids this path from regressing to the bounce-buffer fallback proof or to a blocked/not-attempted posture. The contract table in “Downstream-Contract Scaffolding” (direct-remapping domain: direct_dma=enabled, remapping_tables=programmed, exported_device_addresses=iova-only) is now backed by an emitted, asserted posture on the selected path.

Authority Model

Device authority is split into three independent capabilities:

  • DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.
  • DeviceMmio: authority to map and access one device’s register windows.
  • Interrupt: authority to wait for and acknowledge one interrupt source.

Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.

Production Handle Epoch Invariants

All three object families use opaque handles whose identity is checked against kernel-owned records before every operation. A raw object id is never enough to authorize DMA, MMIO, interrupt waits, acknowledgements, descriptor submission, or teardown. A handle is accepted only when all of these facts match in the same ownership transaction:

  • the object id resolves to a live record of the expected type;
  • the handle’s device owner generation matches the current device-manager owner record;
  • the handle’s pool, mapping, slot, source, or route generation matches the current reusable subrecord;
  • the record state permits the requested operation.

The exact ABI shape may change when the capability surface is implemented, but production handles must carry the equivalent identity:

#![allow(unused)]
fn main() {
struct DmaPoolHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
}

struct DmaBufferHandle {
    device_id: u32,
    owner_generation: u64,
    pool_id: u32,
    pool_generation: u64,
    slot: u32,
    slot_generation: u64,
}

struct DeviceMmioHandle {
    device_id: u32,
    owner_generation: u64,
    bar: u8,
    mapping_id: u32,
    mapping_generation: u64,
}

struct InterruptHandle {
    device_id: u32,
    owner_generation: u64,
    source_id: u32,
    source_generation: u64,
    route_generation: u64,
}
}

Object identity fields have distinct jobs:

  • DMAPool handles name the claimed device, the device owner generation, and the pool record generation. Buffer handles issued by the pool repeat the device-owner and pool identity and additionally name a buffer slot and slot generation. The pool identity prevents a handle from crossing devices or owners; the slot identity prevents a freed or reused buffer slot from accepting an old descriptor, free, or completion.
  • DeviceMmio handles name the claimed device, owner generation, BAR or subrange mapping record, and mapping generation. The physical range, cache attributes, and access policy remain in the kernel record and are not user-editable handle fields.
  • Interrupt handles name the claimed device, owner generation, source record, source generation, and route generation. Waiter records may carry their own waiter generation internally, but they must be invalidated whenever the source or route generation changes.

Owner generations and subrecord generations are intentionally separate. The device owner generation belongs to the device-manager ownership record and invalidates every DMAPool, DeviceMmio, and Interrupt handle for the old owner when ownership is revoked, transferred, reset, or reassigned. Pool, buffer-slot, MMIO-mapping, interrupt-source, and route generations belong to records that may be reused below the device owner. They prevent stale buffer, mapping, route, waiter, and completion handles from matching a newly allocated subrecord even when the device id or pool id is reused.

Every epoch is non-wrapping for authority purposes. Implementations must use an epoch width that cannot wrap during the object’s lifetime, or permanently retire the exhausted device, pool, slot, mapping, source, or route record. Epoch exhaustion is a closed allocation or reassignment failure; it must never wrap back to a value that could match an old handle.

Generation mismatch, wrong object type, wrong device owner, freed slot, detached source, revoked mapping, and wrong device-owner state are hard closed results. The failed operation must not program a descriptor, ring a doorbell, perform an MMIO write, unmask or acknowledge an interrupt, wake a waiter, publish a CQE, decrement completion accounting, free a page for reuse, or mutate the device ledger except for bounded failure accounting or audit metadata.

Transfer, revoke, reset, and reassignment are ordered around those epoch checks:

  • Transfer: The old owner leaves Active, the owner generation advances, and old handles become invalid before a new owner receives handles. A transfer may preserve hardware state only after old interrupt notifications, MMIO write authority, and DMA submissions are either quiesced or represented by old-generation ledger entries that the new owner cannot consume.
  • Revoke: The device manager invalidates user-visible handles first, then follows the revocation order below: MMIO write authority removed, interrupts masked or detached, queues quiesced or reset, mappings removed, and pages scrubbed before release.
  • Reset: Reset or disable advances the owner generation and any affected source, route, pool, mapping, and buffer-slot generations before new handles can be issued. If old DMA writes cannot be proven stopped, buffer slots stay unavailable until reset completion and mapping invalidation prove reuse is safe.
  • Reassignment: Interrupt sources, MMIO mappings, and DMA pool records are detached or unmapped, their subrecord generations advance, pending waiters or completions are drained or marked stale, and only then can a new owner receive authority for the reused source, mapping, or slot.

Handle reuse rules:

  • stale handles fail closed;
  • freed-handle reuse fails closed;
  • reallocated slots must not restore authority to old handles;
  • old interrupt waiters must not observe or acknowledge a new owner’s interrupt source;
  • old DMA handles must not reference a newly allocated buffer in the same slot.

Production proof obligations are split between host tests and QEMU smokes. Host tests must cover the pure validator and state-machine cases: stale owner generation, stale pool or mapping generation, stale buffer slot, stale interrupt source or route, wrong owner, wrong device, wrong object type, freed object, wrong state, epoch exhaustion/retirement, and no side effects on failure. QEMU smokes must prove the hardware-facing ordering: stale DMA handles after free/reuse cannot submit descriptors, stale DMA completions after revoke/reset cannot publish CQEs or mutate reused buffers, stale MMIO handles cannot ring doorbells after revoke, stale interrupt waiters or acknowledgements cannot wake or affect a new owner, and process-exit or driver-crash teardown reaches a zero-live ledger before pages are reused. These production handles and proofs remain open; the current QEMU scratch proofs are prerequisite evidence for this contract, not completion of it.

Implementation note, 2026-05-02 13:18 UTC: capos-lib::device_authority now implements the bounded pure host-testable validator prerequisite for these handle epoch invariants. The module models the documented handle and record identity fields for DMAPool, DMA buffer, DeviceMmio, and Interrupt, separates device-owner generations from pool, slot, mapping, source, and route generations, returns explicit fail-closed error labels, blocks the relevant side-effect class on validation failure, and refuses epoch wrap or retired epoch reuse. This does not expose production userspace handles, wire kernel device paths, attach budget/OOM policy to real handle creation, or complete the QEMU stale-handle or S.11.2 hostile-smoke gates.

Implementation note, 2026-05-03: the pure host-test operation matrix now enumerates every current validator operation variant: DMAPool::{AllocateBuffer,IssueBufferHandle}, DMABuffer::{SubmitDescriptor,CompleteDescriptor,FreeBuffer}, DeviceMmio::{Map,Read,Write,RingDoorbell,Unmap}, and Interrupt::{Wait,Acknowledge,Mask,Unmask}. Each row asserts active acceptance plus stale owner/subrecord, freed, revoked, and retired failures with the exact blocked side-effect class for that operation. This remains ABI-independent host-test evidence only; it does not create production userspace handles or replace the QEMU stale-handle and S.11.2 hostile-smoke gates.

Implementation note, 2026-05-02 13:43 UTC: the current kernel device-manager DMAPool lifecycle and imported-live accounting proofs now adapt their BDF, owner generation, pool slot, and pool generation into capos-lib::device_authority records. The QEMU proof records active validator success, stale-after-revoke failure as dmapool-stale-handle, the validator reason stale-owner-generation, and side-effect-blocked. This is still a bounded kernel-proof adapter, not production userspace handle exposure, DeviceMmio/Interrupt handle wiring, production page cleanup, or S.11.2 hostile smoke completion.

Implementation note, 2026-05-02 17:04 UTC: the current kernel device-manager also has a bounded manager-owned DeviceMmio record proof adapter. The record carries BAR, mapping id, mapping generation, and owner generation fields, and the QEMU virtio-rng device-manager path validates a RingDoorbell operation through capos-lib::device_authority. After revoke begins, the old handle fails through the pure validator as stale-owner-generation, records devicemmio-stale-handle plus side-effect-blocked, and no doorbell write is attempted. The lifecycle proof blocks RevokingHandles -> MmioRevoked while the record is attached, then allows the transition after bounded detach. This does not expose production userspace DeviceMmio authority, program real BAR mappings, create mapping objects, or complete hostile stale-MMIO smokes.

Implementation note, 2026-05-02 17:29 UTC: the bounded DeviceMmio adapter now derives the proof mapping from the first decoded PCI memory BAR on the tested PciDevice through the shared BAR-region validator. The attached manager-owned record carries that BDF/BAR/base/length metadata and validates that it is the same BDF, a memory BAR, nonzero length, and the same BAR named by the handle before constructing the pure capos-lib::device_authority record. The QEMU smoke asserts region_source=pci-decoded-memory-bar, region_bound_to_manager=true, bar_present=true, bar_memory=true, bar_base, and bar_length. This is still prerequisite evidence only: it does not create userspace DeviceMmio handles, program real MMIO mappings, enforce cache attributes or write policy, or write a real doorbell.

Implementation note, 2026-05-02 17:54 UTC: the same bounded DeviceMmio adapter now records fail-closed malformed-region evidence before the positive attach. Wrong-BDF metadata, wrong BAR/handle mismatch, and zero-length region metadata all report devicemmio-region-invalid, with region_invalid_mapping=not-created and region_negative_side_effect=side-effect-blocked; the proof still records real_mmio_mapping=not-programmed and real_doorbell=not-written. This is bounded manager-proof evidence only. It does not create userspace DeviceMmio handles, map real BAR pages, enforce cache attributes or write policy, complete hostile stale-MMIO smokes, or perform a real doorbell write.

Implementation note, 2026-05-02 20:14 UTC: the bounded DeviceMmio adapter now stores future mapping policy metadata on the attached manager-owned record and reads it back through the active record plus matching DeviceMmioHandle. The proof line records policy_scope=manager-attached-devicemmio-cache-write-policy, cache_policy=device-uncacheable, page_table_protection=capability-scoped-device-nx, write_policy=claimed-registers-and-doorbells-only, executable=blocked, userspace_devicemmio=not-started, host_physical_user_visible=0, policy_bound_to_manager=true, and policy_result=ok. A tampered cache/write policy record fails closed with policy_tamper_result=fail-closed, policy_tamper_mapping=not-created, and policy_tamper_side_effect=side-effect-blocked. This is still metadata proof only: no PAT/MTRR or page-table programming is performed, no userspace DeviceMmio handle is created, no real BAR mapping object exists, and no doorbell is written.

Implementation note, 2026-05-02 20:45 UTC: while the same bounded manager-owned DeviceMmio record is still active, the proof now validates hostile RingDoorbell handles through a proof-scoped adapter that uses the already-attached record rather than manager lookup short-circuits. Wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong BDF/device fail closed with exact pure-validator reasons stale-owner-generation, stale-mapping-generation, wrong-mapping, wrong-bar, and wrong-device. Each records side-effect-blocked; the proof also records that the attached manager record is preserved, no fake mapping is created, and no doorbell is written. This remains bounded proof evidence only: production userspace handles, real MMIO mappings, real cache/write-policy enforcement, and hostile stale-MMIO smokes remain open.

Implementation note, 2026-05-03 00:36 UTC: the schema and kernel now include a result-only DeviceMmio.info skeleton that can wrap a manager-issued DeviceMmioHandle. The object validates the live device-manager record through validate_devicemmio_record() before returning status labels such as userspaceDeviceMmio=manager-issued-skeleton, managerRecord=validated-active, realMmioMapping=not-programmed, realDoorbell=not-written, hostPhysicalUserVisible=false, directMmio=blocked, registerRead=blocked, registerWrite=blocked, and bootstrapGrant=blocked. The QEMU device-manager lifecycle proof constructs that cap object while the attached record is active, records devicemmio_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DeviceMmio.info Cap’n Proto result as devicemmio_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as devicemmio_cap_stale_after_revoke_result=devicemmio-stale-handle; the same stale object also fails the serialized method-0 path as devicemmio_cap_serialized_stale_after_revoke_result=invoke-failed. A later manifest-grant smoke explicitly releases the granted DeviceMmio cap through CAP_OP_RELEASE and proves a subsequent typed DeviceMmio.info call fails closed from userspace. A focused grant-cycle smoke now repeats that grant, release, and stale-info proof twice in sequence and asserts the second manager-grant-source acquire receives a fresh mapping generation after the first release; the same smoke also decodes both acquire/release cycles through the typed volatile HardwareAuditLog.snapshot surface. The focused hardware-audit interrupt-waiter smoke also decodes recent boot-time DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable lifecycle records from the current volatile 16-record snapshot window. The same smoke now uses the startSequence cursor to decode older retained DeviceMmio lifecycle rows that the default latest 16-record tail has skipped. A 2026-05-10 15:33 UTC manifest-grant follow-up turns DeviceMmio.map from admission-only into a read-only userspace VMA over the boot-preseeded BAR page already used by brokered read32. The typed smoke validates the active DeviceMmioOperation::Map authority check, rejects writable, executable, unknown, zero-size, unaligned, out-of-BAR, and overflow requests with typed no-side-effect results, reads the same QEMU BAR value through the returned userspace address and DeviceMmio.read32, rejects a duplicate active map, explicitly calls DeviceMmio.unmap, proves a second unmap is a typed no-op, remaps, and proves stale unmap fails closed after cap release. Release/drop/driver-crash/reset-disable cleanup revokes any borrowed user VMA before detaching the manager record. This is read-only BAR VMA evidence only: it does not add writable MMIO, volatile register writes, doorbells, host physical/IOVA exposure, post-userspace kernel MMIO mappings, IOMMU programming, durable/signed audit persistence, concurrent sharing semantics, or a production provider-driver consumer. A 2026-05-10 20:06 UTC follow-up promotes DeviceMmio.write32 to one bounded brokered volatile dword write through that same boot-preseeded kernel MMIO cache after active manager-attached handle, owner/state, policy/region, pure Write authority, dword alignment, decoded-BAR range validation, and a single provider-scoped claim derived from PCI MSI-X metadata, including BDF, BAR, BAR base, offset, and value. The focused proof writes the claimed virtio-rng MSI-X entry-0 vector-control mask dword, reads it back through both brokered read32 and the read-only userspace VMA, then proves an unclaimed message-address dword write leaves the original value unchanged. Invalid range and unclaimed calls remain typed no-write results, while stale or released handles fail closed before any write and do not return a write32 result payload. This does not add writable userspace BAR mappings, arbitrary register writes, doorbells, host physical/IOVA exposure, post-userspace arbitrary remaps, IOMMU programming, or a production provider-driver consumer.

A 2026-05-26 07:32 UTC follow-up (ddf-userspace-writable-devicemmio-interrupt) proves the cross-authority non-implication required by the gate (“holding one authority must not imply either of the others”). Each grant smoke’s granted CapSet is its sole authority source: the DeviceMmio smoke holds only console + device_mmio and the Interrupt smoke only console + interrupt. The smokes assert that the other DDF grants (interrupt/device_mmio and dmapool) are absent from the CapSet, and that the held cap cannot be reinterpreted as another interface because the kernel-delivered interface id is fixed at grant time (negative-authority ... result=ok lines, with the kernel “spawned … 2 caps” structural counterpart). This is a non-implication proof over the already-landed authorities; it adds no new kernel MMIO/IRQ/DMA surface. Real userspace wait/acknowledge over the live route with deferred LAPIC EOI is proven separately by the provider tx_interrupt/rx_interrupt consumer (make run-ddf-provider-consumer).

DMAPool Invariants

DMAPool is the only future userspace-facing authority that may cause a device-visible DMA address to exist.

  • Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
  • Handle identity: A pool operation checks the claimed device id, owner generation, pool id, and pool generation before changing pool state. Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse.
  • Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
  • Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
  • No raw grants: Userspace never receives an unrestricted host-physical address. A driver may receive an opaque DMA handle or an IOVA meaningful only to its DMAPool/device domain. It cannot turn that value into access to unrelated RAM.
  • Residency: DMA pages are committed before exposure to the device, resident for the entire device-visible lifetime, unswappable, and scrubbed before reuse by another owner.
  • Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
  • Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
  • Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
  • Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.

Device-visible memory authority is not ordinary MemoryObject authority. FrameAllocator and MemoryObject must not become raw physical-address escape hatches. A future shared-buffer transfer may share CPU-visible packet bytes after validation, but it does not by itself grant IOVA creation, descriptor programming, or device write authority.

For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The same invariants apply internally even though no userspace capability object is exposed yet.

Implementation note, 2026-04-24: the initial virtio-net transport uses kernel-owned frame-allocator pages for RX/TX split-virtqueue descriptor, available, and used rings plus the one-shot TX descriptor proof buffer, ARP TX buffer, ICMP TX buffer, smoltcp adapter/TCP TX buffers, and posted RX packet buffers. The smoltcp adapter copies completed RX frame bytes out of those device-written pages before handing them to the stack. Those pages are programmed into the device only by kernel code after modern PCI transport discovery and feature negotiation; no userspace process receives a DMA buffer, physical address, or BAR mapping.

Implementation note, 2026-05-02: the current QEMU virtio-net DMA path routes those kernel-owned pages through a bounded device_dma pool ledger. The net smoke proves live pool bytes, page counts, page-rounded MMIO mapping bytes, config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX submission/completion and in-flight descriptor accounting. This is the first kernel-owned DMAPool accounting proof; it does not expose userspace DMA, MMIO, or interrupt handles and does not complete the production S.11.2 hostile-smoke gate.

Implementation note, 2026-05-02 06:59 UTC: the kernel-owned device_dma ledger now has an explicit bounded budget/OOM policy for the current virtio-net proof path: 32 DMA pages, 131072 DMA bytes, queue depth 8, submission depth 8, four page-rounded MMIO mapping holds, 16384 MMIO mapped bytes, and three interrupt holds. make run-net emits a scratch-ledger device-dma: budget oom proof ... proof_result=ok line proving page and byte allocation over budget, overlarge queue depth, duplicate and over-budget MMIO holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and descriptor submission beyond queue depth all fail closed without mutating the live virtio-net ledger; the proof also revalidates the live ledger. This is a bounded prerequisite for the production DMAPool contract. It still does not expose userspace DMAPool, DeviceMmio, or Interrupt handles, wire real lifecycle hooks, program IOMMU remapping domains, or close the S.11.2 hostile smoke matrix.

Implementation note, 2026-05-02 16:31 UTC: attached device-manager DMAPool records now also carry that budget profile. The lifecycle and imported live-accounting proofs read page, byte, queue-depth, submission-depth, MMIO mapping, MMIO byte, and interrupt-hold budgets through the active device-manager record plus the matching DmaPoolHandle. Queue and submission depth remain per-queue limits; the manager proof records queue_count=2 plus derived aggregate in-flight/submission budgets and checks imported virtio-net accounting against those aggregate totals. This keeps the budget policy tied to manager state but still does not create userspace DMAPool handles or enforce budgets at userspace handle creation, transfer, or revoke.

Implementation note, 2026-05-02 21:13 UTC: the zero-live device-manager DMAPool lifecycle proof now validates a proof-scoped tampered AttachedDmaPoolBudgetPolicyRecord through the manager budget-policy helper while the attached pool record is active. The tampered record uses the wrong policy scope/source/label plus stricter page, byte, queue, in-flight, MMIO, interrupt, and submission budgets, and it fails closed before it can be treated as a usable policy. The QEMU proof records budget_policy_tamper_result=fail-closed, budget_policy_tamper_allocation=not-created, budget_policy_tamper_ledger=not-mutated, budget_policy_tamper_teardown=not-advanced, and budget_policy_tamper_side_effect=side-effect-blocked. This is still bounded metadata proof only: no userspace DMAPool handle is exposed, no production userspace DMA page is allocated, freed, or reused, and no real DMA teardown is claimed.

Implementation note, 2026-05-02 22:16 UTC: the manager-owned DMAPool budget-accounting proof now fails closed on accounting over the attached budget instead of only logging passive booleans. The positive zero-live and imported-live budget_*_within_policy=true labels call a helper that revalidates the active attached record, matching DmaPoolHandle, owner, active state, and attached budget policy before accepting the record accounting. While the zero-live record remains active, synthetic attached-accounting candidates exceed buffer count, page count, byte count, and the current aliased in-flight/submission total; each candidate fails closed before it can be treated as usable manager state. The QEMU proof records exact overrun reasons, no fake allocation, no ledger mutation, no teardown advancement, and side-effect blocking. A proof-scoped over-budget attach candidate now fails before pool generation allocation and records preserved generation state. This remains bounded manager-record evidence only; later grant slices add the first single-page bounce-buffer allocation/free authority, but multi-buffer allocation, DMA mapping, descriptor execution, IOMMU programming, production driver consumption, and S.11.2 hostile smokes remain open.

Implementation note, 2026-05-02 23:59 UTC, updated 2026-05-03: the schema and kernel now include a result-only DMAPool.info skeleton that can wrap a manager-issued DmaPoolHandle. The object validates the live device-manager record through validate_dmapool_record() before returning status labels such as userspaceDmaPool=manager-issued-bounce-buffer, realDma=not-attempted, hostPhysicalUserVisible=false, and directDma=blocked. The QEMU zero-live device-manager lifecycle proof constructs that cap object while the attached record is active, records dmapool_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DMAPool.info Cap’n Proto result as dmapool_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as dmapool_cap_stale_after_revoke_result=dmapool-stale-handle; the same stale object also fails the serialized method-0 path as dmapool_cap_serialized_stale_after_revoke_result=invoke-failed. The same proof now exercises DMAPool.allocateBuffer through call_with_table() on a real DmaPoolCap in a CapTable: it decodes bufferIndex=0, verifies CAP_CQE_TRANSFER_RESULT_CAPS, cap_count=1, the transfer-result record’s DMABUFFER_INTERFACE_ID, the non-transferable same-session result-cap hold, and DMABuffer.info through the returned result cap. Duplicate allocation and stale-after-revoke allocation both fail closed without adding result caps. The duplicate-active valid-size path now reports a structured schema result with result=dmapool-already-attached, reason=active-buffer-attached, sideEffect=side-effect-blocked, and bufferPresent=false; the duplicate path also preserves the manager generation counter. Invalid-size requests use the same no-result-cap response shape with result=dmapool-allocation-request-invalid, the exact request reason, sideEffect=side-effect-blocked, and bufferPresent=false. As of the 2026-05-08 DMAPool grant-source follow-up, the same bounded path is also available through KernelCapSource::DmaPool: the grant source attaches a fresh zero-live manager-owned pool record, stages matching zero-live release evidence, and lets the child mint one DMABuffer result cap. A 2026-05-09 release-order follow-up has the smoke explicitly release the parent DMAPool before the result DMABuffer; the parent detach remains pending until the DMABuffer frees the page and completes the staged zero-live pool detach. A later 2026-05-09 follow-up adds a typed DMABuffer.freeBuffer method for that bounded result cap: the method reuses the same FreeBuffer authority validation and page scrub/ledger/frame-free cleanup path as cap release, emits a free-buffer audit event, invalidates later DMABuffer.info, and makes the later cap release a no-op detach. A second bounded follow-up keeps the parent DMAPool live after that first explicit free, reallocates the same slot with a fresh slotGeneration, and then repeats the parent-first release proof on the second buffer. The focused read-side HardwareAuditLog.snapshot smoke also decodes both slot generations, both typed free-buffer records, the parent DmaPool release, and both no-op release-after-free records through the volatile audit cap. The run-net DMABuffer driver-crash and reset-disable proofs also cover the same pending-parent completion path so successful buffer cleanup cannot orphan the staged parent release. A later admission follow-up adds typed DMABuffer.submitDescriptor to the same manifest-granted bounce-buffer path: the method validates the active manager-attached buffer epoch through the existing DmaBufferOperation::SubmitDescriptor authority validator, echoes the queue/descriptor/length and generation identity, and proves the same call fails closed after freeBuffer revokes the old cap. A later symmetric follow-up adds typed DMABuffer.completeDescriptor to the same bounded path. The 2026-05-10 request-shaping follow-up routes both typed descriptor calls through a shared pure bounded descriptor validator: valid bounce-buffer requests return ok request labels plus queue_count=4, descriptor_count=8, and buffer_bytes=4096, while out-of-range queues/descriptors, zero submit lengths, submit lengths beyond the bounce buffer, and completion lengths beyond the bounce buffer fail closed as dmabuffer-descriptor-request-invalid with side-effect-blocked before any descriptor side effect. A later manager-accounting follow-up records only the bounded manager counter: submit returns manager-inflight-recorded and raises DMAPool.info live_inflight to 1, completion returns manager-inflight-completed and restores it to 0, and a valid completion with no outstanding submission returns dmabuffer-no-inflight-submission. Too-small descriptor result buffers are rejected before accounting mutation, and cap-table release drains bounded in-flight accounting before detaching the bounce buffer. The 2026-05-10 06:37 UTC follow-up makes this allocateBuffer/freeBuffer page lifecycle the first production-labeled single-page bounce-buffer allocation/free authority. The typed surfaces report userspaceDmaPool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, recordPool=userspace-bounce-buffer-live, zero-live-dmapool-bounce-buffer, and freeBuffer=bounce-buffer-page; the underlying device_dma ledger uses a manager-attached bounce-buffer helper and scrubs before frame free. A 2026-05-10 11:44 UTC follow-up extends that bounded manifest-granted path to two fixed manager-owned slots; a 2026-05-10 12:49 UTC follow-up extends the same path to three fixed manager-owned slots: slot 0, slot 1, and slot 2 can be live together, DMAPool.info reports three live buffers/pages while all are attached, a fourth allocation fails closed as dmapool-already-attached, and slot 0 can be freed and reused with a fresh generation while slots 1 and 2 remain live. There is still no allocation beyond those three fixed slots, real device-visible DMA mapping, host physical address or IOVA exposure, BAR mapping, production descriptor-ring mutation, CQ publication, IOMMU programming, or production driver consumer. Stale allocation attempts preserve the live backing, and page allocation failure occurs before buffer-generation allocation so it does not burn a generation. The 2026-05-10 13:45 UTC follow-up (3bbeb3d4) adds explicit typed DMABuffer.unmap for the mapped bounce-buffer userspace VMA. The method validates the live DMABuffer handle before reporting success or no-op, removes only the borrowed VMA owned by that mapping for the calling process, and publishes the mapping as absent only after the borrowed-range ownership check, page-table unmap, TLB wait, and waiter cleanup succeed. While teardown is in progress, concurrent map/free/release paths fail closed against an in-progress mapping state. A second unmap returns dmabuffer-mapping-absent / no-user-mapping with no side effect. This is userspace VMA cleanup only: it does not free or scrub the bounce page, detach the buffer record, change DMAPool.info live buffer/page/in-flight counts, program or remove real DMA or IOMMU mappings, expose host physical/IOVA addresses, mutate descriptor rings, publish CQ entries, or add a production driver consumer. The 2026-05-10 14:12 UTC follow-up moves bounded descriptor accounting from a single pool-global descriptor identity to per-slot state on each live manager-owned DMABuffer record. DMAPool.info live_inflight remains the aggregate sum across live slots. A valid submit on slot 0 and a valid submit on slot 1 can coexist; duplicate submit on either same slot still fails closed; mismatched completion preserves that slot’s descriptor without touching other slots; matching completion of slot 0 decrements the aggregate while slot 1 remains in flight; explicit freeBuffer of an in-flight slot fails closed; and cap-release/process-exit cleanup drains only the releasing slot’s descriptor before detach. This is still bounded manager accounting and does not mutate descriptor rings, publish CQ entries, expose host physical or IOVA addresses, attempt direct DMA, program IOMMU state, or add a production driver consumer. The 2026-05-10 18:11 UTC follow-up makes the single manager-owned bounce-buffer page exclusive between userspace borrowed-VMA ownership and manager in-flight descriptor ownership for each DMABuffer cap. A valid submit while the same cap still has a live mapping fails closed as dmabuffer-mapping-live / user-mapping-live before manager in-flight accounting changes; explicit unmap restores submit. A valid map while the slot has an in-flight descriptor fails closed as dmabuffer-inflight-submission / in-flight-submission, returns addr=0, and does not publish a borrowed VMA; matching completion restores map for that slot. The lock order remains cap mapping state before manager validation, with no address-space lock held across manager state mutation. The 2026-05-10 19:29 UTC follow-up adds bounded completion data on that same manager-owned bounce-buffer page. The successful matching DMABuffer.completeDescriptor path keeps the existing manager-inflight-completed result labels, validates the active owner, pool/slot generation, queue/descriptor identity, and submitted length, then writes a deterministic byte pattern into only the accepted completionLength bytes before clearing the in-flight record. Because submit is blocked while a cap-owned borrowed VMA is live and map is blocked while the slot is in flight, the write happens while no live user VMA exists for that slot; a later successful map lets userspace read the pattern. Invalid requests, stale caps, no-inflight completions, descriptor mismatches, length-exceeded completions, mapped-live cases, and after-free calls do not write. This is userspace-visible bounce-buffer completion data, not device DMA completion: there is still no descriptor-ring mutation, CQ publication, direct DMA, host physical/IOVA exposure, IOMMU programming, or production driver consumer. Implementation note, 2026-05-10 04:40 UTC: duplicate-active valid-size and invalid-size DMAPool.allocateBuffer requests now use schema result data for domain rejection instead of an application-exception label. The no-result-cap response reports either result=dmapool-already-attached / reason=active-buffer-attached or result=dmapool-allocation-request-invalid with the exact request reason, plus sideEffect=side-effect-blocked and bufferPresent=false, before any allocation side effect. The 2026-05-10 live-accounting follow-up also carries that bounded frame into the attached DMAPool record exposed by typed DMAPool.info: the manager record starts as zero-live-dmapool-bounce-buffer, changes to userspace-bounce-buffer-live with one, two, or three live 4096-byte pages while manager-attached DMABuffer result caps exist, and returns to zero-live after typed DMABuffer.freeBuffer or cap release scrubs/releases every live bounce page. The same lifecycle proof validates that active manager-record accounting against the attached budget policy before treating allocation as usable state. This still does not expose a device-visible DMA address, IOVA, host physical address, production descriptor side effect, DMA mapping, or production driver consumer.

Implementation note, 2026-05-11 03:00 UTC: the manifest-granted manager-owned fixed bounce-buffer DMAPool path now has its own device-manager budget policy instead of importing the live virtio-net device_dma policy. The policy covers three live buffers/pages, 12288 bytes, four queues, eight descriptors per queue, one in-flight descriptor per live slot, zero MMIO mappings/bytes, and zero interrupt holds. DMAPool.allocateBuffer validates the next-live manager accounting against that policy before slot selection, frame allocation, generation allocation, result-cap minting, or manager ledger mutation. With all three fixed slots live, a fourth valid-size allocation returns no result cap and reports result=dmapool-budget-exceeded, reason=over-buffer-budget, sideEffect=side-effect-blocked, and bufferPresent=false. Imported live virtio-net proof records continue to use the device_dma:virtio-net budget policy. This remains the bounded manager-owned bounce-buffer path: direct DMA is blocked, host physical addresses and IOVAs stay hidden, descriptor rings and completion queues are not mutated, and IOMMU/remapping plus production driver consumption remain out of scope.

Implementation note, 2026-05-11 06:10 UTC: the same fixed-slot manager-owned DMAPool family now revalidates current budget accounting before publishing an allocated DMABuffer result cap, before acquire-audit publication, before parent DMAPool release intent or detach, before grant-rollback/drop/teardown detach, before pending-parent release completion, before DMABuffer page release, and before descriptor-completion state advancement. The focused grant smoke labels the full-pool budget rejection as no leaked result cap, no generation burn, no ledger mutation, and no stale authority publication. DeviceMmio and Interrupt budget propagation is not changed by this slice.

Implementation note, 2026-05-03 02:05 UTC: the schema and kernel now include a result-only DMABuffer.info skeleton that can wrap the manager-attached DmaBufferHandle record already issued inside the zero-live DMAPool lifecycle proof. The object validates the live manager-owned buffer record through validate_dmabuffer_record() and the existing pure DMA-buffer validator before returning status labels such as userspaceDmaBuffer=manager-issued-bounce-buffer, managerRecord=validated-active, bufferRecord=manager-attached-buffer, realDma=not-attempted, hostPhysicalUserVisible=false, directDma=blocked, descriptorSubmit=manager-inflight-accounting, descriptorComplete=manager-inflight-accounting, freeBuffer=bounce-buffer-page, and bootstrapGrant=blocked. The QEMU zero-live lifecycle proof constructs that cap object while the buffer record is active, records dmabuffer_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned DmaBuffer.info Cap’n Proto result as dmabuffer_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as dmabuffer_cap_stale_after_revoke_result=dmabuffer-stale-handle; the same stale object also fails the serialized method-0 path as dmabuffer_cap_serialized_stale_after_revoke_result=invoke-failed. Later bounded grant slices expose this result cap through the DMAPool manifest-grant path, add typed DMABuffer.freeBuffer, and add bounded userspace bounce-buffer DMABuffer.map / DMABuffer.unmap plus manager-accounted request-shaped DMABuffer.submitDescriptor / DMABuffer.completeDescriptor; the path still exposes no real DMA, descriptor-ring mutation, CQ publication, host physical address or IOVA export, production page cleanup/reuse, or production userspace DMABuffer completion.

Implementation note, 2026-05-11 11:22 UTC: the provider-consumer smoke now uses that same bounded bounce-buffer path to prove a descriptor-ring-equivalent provider side effect. After DMABuffer.submitDescriptor validates live owner/pool/slot authority, descriptor queue/id/length bounds, no live user mapping, and no duplicate in-flight descriptor, the manager scrubs the page and writes a provider-visible shadow descriptor entry with magic, queue, descriptor id, submitted length, and flags before writing the existing submit marker and committing the in-flight record. DMABuffer.completeDescriptor still writes completion bytes only inside the validated completion length, and the smoke proves the shadow entry and marker remain visible outside that completion window. Provider-effect submits shorter than the 24-byte shadow-descriptor-plus-marker footprint are rejected as a typed no-side-effect boundary, even though the shared descriptor request shape is otherwise valid. This is a bounded bounce-buffer side effect only; no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA, host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-11 12:01 UTC: the same submit path now replaces the shadow-descriptor payload with a selected provider-owned queue entry plus marker. The entry records queue magic, queue id, tail, descriptor id, submitted length, and flags after descriptor authority validation and the submit scrub; make run-ddf-provider-consumer maps the buffer after completion and proves the queue entry and marker remain visible outside the completion window. Provider-effect submits shorter than the current 72-byte provider queue-entry-plus-marker footprint are rejected before in-flight accounting or provider-visible mutation. This remains bounded bounce-buffer evidence only: no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA, host physical address, or remapping-domain claim is added.

Implementation note, 2026-05-12 20:30 UTC: the accepted DMABuffer.submitDescriptor path now constructs the candidate in-flight descriptor in manager-local state and validates the resulting DMAPool budget/accounting before scrubbing or writing provider-visible bounce-buffer bytes. The provider-consumer smoke snapshots the selected provider queue-entry and marker bytes, drives a short provider-effect submit rejection, and proves both those bytes and live_inflight=0 are preserved. The selected virtio-net TX publication gate remains separately bounded after provider-entry write: quiesced publication still fails with no extra pin and no doorbell, not with rollback of the already-written shadow entry.

Implementation note, 2026-05-11 14:39 UTC, branch commit f04a14f4: the selected provider-owned queue entry now carries a staged claimed virtio-net notify-offset admission record instead of only a “requires claim” gate. The selected queue 1 path records accepted notify-offset admission plus blocked wrong-queue and wrong-offset admissions after descriptor authority validation and submit scrub; a separate queue 0 submit proves non-selected queues remain neutral and blocked from selected-backend doorbell metadata. This is not a real doorbell path: no virtio-net notify BAR handle is granted, no notify register is written, no real virtio-net descriptor ring is mutated, and production userspace NIC readiness is not claimed.

Current QEMU evidence: the same make run-net path now exports a bounded live-pool snapshot from the kernel-owned virtio-net device_dma ledger and feeds it into a device-manager DMAPool record proof. The live record carries buffer/page count, live bytes, current in-flight submissions, committed/resident/unswappable flags, and scrub-before-release policy. The device-manager proof rejects both DmaMappingsRemoved and teardown detach while that authoritative ledger snapshot remains live. The proof now calls the device_dma teardown-evidence API and records the expected authoritative-ledger-live block with matching imported live accounting, then reports completion as deferred because no authoritative zero-live/scrubbed evidence is available for the live virtio-net ledger. It does not zero the imported record to simulate teardown, does not claim DmaMappingsRemoved, terminal Dead, or release for the live virtio-net pages, and does not scrub or free live virtio-net DMA pages. This is still a prerequisite record-accounting proof: the current pages remain kernel-owned, only bounded info-skeleton hardware cap grants are exposed to userspace, production page release hooks for live virtio-net DMA are not wired, IOMMU remapping is not programmed, and S.11.2 hostile smokes remain open.

The same smoke emits a separate device_dma scratch proof for the positive zero-live teardown-evidence path: teardown_evidence() fails closed before quiesce and scrub markers, rejects one-marker states, and reports authoritative-ledger-zero-live only after both markers are set. The scratch proof revalidates the live virtio-net ledger but does not mutate, zero, scrub, free, or claim teardown completion for real virtio-net DMA pages.

Implementation note, 2026-05-03 03:21 UTC: the zero-live device-manager DMAPool lifecycle proof now consumes that scratch zero-live evidence before final pool detach, and binds that evidence to the attached record source. The manager-owned zero-live record is labeled device-manager / zero-live-dmapool-bounce-buffer; imported live records keep the source labels from the authoritative device_dma snapshot. After the proof-scoped buffer record is actively freed and detached, a zero-live pool detach with mismatched virtio-net / kernel scratch evidence fails closed as dmapool-zero-live-evidence-invalid, and detach without authoritative evidence fails closed as dmapool-zero-live-evidence-absent. Only scratch authoritative-ledger-zero-live evidence carrying the same record source plus both quiesce and scrub markers allows the manager-owned record to detach and the revocation path to advance to DmaMappingsRemoved, Dead, and release. This remains scratch/no-real-DMA evidence: it does not tear down live virtio-net pages, program or remove IOMMU mappings, expose userspace DMAPool mapping/descriptor authority, or claim production page cleanup beyond the bounded manager-attached bounce page.

The current scratch proof set also covers stale DMA page handles without touching real virtio-net pages: reusing the same synthetic physical page bumps the DMA page generation, the old handle fails closed as stale-dma-handle, wrong-queue and wrong-label frees preserve the active page record, and duplicate free remains rejected. Production userspace DMAPool stale-handle smokes, descriptor-abuse coverage, revoke/reset races, and real quiesce/scrub/release remain open.

Implementation note, 2026-05-02 23:23 UTC: the kernel-owned virtio-net device_dma page release path now validates the DeviceDmaAllocation against the live ledger before any scrub or frame-allocator call, scrubs the frame through the HHDM mapping, removes the ledger entry only after scrub succeeds, and then returns the frame to the allocator. The frame scrub helper checks frame alignment, HHDM/allocator initialization, range, and allocated state before zeroing the page. make run-net emits a bounded device-dma: release scrub proof line using a proof-only kernel-owned page: stale generation, wrong queue, and wrong label release attempts fail before scrub, frame free, or ledger mutation; the active release path records scrub_before_frame_free=true and ledger_removed_after_scrub=true. This is still no-userspace-handle/no-real-teardown evidence for the current kernel-owned virtio-net path only; production DMAPool handles, real device-manager teardown hooks, IOMMU/bounce-buffer mapping removal, hostile stale-DMA smokes, and full page lifecycle cleanup remain open.

Implementation note, 2026-05-02 12:28 UTC: the same scratch proof family now covers stale DMA completion ordering without touching real virtio-net pages. A synthetic reused DMA page slot bumps generation, stale completion validation checks the page generation before queue-completion accounting, and the old completion fails closed as stale-dma-handle before completion counters can underflow or the reused page/submission state can mutate. This is still prerequisite evidence only: production userspace DMAPool hostile smokes, reset/revoke races with outstanding descriptors, CQ notification publication, real quiesce/scrub/release, and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 14:03 UTC: that scratch stale-DMA-completion proof now adapts the synthetic DMA buffer slot into capos-lib::device_authority before completion accounting can mutate. The QEMU line records current-handle validation as ok, stale same-slot reuse as stale-slot-generation, and side-effect-blocked, then preserves the existing stale-dma-handle completion outcome. This remains a scratch/no-real-DMA validator adapter, not production userspace DMAPool authority or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 15:15 UTC: the same scratch proof family now adds a paired stale-completion publication check. A synthetic reset bumps the device owner generation so an old completion fails as stale-owner-generation with side-effect-blocked before any CQ publication. The same-slot reuse path then fails the old completion as stale-dma-handle, preserves the new submission accounting, and records both cq_publication_blocked=true and new_owner_exposure_blocked=true. This is still scratch/no-real-DMA evidence; production userspace DMAPool completion notification, real hardware stale-completion injection, reset/revoke races, and IOMMU or bounce-buffer teardown remain open.

Implementation note, 2026-05-02 12:46 UTC: the device-manager interrupt handoff proof now includes a bounded stale IRQ after-detach check. After an attached interrupt route is detached, the proof delivers the old LAPIC vector through the dispatch path and requires stale_irq_delivery_after_detach to be unregistered with stale_irq_wake_blocked=true; the old route handle also continues to fail as interrupt-stale-route. This is prerequisite evidence for route teardown ordering only. It does not cover a pending hardware IRQ across reset, userspace Interrupt waiter wakeup semantics, or reassignment reuse by a new owner.

Implementation note, 2026-05-02 14:24 UTC: the same device-manager interrupt handoff proof now adapts the attached source into capos-lib::device_authority before exercising the active wait path. After revoke begins, the old handle fails the pure validator as stale-owner-generation with side-effect-blocked, then the proof preserves the existing interrupt-stale-route, detached-vector unregistered, and stale_irq_wake_blocked=true checks. This remains a proof adapter, not production userspace Interrupt handles, real waiters, reset/reassignment reuse proof, or S.11.2 hostile-smoke completion.

Implementation note, 2026-05-02 14:51 UTC: the interrupt handoff smoke now adds two bounded stale IRQ ordering points. After revoke begins and while the route is still attached, delivery to the old vector remains masked, matches the attached route generation, reports wake blocking, and leaves the old route delivery count unchanged. During the reset phase, a synthetic route-registry same-vector reuse proof re-registers and claims the route with bumped source and route generations, then shows delivery to that vector is still masked, matches the new route generation, and leaves the reused route delivery count unchanged. This is route-manager prerequisite evidence only: it is not a true pending hardware MSI/reset hostile smoke, does not involve userspace Interrupt waiters, and does not prove DMA buffer reuse race closure.

Implementation note, 2026-05-03 18:48 UTC: the interrupt handoff smoke now snapshots a bounded pending IRQ token from the old vector, source id, source generation, and route generation before revocation. Checking that token after revoke blocks as stale-pending-irq-masked with reason route-masked; after detach it blocks as stale-pending-irq-unregistered; and after reset/reuse it blocks as stale-pending-irq-generation with reason source-route-generation-mismatch. Each check records side-effect-blocked, wake blocking, and unchanged delivery counts, and the reset/reuse check records that the new route did not receive a delivery count. The same proof rejects a malformed pending token with a zero generation as stale-pending-irq-invalid-state before any delivery-count mutation. This was bounded token-generation evidence only and did not inject a real pending MSI across reset; the real-int $vector injection added at 2026-05-05 18:17 UTC (see below) closes the S.11.2.7 stale IRQ hostile-smoke gate row by exercising the production CPU exception entry path across the same revoke, detach, and reset/reuse boundaries.

Implementation note, 2026-05-09 18:12 UTC: the pending IRQ token decision path now has a pure capos-lib::device_authority validator. Host tests cover current-route acceptance and the same fail-closed label space used by the kernel/QEMU proof: stale source generation, stale route generation, both generations changed after reuse, source mismatch, route masked, route unregistered, invalid route state, invalid owner, malformed zero-generation or unassigned source identity, and unsupported vector. The kernel still snapshots the live dispatch slot and delivery count, but delegates the pending-token identity/state decision to that shared helper before returning stale-pending-irq-* labels. This is validator/adapter coverage only; it does not expose production userspace Interrupt waiters or wait/ack/mask/unmask authority.

Implementation note, 2026-05-05 18:17 UTC: the interrupt handoff smoke now fires a real INT $vector instruction at the device MSI vector at three points across the revoke/reset/reuse boundary, exercising the production IDT entry, extern "x86-interrupt" stub, record_lapic_delivery dispatch slot read, and LAPIC EOI write rather than the helper-call path the prior proofs used. The new proof scope strings drop “no-real-msi” and read stale-vector-after-detach-real-int-vector-injected, manager-attached-claimed-masked-after-revoke-real-int-vector-injected, route-registry-vector-reuse-during-reset-real-int-vector-injected, and bounded-pending-irq-token-generation-real-int-vector-cross-reset-injection. Each injection point requires the slot’s delivery count to remain zero before and after the real INT, and the post-INT outcome to match masked (after revoke, route still attached but masked), unregistered (after detach, slot cleared), and masked (after reset/reuse, slot now belongs to a freshly registered+claimed route with bumped source and route generations). The proof emits a closure summary s11_2_7_proof_scope=s11-2-7-stale-irq-after-reset-real-int-vector-cross-reset-injection-no-userspace-waiter, s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, and s11_2_7_stale_ack_blocked=true. This closes the S.11.2.7 row of the hostile-smoke acceptance matrix at make run-net (which invokes tools/qemu-net-smoke.sh). It does not yet create a userspace Interrupt waiter object; the in-flight delivery is observed via the kernel-owned dispatch slot atomic state machine that the production path consumes. S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note below).

Implementation note, 2026-05-05 19:37 UTC: the device-manager hostile-smoke suite now closes the S.11.2.8 stale-DMA-completion-after-reset row. prove_qemu_stale_dma_completion_handoff claims a fresh probe-then-driver record on the virtio-net PCI BDF (separate from the live virtio-net driver state) and walks it through the same revoke, detach, and reset/reuse boundaries S.11.2.7 uses. At each boundary the proof allocates a real virtio-net DMA page through the production device_dma::allocate_virtio_net_page helper, frees the page through device_dma::free_virtio_net_page, reallocates so the live ledger’s page generation advances, and synthesizes a stale DeviceDmaAllocation keyed to the live phys with a decremented generation. The synthesized stale handle is then fed to the production device_dma::record_virtio_net_completion_for_allocation path – the same function the live virtio-net Virtqueue::record_used_completion_for_allocation invokes after descriptor tracking validates a hardware used-ring entry. The production validator rejects each stale injection as stale-dma-handle before any queue accounting decrement, completion side effect, CQ publication, or new-owner memory exposure. The bounded run-net proof records real_completion_inject_after_revoke_result=stale-dma-handle, real_completion_inject_after_detach_result=stale-dma-handle, real_completion_inject_after_reset_reuse_result=stale-dma-handle, all three with side-effect-blocked, queue_account_preserved=true, live_page_preserved=true, cq_publication_blocked=true, new_owner_exposure_blocked=true, freed_buffer_unchanged=true, and generation_bumped=true, plus a closure summary s11_2_8_proof_scope=s11-2-8-stale-dma-completion-after-reset-real-free-realloc-cross-revoke-detach-reset-reuse-no-userspace-dmapool, s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, and s11_2_8_accounting_underflow_blocked=true. The new shape is enforced in tools/qemu-net-smoke.sh and runs from make run-net. This is the production paired stale-DMA-completion proof showing old completions cannot publish stale CQ notifications or expose new-owner memory after real revoke, detach, and reset/reuse boundaries with real free + realloc page generation advances on the live kernel-owned ledger; S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note below). Userspace DMAPool handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups.

Implementation note, 2026-05-05 20:49 UTC: the S.11.2.9 hostile-smoke coverage row of the acceptance matrix is closed by aggregating every matrix-row proof line into the make run-net -> tools/qemu-net-smoke.sh gate. Every proof line referenced by the matrix has at least one assertion in the harness today; the assertion shape varies by row. The two driver-crash lines wired by that gate slice, the existing S.11.2.8 device-manager: dma completion handoff proof closure-summary line, and the S.11.2.7 device-manager: interrupt handoff proof closure-summary line (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use anchored extended-regex assertions (field-by-field match plus proof_result=ok[[:cntrl:]]?$ trailing anchor); other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string grep -Fq assertions on the emitted proof lines. The two previously unasserted lines wired by this slice are device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio and device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt. Both proofs were already emitted by the kernel on every boot (via prove_qemu_devicemmio_driver_crash_hook and prove_qemu_interrupt_driver_crash_hook in kernel/src/device_manager/proofs.rs) and exercise the explicit driver-crash teardown trigger path with a stale rerun noop, validate-live revoked cap state, and cap_release_after_crash as noop. The chosen gate strategy keeps S.11.2.9 inside make run-net rather than splitting into a separate make run-hostile-smokes target, because all six matrix rows depend on the same virtio-net device bring-up state (probe-then-driver records on the virtio-net BDF, real IDT vector injection, real DMA page free + reallocate). A separate target would duplicate the bring-up cost without adding coverage. Tightening the remaining unanchored assertions to the same anchored shape is a follow-up harness-hardening task and is not part of S.11.2.9 closure because each affected proof line is still uniquely identified by its emitted prefix and the asserted field set. Production userspace DMAPool/DeviceMmio/Interrupt handles, real device-manager page quiesce/scrub/release hooks, hardware-backed provider-driver Interrupt wait/ack dispatch beyond the current bounded route-dispatch waiter proof, durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap, and IOMMU domain programming all remain open as separate follow-ups tracked in docs/backlog/hardware-boot-storage.md and the docs/tasks/README.md userspace-driver-transition bullet.

Implementation note, 2026-05-08 09:44 UTC: the same make run-net gate now also asserts cap-specific DMA driver-crash proofs. DmaBufferCap routes the explicit trigger through the bounded FreeBuffer cleanup path and proves page scrub/ledger/frame-free labels before stale rerun and post-trigger cap release both return noop; DmaPoolCap routes the explicit trigger through the zero-live evidence-gated detach path and proves authoritative zero-live, quiesced, and scrubbed evidence labels before stale rerun and post-trigger cap release return noop.

Implementation note, 2026-05-03 01:05 UTC: the schema and kernel now include a result-only Interrupt.info skeleton that can wrap a manager-issued device handle plus the attached DeviceInterruptRoute. The object validates the live manager record, owner, claimed route, and attached route record through validate_interrupt_record() before returning status labels such as userspaceInterrupt=manager-issued-skeleton, managerRecord=validated-active, routeRecord=manager-attached-route, realInterruptDelivery=not-delivered, wait=admission-check-only, acknowledge=admission-check-only, mask=route-state-control, unmask=route-state-control, and bootstrapGrant=blocked. The interrupt handoff QEMU proof constructs that cap object while the route record is active, records interrupt_cap_info_result=ok, exercises the serialized CapObject::call(0, &[]) path and decodes the returned Interrupt.info Cap’n Proto result as interrupt_cap_serialized_call_result=ok, then verifies the same cap fails closed after revoke begins as interrupt_cap_stale_after_revoke_result=interrupt-stale-handle; the same stale object also fails the serialized method-0 path as interrupt_cap_serialized_stale_after_revoke_result=invoke-failed. A later manifest-grant smoke explicitly releases the granted Interrupt cap through CAP_OP_RELEASE and proves a subsequent typed Interrupt.info call fails closed from userspace. A focused grant-cycle smoke now repeats that grant, release, and stale-info proof twice in sequence and asserts the second manager-grant-source acquire preserves the source generation and receives a fresh route generation after the first release; the same smoke also decodes both acquire/release cycles through the typed volatile HardwareAuditLog.snapshot surface. The focused hardware-audit interrupt-waiter smoke also decodes recent boot-time DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable lifecycle records from the current volatile 16-record snapshot window. The same smoke now uses the startSequence cursor to decode older retained DeviceMmio lifecycle rows that the default latest 16-record tail has skipped. A 2026-05-09 19:18 UTC follow-up adds a bounded Interrupt.wait admission method to that skeleton. The method validates the same manager-attached route, snapshots a pending-token candidate, delegates to the shared capos-lib::device_authority pending-IRQ validator, and returns typed labels through capos-rt; the focused grant smoke asserts the current masked-route result stale-pending-irq-masked, reason route-masked, side-effect-blocked, matching token/current source and route generations, unchanged delivery counts, no waiter wake, and fail-closed behavior after cap release. This is bounded manager-issued skeleton evidence only: there is no blocking wait, real hardware acknowledgement, real hardware mask/unmask side effect, interrupt delivery authority, real waiter object, production Interrupt completion, durable/signed audit persistence, or concurrent sharing claim. A 2026-05-09 23:21 UTC follow-up adds bounded Interrupt.acknowledge admission to the same skeleton. It validates the manager-attached route through the existing Acknowledge authority path, returns admission-check-only, interrupt-ack-not-attempted, and side-effect-blocked, and proves delivery counts remain unchanged with no waiter wake or hardware acknowledgement. A 2026-05-09 23:52 UTC follow-up adds bounded Interrupt.mask and Interrupt.unmask admission to the same skeleton. They validate the manager-attached route through the existing Mask and Unmask authority paths, return admission-check-only, interrupt-mask-not-attempted / interrupt-unmask-not-attempted, and side-effect-blocked, and prove route state and delivery counts remain unchanged with no hardware mask/unmask, waiter wake, or IRQ delivery. A 2026-05-10 04:01 UTC follow-up promotes those methods to bounded route-state control over the manager-attached dispatch slot. Interrupt.unmask now changes claimed-masked to driver-unmasked, Interrupt.mask changes it back to claimed-masked, and both preserve delivery counts while still avoiding hardware MSI/MSI-X table programming, waiter wakeup, hardware acknowledgement, or real IRQ delivery. A 2026-05-10 22:54 UTC follow-up wires real waiter completion to the existing route-dispatch delivery counter from scheduler/poll context. The poller observes matching delivered routes by vector, source generation, and route generation without taking the waiter-table lock in the IRQ dispatch path, then revalidates the manager-attached route under the route-post exclusion before posting the deferred cap completion. The focused grant smoke proves the first unmasked manifest-granted wait completes as interrupt-delivered / waiter-completed-irq with real_interrupt_delivery=delivered and an advanced delivery count, while a second unmasked wait still remains pending until Interrupt.mask completes it through the existing interrupt-waiter-cancelled / waiter-completed-no-irq path. Stale, masked, released, reset, or reused routes do not wake as delivered IRQs. This remains a bounded route-dispatch proof; it does not program hardware MSI/MSI-X tables, acknowledge hardware, add provider-driver interrupt consumption, or claim hostile hardware isolation.

Implementation note, 2026-05-03 13:49 UTC: the result-only DMAPool.info, DMABuffer.info, DeviceMmio.info, and Interrupt.info surfaces now return numeric identity fields alongside the conservative status labels. The fields mirror the documented handle identity shape: deviceId, BDF bus/device/function, owner generation, and the relevant pool id/generation, buffer slot/generation, BAR/mapping id/generation, or interrupt source/generation/route generation. The QEMU proof logs and net smoke assert the active serialized method-0 decode for those fields, while stale method-0 calls still fail closed as invoke-failed. This remains a result-only manager-issued skeleton surface; it does not add production DMA allocation, free, map, submit, or completion authority, production MMIO mapping or doorbell authority, production interrupt wait/ack/mask/unmask authority, real DMA page cleanup/reuse, hostile hardware isolation, or S.11.2 completion.

Implementation note, 2026-05-03 16:37 UTC: the bounded interrupt route identity skeleton now carries separate source and route generations end to end. DeviceInterruptRoute, LegacyIoApicInterruptRoute, route records, diagnostics summaries, dispatch-slot metadata, and the device-manager attached interrupt bridge store both generations. Registration allocates both fields, PCI MSI-X route reassignment preserves source generation while bumping only the route generation, release/re-register allocates both generations fresh, and Interrupt.info returns the independent values. The QEMU net smoke asserts the split in PCI and legacy route logs, metadata proof logs, handoff proof logs, serialized Interrupt.info, and cap-release proof logs. This closes only the bounded identity proof gap; it does not expose production userspace Interrupt authority, create real waiters, or complete the S.11.2 hostile IRQ smoke matrix.

DeviceMmio Invariants

DeviceMmio is register authority, not memory authority.

  • Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
  • Handle identity: Each call checks the claimed device id, owner generation, BAR or subrange mapping record, and mapping generation before mapping, unmapping, reading, or writing registers.
  • Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
  • Ownership: At most one mutable driver owner controls a device function’s MMIO at a time. Management capabilities may inspect topology, but register writes require the claimed DeviceMmio object.
  • No DMA implication: Mapping registers does not grant any DMA buffer, frame allocation, interrupt, or config-space authority. Doorbell writes are accepted only as effects of register access; descriptor validity is enforced by DMAPool before queues are made visible to the device.
  • Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
  • Reset: Revoking the final mutable DeviceMmio owner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.

Interrupt Invariants

Interrupt is event authority for one routed source.

  • Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
  • Handle identity: Each wait, mask, unmask, and acknowledge checks the claimed device id, owner generation, source id, source generation, route generation, and any live waiter generation before affecting delivery state.
  • Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
  • Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
  • Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
  • Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
  • No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.

Revocation Ordering

Device revocation must follow a fixed order:

  1. Stop new submissions by invalidating the driver’s user-visible handles.
  2. Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
  3. Mask or detach interrupts.
  4. Quiesce virtqueues or device command queues.
  5. Reset or disable the device if in-flight DMA cannot be accounted for.
  6. Remove IOMMU mappings or invalidate bounce-buffer handles.
  7. Scrub and free DMA pages.

This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.

Implementation should represent the order as an explicit device-owner state machine rather than as ad hoc booleans:

#![allow(unused)]
fn main() {
enum DeviceOwnerState {
    Active,
    RevokingHandles,
    MmioRevoked,
    InterruptsDetached,
    QueuesQuiesced,
    Resetting,
    DmaMappingsRemoved,
    Dead,
}
}

No path may free or reassign DMA pages until the state has reached QueuesQuiesced with all in-flight descriptors accounted for, or Resetting has completed and the device can no longer write old buffers. Dead means all user-visible handles are invalid, interrupts are detached or masked, DMA mappings are removed, and pages have been scrubbed or transferred to a trusted owner.

Hard invariants:

  • DMA pages cannot be freed before QueuesQuiesced or a completed Resetting transition proves old DMA writes are stopped.
  • MMIO write authority must be revoked before DMA ownership teardown.
  • Interrupt reassignment cannot happen before old pending notifications are drained or generation-invalidated.
  • Device reset is mandatory if in-flight DMA cannot be proven stopped.

Future Userspace-Driver Transition Criteria

Moving NIC or block drivers out of the kernel is gated by Security Verification Track S.11.2. The gate is only open when all rows below are implemented and demonstrated. The S.11.2.N labels are local checklist row IDs for this gate.

The completed Device Driver Foundation selected milestone used this track as the prerequisite for the DMAPool, accounting, and hostile-smoke sub-gate. Future DDF follow-ups still use these rows as the userspace-driver transition gate: generic MSI/MSI-X dispatch and second-device reuse may land first, but userspace DeviceMmio and Interrupt exposure stays blocked until these rows pass.

Production DMAPool Ledger Prerequisite

Before userspace NIC or block drivers receive DeviceMmio, Interrupt, or DMAPool handles, the device manager must own one ledger of record for each claimed device. That ledger is the authoritative source for every device-visible hold, not a diagnostic mirror of separate subsystems.

The ledger records:

  • DMA pool bytes reserved and live;
  • DMA buffer count, slot generation, and owner generation;
  • mapped userspace DMA VMAs, quiesce state, scrub state, and release eligibility for each attached DMA pool;
  • descriptor and ring depth limits, including live in-flight submissions and completions;
  • page-rounded MMIO mappings and their owning DeviceMmio generations;
  • interrupt holds, waiter generations, and routed-source generations;
  • budget and OOM policy for allocation, queue growth, mapping, and interrupt attachment;
  • teardown state in the device-owner state machine.

Every operation that creates, consumes, or releases device-visible authority must update this ledger as part of the same ownership transaction that changes device-manager state. That includes DMA buffer allocation/free, descriptor submission, completion accounting, BAR mapping/unmapping, interrupt attach/detach, reset, revoke, process exit, and capability release.

Implementation note, 2026-05-03 13:18 UTC: the QEMU virtio-rng metadata path now runs a bounded teardown-trigger proof for cap-release, process-exit, driver-crash, reset-disable, interrupt-waiter, future-devicemmio, and future-dmapool. Each trigger row sequentially claims and transfers the same PCI function, begins revocation, walks the existing device-owner state machine to Dead, releases only after Dead, and proves generation bumps, stale handle rejection, direct state-skip rejection, pre-Dead release rejection, and per-trigger coverage without duplicates. The cap-release row attaches a bounded manager-owned DeviceMmio record to the active driver handle, removes a DeviceMmioCap from a cap table, runs the CapOpRelease hook, and records cap-table removal plus detached/stale manager validation before normal revocation. The process-exit row attaches the same bounded DeviceMmio record shape to a real proof Process, runs Process::release_caps_for_exit(), and records cap-table removal plus detached/stale manager validation before normal revocation. The driver-crash, reset-disable, and interrupt-waiter rows register and claim bounded PCI MSI-X lifecycle-probe routes, attach them to the device manager, prove InterruptsDetached is blocked as interrupts-attached, detach and release the routes while still in MmioRevoked, and then advance normally. The future-devicemmio row attaches a bounded manager-owned DeviceMmio record from the first decoded PCI memory BAR, proves MmioRevoked is blocked as devicemmio-attached, detaches while still in RevokingHandles, and then advances normally. The future-dmapool row attaches a bounded zero-live DMAPool record, proves DmaMappingsRemoved is blocked as dmapool-attached, detaches while still in Resetting, and then advances normally. The generic teardown-trigger summary reports no label-only rows and seven object-backed rows, while the route-aware interrupt handoff smoke also labels the claimed MSI-X route as bounded interrupt-waiter blocker evidence: interrupt_waiter_object=interrupt-route-record, interrupt_waiter_block_state=InterruptsDetached, interrupt_waiter_block_result=interrupts-attached, interrupt_waiter_detach_result=ok, and interrupt_waiter_route_generation_preserved=true. This bounded route-record evidence is contract proof for the shared ownership transaction only: it does not expose production userspace authority handles, real MMIO, real DMA, a userspace waiter, or production crash/reset observers. Separate first DeviceMmioCap, InterruptCap, DmaPoolCap, and DmaBufferCap release-hook proofs now exercise both the production ring CAP_OP_RELEASE dispatch path and a real Process::release_caps_for_exit() path for those cap objects, validating cap-table removal plus exact manager-owned DeviceMmio detach, manager-attached interrupt-route release, bounded zero-live DMAPool detach, or proof-owned DMA-buffer record cleanup. The generic route-record trigger rows and remaining DMA production work do not yet implement production observers, production interrupt-waiter objects, userspace DeviceMmio, production userspace DMAPool/DMABuffer authority, full device authority, or true pending hardware MSI/reset-hostile route teardown.

Implementation note, 2026-05-08 10:08 UTC: the first cap-specific reset/disable trigger entry points now exist for DeviceMmioCap and InterruptCap. trigger_reset_disable_for_devicemmio and trigger_reset_disable_for_interrupt route through the same idempotent stale-safe detach helpers as cap release and driver-crash cleanup, emit one cap-audit: ... event=reset-disable detach=ok line on the first successful trigger, and keep stale reruns silent. This is still bounded trigger plumbing: the reset/disable observer, non-proof DMA cleanup integration, userspace MMIO/interrupt operations, and IOMMU-backed remapping work remain future requirements.

Implementation note, 2026-05-08 10:39 UTC: the DMA caps now have the matching cap-specific reset/disable trigger plumbing. DmaPoolCap::on_reset_disable uses the same authoritative zero-live/quiesced/scrubbed evidence-gated detach as cap release and driver-crash cleanup. DmaBufferCap::on_reset_disable reuses the bounded FreeBuffer authority validation and page-scrub/frame-free cleanup path, then leaves the parent pool attached until staged zero-live cleanup. make run-net asserts the dmabuffer-reset-disable-hook and dmapool-reset-disable-hook proof lines, stale rerun noop, revoked cap validation, post-trigger release noop, and exact-one cap-audit: cap={dmabuffer,dmapool} event=reset-disable lines. This is still proof-owned no-real-DMA cleanup; production userspace DMAPool/DMABuffer authority and non-proof page lifecycle integration remain future work.

Budget or OOM failure is closed before the driver can observe a new handle, program a descriptor, map MMIO, attach an interrupt, or ring a doorbell. A failed submission must leave no live descriptor hold behind, or must leave an explicit in-flight record that teardown can drain or reset. A completed teardown must reconcile the ledger to zero live DMA buffers, zero live MMIO mappings, zero interrupt holds, and no in-flight descriptor submissions for the released device generation.

Implementation note, 2026-05-02 06:59 UTC, updated 2026-05-11 06:10 UTC: the current kernel-owned virtio-net ledger now proves the closed budget/OOM cases above with a scratch ledger and the live ledger validation described earlier. Imported live device-manager DMAPool records still preserve the device_dma:virtio-net source policy and prove imported live accounting stays within its aggregate in-flight budget while preserving that policy’s per-queue queue/submission depth limits. The manifest-granted manager-owned bounce-buffer DMAPool path now attaches its own device-manager budget policy to userspace DMAPool.allocateBuffer handle creation and the current fixed-slot DMAPool/DMABuffer transfer, release, pending-release, drop, rollback, teardown-detach, page-release, and descriptor-completion cleanup paths. The full eight-slot pool fails as dmapool-budget-exceeded / over-buffer-budget before allocation, cap minting, or ledger mutation, and the selected release paths revalidate current or next accounting before advancing manager-owned state. Production userspace DMAPool records must still attach budget checks to broader provider-driver transfer/revoke/reset transactions, IOMMU or direct-DMA mapping state, and non-fixed-slot allocation before this row can be treated as the complete userspace-driver transition gate.

Implementation note, 2026-05-02 08:33 UTC: the QEMU virtio-rng metadata path now runs a bounded DMAPool record lifecycle proof on the device-manager teardown state. The first slice keeps the record zero-live: it records a pool slot, pool generation, and owner generation, rejects stale and owner-mismatched attach attempts, rejects duplicate attachment, and proves that begin_revocation invalidates the user-visible pool handle by bumping the device owner generation. The ordered teardown path now fails closed with dmapool-attached if it tries to enter DmaMappingsRemoved while the pool record remains attached. The current continuation also proves that the revoke handle cannot detach the zero-live pool without scratch authoritative zero-live, quiesced, and scrubbed evidence bound to that record’s source; a mismatched scratch source is rejected before detach. With matching proof-scoped evidence, the record detaches after queues are quiesced/reset and before DmaMappingsRemoved. Later bounded manifest grants expose conservative DMAPool, DeviceMmio, and Interrupt surfaces; the current DMAPool grant can mint only eight fixed manager-attached proof DMABuffer result caps. The remaining gap is production userspace authority, allocation beyond those eight fixed slots, real device-visible page allocation through the device manager, non-proof DMA page lifecycle integration, IOMMU remapping, and the S.11.2 hostile smoke matrix.

Current QEMU evidence: the QEMU virtio-net path now adds the corresponding imported live-accounting prerequisite proof. A device-manager DMAPool record is attached with accounting derived from the live device_dma ledger: live buffer/page count, live bytes, current in-flight submissions, committed/resident/unswappable residency flags, and scrub-before-release policy. DmaMappingsRemoved fails closed with dmapool-attached while the record remains attached, direct teardown detach fails closed with dmapool-live while the authoritative ledger remains live, and the live proof consumes the device_dma teardown-evidence API, observes authoritative-ledger-live with matching imported live accounting, and explicitly defers completion with no real DMA teardown attempted. The same proof path now validates the imported DMAPool record through capos-lib::device_authority for the active handle and stale-after-revoke failure labels. This does not create production userspace handles, real page-release hooks, IOMMU mapping invalidation, scrubbed release, terminal Dead, or hostile-smoke coverage for the live virtio-net record. The companion scratch-ledger proof covers the positive zero-live teardown-evidence result without claiming that the live virtio-net record has been torn down. The manager-owned zero-live lifecycle proof consumes matching-source device-manager teardown evidence for the positive detach/DmaMappingsRemoved path and separately proves mismatched-source and missing-evidence detach attempts fail closed. The manifest-granted bounded DMAPool path now keeps mapped userspace VMA count, in-flight descriptor holds, residency, quiesce/scrub state, and release eligibility in that manager record. Borrowed or device-visible pages remain committed, resident, unswappable, generation-bound, and unavailable for reuse until the manager record is zero-live, unmapped, quiesced, and scrubbed. Descriptor submission is refused while a buffer is borrowed to userspace, and release consumes manager-owned teardown evidence instead of proof-only device_dma zero-live evidence. The corresponding DMAPool.info ABI reports mapped VMA count, quiesce state, scrub state, and release eligibility for QEMU proof assertions. This is still bounded bounce-buffer lifecycle authority only: direct DMA, host physical or IOVA exposure, IOMMU/remapping, production provider-driver consumption, durable audit, and broader transfer/revoke policy remain future work.

Gate itemRequired stateMust-have proof
S.11.2.0 DMA-owned buffersDMAPool owns every driver-visible DMA mapping.A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses.
S.11.2.1 Bound checksAllocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission.Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts.
S.11.2.2 Explicit remap/ownershipDeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced.Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke.
S.11.2.3 Interrupt correctnessInterrupt owns exactly one logical source at a time and drains/waits only for that source.Reassigning an owner invalidates old waiters and masks or detaches the source first.
S.11.2.4 Quiesce + reset contractDevice manager can force reset/disable on failed revoke or teardown.No in-flight descriptor may continue touching freed buffers after driver removal.
S.11.2.5 Process lifecycleCapability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership.Crash-path teardown removes holds and invalidates user-visible handles before page free.
S.11.2.6 Isolation and accountingSecurity Verification Track S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges.A malicious or buggy driver cannot consume more than its allocated authority budget.
S.11.2.7 Stale IRQ orderingStale interrupt delivery after revoke cannot wake, acknowledge, or signal a new owner.Interrupt generation mismatch is ignored, or the source is masked/detached/reset before reassignment. Hostile smoke revokes a driver while an interrupt is pending, reassigns the source, and proves the old waiter cannot wake against the new owner. Closed 2026-05-05 18:17 UTC by make run-net’s device-manager: interrupt handoff proof line: real INT $vector injection across revoke, detach, and reset/reuse exercises the production IDT entry/handler/EOI path, asserts s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, and s11_2_7_stale_ack_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Userspace Interrupt waiter objects remain a future requirement for a full production-driver path.
S.11.2.8 Stale DMA completion orderingStale DMA completion after revoke cannot cause freed buffer reuse, stale CQ notification, or new-owner memory exposure.Closed 2026-05-05 19:37 UTC by make run-net’s device-manager: dma completion handoff proof line: real virtio-net DMA page free + reallocate cycle bumps the live page generation, then the production device_dma::record_virtio_net_completion_for_allocation path (the same function the live Virtqueue::record_used_completion_for_allocation invokes) is fed a stale DeviceDmaAllocation keyed to the live phys with a decremented generation, at three boundaries (after revoke, after detach, after reset/reuse). All three reject as stale-dma-handle with side-effect-blocked, queue accounting unchanged, live new-owner page preserved, no CQ publication, no new-owner exposure, and the freed-buffer slot remaining unchanged. The closure summary asserts s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, and s11_2_8_accounting_underflow_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Prior acceptance text: in-flight DMA is accounted for, or device reset/disable completes before buffer reuse; hostile smoke covers revoke/reset with outstanding descriptors and proves no old completion can publish new-owner memory. S.11.2.9 hostile-smoke gate-wiring also closed 2026-05-05 20:49 UTC (see the row below). Userspace DMAPool handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups.
S.11.2.9 Hostile-smoke coverageQEMU/CI smokes cover stale handles, descriptor abuse, revoke races, stale IRQ after reset, stale DMA completion after reset, and exit-under-dma.Smoke output has explicit closed-case proof lines for each above failure mode. Closed 2026-05-05 20:49 UTC by aggregating the existing per-row proof lines into the make run-net -> tools/qemu-net-smoke.sh gate. Every matrix-row proof line has at least one assertion in the harness; the original two driver-crash assertions, the existing S.11.2.8 device-manager: dma completion handoff proof closure-summary assertion, and the S.11.2.7 device-manager: interrupt handoff proof closure-summary assertion (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use the anchored extended-regex shape (field-by-field match plus proof_result=ok[[:cntrl:]]?$ trailing anchor), and the other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string grep -Fq assertions. A 2026-05-08 09:44 UTC follow-up adds anchored assertions for the cap-specific dmabuffer-driver-crash-hook and dmapool-driver-crash-hook proof lines; a 2026-05-08 10:08 UTC follow-up adds anchored assertions and exact-one audit counts for the first cap-specific devicemmio-reset-disable-hook and interrupt-reset-disable-hook proof lines; a 2026-05-08 10:39 UTC follow-up does the same for dmabuffer-reset-disable-hook and dmapool-reset-disable-hook; a 2026-05-08 13:42 UTC follow-up (aeef8b41) adds the cap-specific device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt assertion plus an exact-one cap-audit: cap=interrupt event=interrupt-waiter count. Per-row coverage: stale DMA handle (device-dma: stale dma handle proof, device-dma: live stale dma completion accounting proof); descriptor abuse (virtio-net: software descriptor generation model proof, virtio-net: invalid used descriptor id software-token proof, virtio-net: descriptor generation guard proof ok, virtio-net: invalid used descriptor id live software-token proof ok, plus device-dma: budget oom proof); revoke/reset race (device-manager: ownership proof, the seven device-manager: teardown trigger proof trigger=... variants plus the final aggregate, device-manager: dma completion handoff proof for S.11.2.8, device-manager: interrupt handoff proof for S.11.2.7, the device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio, device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt, device-manager: dmabuffer driver crash hook proof source=dmabuffer-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmabuffer, device-manager: dmapool driver crash hook proof source=dmapool-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmapool, device-manager: devicemmio reset disable hook proof source=devicemmio-reset-disable-hook ... trigger_path=trigger-reset-disable-for-devicemmio, device-manager: interrupt reset disable hook proof source=interrupt-reset-disable-hook ... trigger_path=trigger-reset-disable-for-interrupt, device-manager: dmabuffer reset disable hook proof source=dmabuffer-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmabuffer, device-manager: dmapool reset disable hook proof source=dmapool-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmapool, and device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt lines, all requiring first-trigger ok, stale rerun noop, cap validate_live=revoked, post-trigger release noop, and proof_result=ok with cap-specific cleanup/evidence labels); stale IRQ after reset (S.11.2.7 closure summary, see row above); stale DMA completion after reset (S.11.2.8 closure summary, see row above); exit-under-DMA (device-manager: teardown trigger proof trigger=process-exit owner=virtio-rng, the teardown-trigger aggregate triggers=cap-release,process-exit,driver-crash,reset-disable,interrupt-waiter,future-devicemmio,future-dmapool line, the four cap-release-hook proofs each containing process_exit_path=process-release-caps-for-exit, plus hardware-cap-release: ... reason=process-exit count assertions). A 2026-05-23 21:34 UTC follow-up adds the IOMMU production DMAPool hostile proof over the active mapped ledger, covering stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, cross-domain stale-handle attempts, and the fail-closed teardown branch proof; process-exit/exit-under-DMA remains the existing run-net bounce-buffer evidence. Production userspace DeviceMmio/Interrupt handles, broader non-proof device-manager page quiesce/scrub/release hooks outside the selected IOMMU smoke, hardware-backed provider-driver Interrupt wait/ack dispatch beyond the bounded route-dispatch waiter proof, and durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap remain open as separate follow-ups.

For each row, the transition requires an owner, implementation notes, and a CI-backed verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.

Hostile-Smoke Acceptance Matrix

These smokes are the acceptance requirements for the userspace driver transition. The S.11.2.7, S.11.2.8, and S.11.2.9 rows are now backed by current make run-net QEMU evidence enforced by tools/qemu-net-smoke.sh (see the per-row “Closed” notes for closure timestamps and the proof-line shapes). The other matrix rows remain acceptance requirements for future implementation work; their proof lines are emitted by the kernel today and asserted by the same harness, but the production userspace handles, real device-manager page quiesce/scrub/release hooks, real userspace Interrupt waiter objects, IOMMU domain programming, and durable/signed production audit consumption beyond the volatile HardwareAuditLog.snapshot cap that complete each row’s full closure remain open as separate follow-ups.

Hostile caseRequired setupClosed-case proof expectation
Stale DMA handleAllocate a DMA buffer, revoke or free it, advance the slot or pool generation, then attempt descriptor submission or buffer reuse through the old handle.The operation fails closed on generation mismatch; no descriptor is made visible to the device, no DMA byte or buffer hold is restored, and any reused slot remains owned only by the new generation.
Descriptor abuseSubmit chains with out-of-pool addresses, stale or freed buffer slots, arithmetic wrap, misalignment, overlong segments, excessive chain length, or ring-depth overflow.Validation rejects the chain before any doorbell write; the ledger shows no leaked descriptor hold, no in-flight increment without an owning buffer, and no access outside the pool range.
Revoke/reset raceRace revoke, reset, or process teardown against a driver that is submitting descriptors or ringing the device doorbell.Revocation first invalidates handles and MMIO write authority; later submissions fail closed, existing in-flight records are either completed under the old generation or reset/disabled before page reuse, and teardown cannot skip to DmaMappingsRemoved while the ledger has live submissions.
Stale IRQ after resetLeave an interrupt pending or a waiter blocked, reset or reassign the device/source, then deliver or acknowledge using the old generation.The old waiter cannot wake against the new owner, stale acknowledgements do not affect the reassigned source, and the source is masked, detached, or generation-invalidated before reassignment. Closed 2026-05-05 18:17 UTC: make run-net injects a real INT $vector through the IDT/handler/EOI path at three points across revoke, detach, and reset/reuse and records s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, s11_2_7_stale_ack_blocked=true, plus matching real_irq_inject_after_revoke_result=masked, real_irq_inject_after_detach_result=unregistered, real_irq_inject_after_reset_reuse_result=masked on the kernel proof line.
Stale DMA completion after resetReset with outstanding descriptors, reuse or prepare to reuse pool slots, then inject or observe a completion from the old device generation.The stale completion cannot publish a CQE to a new owner, cannot expose new-owner memory, cannot underflow accounting, and cannot make a freed buffer eligible for reuse unless reset/disable has proven old DMA stopped. Closed 2026-05-05 19:37 UTC: make run-net walks a fresh device-manager record on the virtio-net BDF through the Active>RevokingHandles>MmioRevoked>InterruptsDetached>QueuesQuiesced>Resetting>DmaMappingsRemoved>Dead revocation path, exercises a real virtio-net DMA page free + reallocate cycle at three boundaries (after revoke, after detach, after reset/reuse), and feeds a synthesized stale DeviceDmaAllocation (live phys, decremented generation) to the production device_dma::record_virtio_net_completion_for_allocation path. Each boundary records real_completion_inject_after_*_result=stale-dma-handle, _side_effect=side-effect-blocked, _queue_account_preserved=true, _live_page_preserved=true, _cq_publication_blocked=true, _new_owner_exposure_blocked=true, _freed_buffer_unchanged=true, and _generation_bumped=true, plus a closure summary s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, s11_2_8_accounting_underflow_blocked=true.
Exit-under-DMATerminate or crash a driver process while it holds DMA buffers, MMIO mappings, interrupt waiters, and in-flight descriptors.Process exit enters the device-manager teardown path, invalidates all user-visible handles, revokes MMIO, detaches interrupts, quiesces or resets queues, scrubs DMA pages before release, and reports a terminal ledger with no live holds for the old owner generation.

Security Verification Track S.11.2 Decision Record

Security Verification Track S.11.2 is backend-scoped. The current brokered-bounce userspace-provider path has enough reviewed evidence to close the retained DDF production-authority finding, but that closeout is not a general direct-DMA, hostile-hardware, or device-autonomous interrupt claim.

Current status: the brokered-bounce transition path is represented by done task evidence for DMAPool, DeviceMmio, and Interrupt lifecycle ownership, provider virtio-net/NVMe chains, and hardware-audit consumption of abort-held DMA mappings. The broader S.11.2 matrix remains the canonical gate for future direct-remapping/vIOMMU, trusted-sharing-group, hostile-hardware-isolation, or provider-written-address work. This document fixes the production handle epoch invariants, DMAPool ledger of record, and hostile-smoke acceptance criteria used by the completed Device Driver Foundation documentation gate. The current QEMU virtio-net path has a kernel-owned DMA pool ledger for page, descriptor, MMIO mapping, and interrupt-hold accounting proof coverage plus static IOMMU attachment-policy reporting for retained DMA-capable PCI functions and the bounded teardown trigger contract proof, bounded kernel-owned budget/OOM proof, manager-bound DMAPool budget-profile proof plus bounded budget-policy tamper and accounting-over-budget fail-closed proofs, bounded manager-owned DeviceMmio proof adapter bound to decoded PCI memory-BAR metadata plus future cache/write-policy metadata, bounded zero-live device-manager DMAPool record lifecycle proof, and imported live-accounting block/defer proof plus zero-live teardown-evidence scratch proof, stale DMA handle scratch proof, stale DMA completion scratch proof, paired scratch CQ-publication/new-owner-exposure proof, live software descriptor-generation guard proof, bounded invalid used-descriptor-id proof, and bounded stale IRQ after-detach, counter-backed after-revoke, counter-backed route-registry reset-reuse, and pending IRQ token checks described above. The bounded pure capos-lib::device_authority validator and host tests cover the documented identity, state, side-effect-blocking, non-wrapping epoch cases, and every current operation variant’s exact blocked side-effect label for stale owner/subrecord, freed, revoked, and retired failures. The zero-live device-manager DMAPool lifecycle proof now validates a proof-scoped tampered budget-policy record through the manager policy helper and records fail-closed, no fake allocation, no ledger mutation, no teardown advancement, and side-effect blocking while preserving the positive budget_policy_result=ok path. The positive zero-live and imported-live budget-accounting labels now go through the manager-owned active-record helper, and synthetic over-budget attached-accounting candidates fail closed with exact reasons while preserving the active manager record and blocking allocation, ledger, teardown, and side effects; an over-budget attach candidate fails before pool generation allocation. It also records a bounded manager-attached DMA buffer handle under the attached pool, validates active SubmitDescriptor and manager-record CompleteDescriptor through the pure DMA-buffer validator, and records stale-after-revoke, freed-buffer, and reused-slot rejection with exact reasons and side-effect-blocked; it now also blocks pool teardown as dmapool-buffer-attached, rejects a stale same-slot proof-scoped FreeBuffer as dmabuffer-stale-handle with stale-slot-generation and side-effect-blocked, rejects wrong-owner-generation, wrong-pool, wrong-pool generation, and wrong-buffer-slot FreeBuffer attempts with exact pure validator reasons and side-effect-blocked, preserves that manager-owned buffer record after each failed free, and clears the record only after a proof-scoped active FreeBuffer validation, proof-page scrub/free, and manager-owned buffer-record detach. The completion proof does not publish a CQ entry, complete a real descriptor, grant userspace authority, or clean up or reuse production userspace DMA pages. The live virtio-net queue-completion path now gates completion accounting on the completed descriptor’s DeviceDmaAllocation rather than the queue id alone: callers must validate the used descriptor id, recover the matching DmaPage, and pass its physical address, queue, label, and generation to the kernel-owned ledger before in-flight accounting is decremented. The paired run-net proof records that a stale generation for a live kernel-owned page fails as stale-dma-handle, leaves queue accounting and the live page unchanged, and blocks CQ publication plus new-owner exposure. This closes a live accounting prerequisite only; it does not inject a real post-reset device completion or expose userspace DMA authority. The live virtio-net used-ring path also carries bounded software descriptor generations: submissions reject invalid or already-active descriptor ids before accounting, completions must consume the active software token exactly once, and the run-net proof records side-effect blocking for active reuse, double completion, and an old software token after descriptor-id reuse. That guard does not make a stale hardware used-ring id distinguishable after deliberate id reuse because virtio used entries carry no device generation. The same gate now also covers invalid used-descriptor ids without corrupting the hardware ring: an out-of-range id fails as descriptor-id-out-of-range before completion observation, completion accounting, used_seen_idx, CQ publication, or new-owner exposure can change. This is still a software-token and constructed-token prerequisite, not a real malformed-device or post-reset completion injection. The same zero-live proof now also constructs the result-only DMAPool.info cap skeleton from the manager-issued DmaPoolHandle, validates the active manager record before returning conservative status labels plus numeric device/BDF/owner/pool identity fields, proves the serialized cap call path decodes to those labels and identity fields with host physical exposure off and direct DMA blocked, and proves the cap’s info path fails closed as dmapool-stale-handle after revoke begins. It also exercises DMAPool.allocateBuffer through call_with_table() on a real cap-table entry, returns zero-indexed DMABuffer result caps for eight fixed manager-owned bounce-buffer slots, validates those result caps’ DMABuffer.info, and proves a ninth allocation fails through the manager-owned budget policy as dmapool-budget-exceeded / over-buffer-budget before publishing another result cap or corrupting live slot state; full-pool allocation also preserves manager generation counters. Stale-after-revoke allocations still fail closed without publishing another result cap. The same zero-live proof constructs the result-only DMABuffer.info cap skeleton from the manager-attached DmaBufferHandle, validates the active manager-owned buffer record through the pure DMA-buffer validator before returning conservative no-authority labels plus numeric device/BDF, owner/pool/slot identity fields, proves the serialized cap call path decodes to those labels and identity fields with host physical exposure off and direct DMA blocked, and proves the cap’s info path fails closed as dmabuffer-stale-handle after revoke begins; the same stale cap’s serialized method-0 path fails as invoke-failed. The first DmaBufferCap release hook now reuses the bounded FreeBuffer validation shape to clear only the manager-attached proof_buffer record during cap-table removal, production ring CAP_OP_RELEASE, and real Process::release_caps_for_exit() paths. It proves stale same-slot release is side-effect-blocked, proves the parent DMAPool remains attached after buffer release, proves the bounded manifest grant can allocate the slot again after explicit freeBuffer with a fresh slot generation, and still requires staged zero-live evidence before the parent pool can detach. The selected provider-TX path now adds a bounded exception to the default manager-accounting descriptor contract: queue 1 submits may publish the selected eight-entry TX queue depth, descriptors 0..7, into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor, and then complete each descriptor only after DMABuffer.completeDescriptor observes the matching used-ring entry for the stored software descriptor generation. Those handoffs clear the matching manager in-flight records, record bounded provider CQ completion and acknowledgement counts, and can deliver ordered bounded completion events to live tx_interrupt.wait calls for the same selected route. The selected provider-TX path also proves a teardown-only drain when one descriptor has completed and seven provider-published descriptors remain incomplete: direct DMABuffer.freeBuffer remains blocked while in flight, release explicitly drains only the incomplete matching used-ring entries and retires those allocation-backed TX DMA queue ledgers without DMABuffer.completeDescriptor results, no provider CQ/IRQ event is published for the quiesced descriptors, release retires seven delivered-but-unacked completion events, and later slot reuse requires a fresh generation plus normal completion. Wrong-queue, stale-buffer, stale-notify, inflight-publication, wrong-descriptor, duplicate-completion, and stale-tx_interrupt issue paths remain side-effect-blocked before their guarded effects. This does not grant direct DMA, arbitrary doorbells, arbitrary CQ ownership outside the selected TX route, full virtio-net ownership, production NIC/storage migration, IOMMU programming, hardware IRQ ownership, hardware acknowledgement, or broad interrupt ownership beyond the bounded selected TX MSI-X mask/unmask proof. The bounded DeviceMmio proof also records the manager-attached policy metadata listed above, fails closed on a tampered cache/write-policy record before creating any mapping, and validates active hostile handle identities for wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong BDF/device with exact pure-validator reasons while preserving the attached record and blocking mapping/doorbell side effects. Its serialized cap call path also decodes to the direct DeviceMmio.info no-authority labels plus numeric device/BDF, owner, BAR, mapping id, and mapping generation identity fields with host physical exposure off and direct MMIO blocked, and its stale serialized method-0 path fails as invoke-failed. The DMAPool.info skeleton has the same kernel-side serialized stale failure evidence. The interrupt handoff proof now also constructs a result-only Interrupt.info cap skeleton from the manager-issued device handle and attached route record, records active info success, proves the serialized cap call path decodes to the direct no-authority labels plus numeric device/BDF, owner, source, source generation, and route generation identity fields, proves those source and route generations are distinct in the bounded route record, and proves stale-after-revoke info fails closed as interrupt-stale-handle plus stale serialized method-0 failure as invoke-failed before any acknowledgement, mask, unmask, blocking wait, or delivery authority exists. The manifest-granted skeleton now also exposes an admission-only Interrupt.wait method that returns the pending-token validator’s fail-closed labels without waking a waiter or changing delivery counts, and an admission-only Interrupt.acknowledge method that validates the active route while blocking hardware acknowledgement and preserving delivery counts. It also exposes route-state-control Interrupt.mask and Interrupt.unmask methods that validate the active route before changing the manager-attached dispatch slot between claimed-masked and driver-unmasked, while preserving delivery counts. A bounded Interrupt.wait call observed after unmask installs a fixed-table userspace waiter object for the current manager-granted route; the existing route-dispatch delivery counter can now complete that waiter as interrupt-delivered / waiter-completed-irq with real_interrupt_delivery=delivered and an advanced delivery count. The same focused smoke then submits a second unmasked wait, observes it remains pending, calls Interrupt.mask, and finishes that wait as interrupt-waiter-cancelled / route-masked / waiter-completed-no-irq with wake_blocked=false, preserved source/route generations, and unchanged delivery counts. The selected provider TX tx_interrupt cap can now observe the bounded used-ring completion event described above and account the already observed selected TX dispatch token paired with that delivered provider CQ event, but hardware MSI/MSI-X programming beyond the selected vector-control proof, full hardware IRQ ownership, deferred EOI, LAPIC/MSI-X acknowledgement, and broader production interrupt dispatch remain blocked. Provider TX MSI-X mask/unmask is limited to the selected-route vector-control proof described earlier. Provider RX MSI-X mask/unmask remains bounded to the selected RX route as well; release while masked restores that selected vector-control bit and route state before clearing the live issue gate. RX unmask admits the route transition before exposing the MSI-X vector-control bit, and the focused QEMU proof shows a failed route unmask leaves the selected vector masked with the route ledger preserved. Cleanup failure still leaves the issue uncleared so future RX cap issuance stays blocked on uncertain route state. RX wait/ack is now bounded to one selected-route zero-CQ dispatch token; RX descriptors and CQ ownership remain blocked. This is manager-record skeleton/no-production-DMA, no-real-MMIO-mapping, and bounded route-dispatch interrupt-waiter prerequisite evidence only. Production DMAPool, DeviceMmio, and Interrupt capability handles, production userspace DMAPool buffer handles, real DeviceMmio BAR mapping objects, real cache attributes/write policy enforcement, production kernel device-path wiring beyond the current proof adapters, real device-manager page quiesce/scrub/release hooks and real page cleanup/reuse beyond the bounded kernel-owned proof pages, production handle-attached budget/OOM enforcement beyond the current manager-owned DMAPool.allocateBuffer budget slice, IOMMU remapping domains, production handle-attached host tests, QEMU stale-handle smokes, broader userspace exposure, production NIC/storage migration, cloud readiness, and S.11.2 hostile smokes remain open.

Do not weaken the short-term virtio-net bounce-buffer path until DMAPool, DeviceMmio, Interrupt, device-manager ownership transactions, lifecycle teardown, accounting, and hostile smokes all exist.

Design Risks and Open Questions Register

Consolidated index of known design risks and open architectural questions for capOS. Every entry routes to the file that owns the long-form design or the remediation backlog for that risk; this register itself is a pointer document, not a place to put new design.

Use this document to answer “is this risk already tracked, and where?” without re-deriving the state from the proposal tree on each review.

Last refresh: 2026-06-07 08:02 UTC.

How To Use

  • Each design-risk row records the current observable state (what the code and docs say today), the owning tracker (the proposal/backlog/design file to update when the state changes), and the remaining gap (what is still open).
  • Each open-question row records a current answer if one exists in the tree, plus a pointer to the canonical tracker. Questions that are genuinely unanswered are marked Open; those should not be closed by guessing here – update the relevant proposal, then update this register.
  • When a risk is closed by code or by an explicit design decision, move the short closure summary into docs/changelog.md and remove the row.
  • New review findings go into task records under docs/tasks/; this register is about long-horizon design risks, not concrete unresolved review issues.

Design Risks

R1 – Process-wide ring vs multi-threaded userspace and full SMP

  • State. The capability ring is one per process. capos-rt enforces a single-owner RuntimeRingClient. After in-process threading, at most one process-ring waiter is allowed. The first SMP Phase C AP scheduler-owner proof deliberately keeps process-wide ring execution on a single CPU at a time behind a scheduler-owner latch.
  • Owner. docs/proposals/ring-v2-smp-proposal.md, docs/research/completion-ring-threading.md, docs/backlog/smp-phase-c.md, docs/architecture/threading.md.
  • Gap. Per-thread capability rings, per-thread completion routing, and the Multi-Process / In-Process Threading Scalability milestones in docs/roadmap.md remain future work. Userspace threading scales only as far as the single ring waiter allows.

R2 – “Interface IS the permission” pushes safety into wrapper TCB

  • State. capOS deliberately has no parallel rights bitmask: attenuation is done by handing out a narrower CapObject wrapper, not a flag-reduced copy of the same cap. Wrapper correctness is therefore part of the trust base.
  • Owner. docs/capability-model.md, docs/proposals/session-bound-invocation-context-proposal.md, docs/security/trust-boundaries.md, docs/backlog/stage-6-capability-semantics.md.
  • Gap. The completed Session-Bound Invocation Context migration has the one-session-per-process proof, privacy-preserving endpoint caller-session metadata, explicit subject-disclosure coverage, chat session-keyed state, Adventure service grants, terminal/stdio bridge liveness guards, and final Gate 4 verification. The first Tier-1 paper claim, covering session-bound invocation context evidence for implementation review, is closed. Remaining non-gating cleanup is stable service-audit identity across service replacement and legacy internal receiver-selector naming.

R3 – Legacy endpoint metadata as transitional service identity

  • State. Legacy endpoint receiver metadata is contained as internal transport/debug state for normal paths. Chat uses session-keyed membership, terminal/stdio bridges enforce live caller-session guards, and delegated relabeling containment plus the historical service-object routing/lifecycle proof have landed. Adventure/shared-service cleanup is landed for normal workload paths.
  • Owner. docs/proposals/session-bound-invocation-context-proposal.md, docs/backlog/stage-6-capability-semantics.md.
  • Gap. Finish final legacy cleanup. Receiver metadata must remain internal transport state or hostile-test fixture, not subject identity or disclosure.

R4 – Resource accounting is fragmented

  • State. Per-process memory, cap-table, and thread quotas exist; ResourceProfile, session quotas, scheduling-context donation, and cross-service donation/fairness are still proposal-shaped.
  • Owner. docs/proposals/resource-accounting-proposal.md, docs/proposals/memory-authority-model-proposal.md, docs/proposals/oom-and-swap-proposal.md, docs/proposals/user-identity-and-policy-proposal.md, docs/proposals/system-monitoring-proposal.md, docs/proposals/scheduler-evolution-proposal.md, docs/backlog/scheduler-evolution.md.
  • Gap. Phase D WFQ has landed; Phase E SchedulingContext bind/revoke, budget, donation/return, and depletion notification are closed at the scheduler-cap layer, but cross-service donation semantics, per-service fairness beyond thread weights, log volume accounting, memory authority/residency proof obligations, unified resource bundles for guest/anonymous/external/service principals, and the scratch-bytes / outstanding-calls / endpoint-queue / in-flight-call quota fields tracked in review-finding task records remain open.

R5 – Copy-transfer SQE replay is repeatable by design

  • State. docs/authority-accounting-transfer-design.md documents that userspace replay of a copy-transfer SQE is repeatable per dispatch attempt, with move-transfer replay failing closed once the source slot is removed/reserved. Exactly-once replay suppression is explicitly future work (security invariant T3).
  • Owner. docs/authority-accounting-transfer-design.md, docs/proposals/security-and-verification-proposal.md.
  • Gap. The (sender_pid, call_id, sqe_seq) plus monotonic transfer-epoch identity needed for exactly-once replay across dispatch attempts is not implemented. Each transferable interface must continue to acknowledge this in its threat model.

R6 – CAP_OP_RELEASE is deferred / queued, not synchronous

  • State. Owned-handle drop in capos-rt queues one local CAP_OP_RELEASE on the ring; process exit performs fallback cleanup. Release does not run before the next ring flush (cap_enter or process exit).
  • Owner. docs/authority-accounting-transfer-design.md, docs/proposals/error-handling-proposal.md, docs/capability-model.md.
  • Gap. Resource-pressure or revocation-sensitive flows must not assume a Drop call has already taken effect at the kernel layer. Time-critical revocation should use CapabilityManager.revoke or epoch revocation rather than relying on Drop.

R7 – Shared memory / zero-copy / shared park are incomplete

  • State. MemoryObject substrate exists; SharedBuffer provenance, file/network/DMA zero-copy paths, and shared park/SharedParkSpace are blocked on mapping provenance / object pinning work.
  • Owner. docs/proposals/storage-and-naming-proposal.md, docs/proposals/memory-authority-model-proposal.md, docs/proposals/networking-proposal.md, docs/architecture/park.md, docs/backlog/runtime-network-shell.md.
  • Gap. Workloads that need true zero-copy IPC, storage, or network pipelines pay a copy/serialization cost until provenance/pinning lands. ParkSpace private cleanup now covers anonymous VirtualMemory.unmap, VirtualMemory.decommit, and explicit MemoryObject.unmap for borrowed mappings; shared park keys and address-space generation cleanup remain open.

R8 – Networking lives inside the kernel TCB

  • State. Largely resolved: the Phase C userspace NIC driver and smoltcp network-stack process own the production socket path, the kernel no longer depends on smoltcp, and the kernel socket CapObjects are qemu-only fixtures that fail closed without a kernel socket owner. The Telnet and SSH terminal-host proofs that sat on the kernel path are retired.
  • Owner. docs/proposals/networking-proposal.md, docs/dma-isolation-design.md, docs/backlog/runtime-network-shell.md.
  • Gap. The remaining qemu-only kernel virtio-net fixture and socket CapObject surface is fixture code, not production authority. The kernel-side SocketTerminalSession transitional shim is retired (2026-06-10): TcpSocket.intoTerminalSession fails closed, and a network-backed TerminalSession must be re-built as a userspace terminal-session service over the userspace TCP stack if byte-stream terminal transport is needed again.

R9 – DMA isolation is backend-scoped, not a hostile-hardware blanket

  • State. docs/dma-isolation-design.md now records runtime fail-closed DMA backend selection. The current no-IOMMU cloud/DDF path uses manager-owned, brokered bounce buffers for userspace provider authority and hides host physical addresses and IOVAs from the driver. The selected QEMU Intel VT-d path has bounded per-device remapping evidence, but that remains emulator evidence rather than a general hardware-isolation claim. Without trusted remapping, hostile bus-mastering hardware remains out of scope.
  • Owner. docs/dma-isolation-design.md, docs/proposals/networking-proposal.md, docs/proposals/cloud-deployment-proposal.md, docs/backlog/hardware-boot-storage.md.
  • Gap. The retained DDF production-authority finding is closed in docs/tasks/done/2026-06-07/ddf-production-authority-closeout.md. Remaining work is explicit task or proposal scope: direct-remapping/vIOMMU production hardware support, broader provider/device variants, and device-autonomous MSI-X delivery rather than the current polled or kernel-injected waiter proofs.

R10 – Boot package model embeds all binaries

  • State. tools/mkmanifest embeds every declared binary as a NamedBlob inside manifest.bin. The kernel loads only init; everything else is fetched by init from the in-memory BootPackage.
  • Owner. docs/backlog/hardware-boot-storage.md, docs/proposals/storage-and-naming-proposal.md, docs/trusted-build-inputs.md.
  • Gap. Boot binary ISO layout (separate ELF payloads), package/storage update model, and persistent storage-backed delivery are not yet designed as code; the current scheme is an explicit prototype compromise.

R11 – Pre-auth and post-auth share a shell process

  • State. The shell-led boot flow folds console-login into capos-shell and uses an anonymous-first session that escalates via login/setup. The pre-auth and post-auth code paths run in one userspace process and address space.
  • Owner. docs/proposals/boot-to-shell-proposal.md, docs/proposals/shell-proposal.md, docs/security/trust-boundaries.md, docs/proposals/user-identity-and-policy-proposal.md.
  • Gap. Separation depends on shell/auth implementation quality, not on a process boundary. The future direction (separate login service with minimal authority, restricted launchers, WebShell/SshGateway) is proposal-shaped. Remote and non-loopback shells must remain blocked until pre-auth and post-auth authority are process-isolated or a shared-process proof is accepted.

R16 – Remote shell ingress is demo/prototype only

  • State. Telnet is a plaintext loopback-only QEMU demo. SSH has SSH-shaped prerequisites, fixture authentication proofs, dev key material, policy classification, and restricted-shell launcher coverage, but no production encrypted SSH transport, durable key/account storage, full OpenSSH-compatible userauth/channel handling, channel binding, or complete audit/storage gates.
  • Owner. docs/proposals/ssh-shell-proposal.md, docs/proposals/telnet-tls-shell-proposal.md, docs/backlog/runtime-network-shell.md, docs/tasks/README.md, docs/build-run-test.md.
  • Gap. Production/non-loopback shell exposure is blocked on SSH transport, key, account, audit, storage, session-bound delegation, and pre-auth/post-auth isolation gates.

R17 – Remote-session UI bridge and Tauri wrapper are research-only

  • State. The Linux remote-session-ui bridge and the repo-local Tauri wrapper run as trusted local backends that hold the upstream capOS session and project view models / call results to the browser/webview. A policy preflight now proves the wrapper remains check/dev only; distributable packaging and desktop automation modes are intentionally blocked.
  • Owner. docs/proposals/remote-session-ui-security-proposal.md, docs/proposals/remote-session-capset-client-proposal.md, docs/backlog/remote-session-capset-client.md.
  • Gap. Distributable packaging, desktop automation, and a reviewed production posture for the remote-session UI surface remain unreviewed in the relevant remote-session proposal/backlog task records. Non-loopback remote-session UI exposure must stay blocked until that posture is accepted.

R12 – Verification coverage is partial, not full proof

  • State. Bounded Kani gate (make kani-lib/make kani-lib-full), Loom ring model, Miri lib tests, proptest, fuzz harnesses, panic-surface inventory, and CI dependency policy exist. Coverage is not whole-system and not seL4-style functional refinement.
  • Owner. docs/proposals/security-and-verification-proposal.md, docs/security/verification-workflow.md, docs/panic-surface-inventory.md, docs/backlog/security-verification.md.
  • Gap. Public/external claims must distinguish “bounded model checked” from “fully verified”. Promote new properties into Kani/Loom only when the invariant is concrete and bounded. IPC/scheduler panic-surface hardening also remains open around guarded unwraps, rollback restoration, stale queues, blocking waits, process/thread exit, endpoint cancellation, TLB shootdown send failures, and scheduler hot-path expects. Kernel upper-half page-table mutation after AP startup is closed for the current MMIO/firmware helper path by docs/tasks/done/2026-06-07/kernel-upper-half-pml4-propagation-hardening.md; future helper windows or allocator-growth paths that need a new kernel-half PML4 slot still require boot preseed or synchronized live-root propagation.

R13 – Trusted build inputs are partly pinned

  • State. Limine (commit + artifact SHA-256), capnp 1.2.0 source tarball, CUE 0.16.0, mdBook/mdbook-mermaid, Typst 0.14.2, Cargo lockfiles, the Rust nightly date policy, the Kani toolchain bundle, OVMF firmware hash, and the CI apt package versions for qemu-system-x86, xorriso, make, git, and ovmf are pinned or policy-pinned. make build-provenance records local runner identity, GitHub-hosted image identity when present, selected host-tool paths, package identities and normalized apt source pockets when discoverable, and OVMF path/package/hash or absence. CI pull requests run a blocking environment provenance comparison against the latest successful main-branch qemu-smoke provenance artifact.
  • Owner. docs/trusted-build-inputs.md, docs/proposals/cloud-deployment-proposal.md.
  • Gap. The PR-blocking environment comparison and qemu-smoke package pins close the previous make/git identity and advisory-compare gap for CI proof branches, but ubuntu-24.04 is still a GitHub-managed mutable runner label, not an immutable production image digest. Full production reproducibility still needs a self-built runner image referenced by digest, repo-managed download-and-verify tool digests for the apt-pinned build/boot tools, or both.

R14 – User identity / policy is proposal-shaped

  • State. Anonymous/operator sessions, password setup/login, broker-issued shell bundles, and redacted audit records exist. Durable accounts, ABAC/MAC context, OIDC/passkeys, disk-backed account stores, and resource bundles are proposal-shaped. Stale-session calls and retained shell-bundle caps fail closed for current proof paths, but session liveness is still represented by immutable metadata plus expiry timestamps rather than a mutable session-manager cell with logout, revocation, recovery-only, and renewal state.
  • Owner. docs/proposals/user-identity-and-policy-proposal.md, docs/backlog/local-users-management.md, docs/backlog/session-bound-invocation-context.md, docs/proposals/oidc-and-oauth2-proposal.md, docs/proposals/certificates-and-tls-proposal.md, docs/proposals/cryptography-and-key-management-proposal.md.
  • Gap. Until durable identity / persistence / passkey paths land, capOS is not a complete multi-user OS. Demo claims must scope to the proven anonymous + operator + manifest-seeded local accounts model. Before treating fixed short session expiry as production interactive UX, capOS needs explicit logout, owner-shell/gateway close propagation, and renewal paths that mint fresh grant leases without reviving stale ordinary grants.

R15 – App exception serialization depends on result-buffer capacity

  • State. Application-level exceptions are serialized into the caller’s result buffer; if the target cannot be identified, invocation fails earlier with transport errors. Truncation/transport failures are documented.
  • Owner. docs/proposals/error-handling-proposal.md, docs/capability-model.md.
  • Gap. Service UX/debuggability can degrade for malformed or small-buffer clients. No remediation is required in code today, but each service contract should document its expected result-buffer capacity.

Open Design Questions

The following questions came up in external review. Each row gives the current best answer observed in the tree, the canonical tracker to update, and an explicit status.

Q1 – Cap’n Proto ABI compatibility policy

  • Current answer. docs/abi-evolution-policy.md defines compatibility classes, stable schema ordinals, reserved-field rules, ring layout rules, version negotiation, deprecation windows, and review gates. Generated-code drift is still checked through make generated-code-check and tools/check-generated-capnp.sh.
  • Tracker. docs/abi-evolution-policy.md, docs/trusted-build-inputs.md, schema/capos.capnp, capos-config/src/ring.rs.
  • Status. Answered for the current research tree. Ring v2 compatibility remains a separate open question below.

Q2 – Ring v2 backward compatibility

  • Current answer. docs/proposals/ring-v2-smp-proposal.md treats per-thread ring ownership as the full-SMP target and frames it as an evolution that may need ABI changes; docs/tasks/README.md calls runtime ring reactor work the compatibility bridge.
  • Tracker. docs/proposals/ring-v2-smp-proposal.md, docs/backlog/smp-phase-c.md.
  • Status. Open. Whether Ring v2 is backward-compatible with the process-wide ring or an explicit ABI break has not been decided.

Q3 – Which capabilities are copy-transferable vs move-only vs non-transferable

  • Current answer. docs/authority-accounting-transfer-design.md defines copy/move/none transfer modes and the accounting/rollback rules. Per-interface transfer mode is encoded on the schema-defined CapObject.
  • Tracker. docs/authority-accounting-transfer-design.md, schema/capos.capnp.
  • Status. Partial. The mode is enforced per object, but the user-visible matrix (which named caps are copy/move/none) is not consolidated in one document.

Q4 – Copy-transfer replay: feature or compromise

  • Current answer. Repeatable copy-transfer replay is documented as the current accepted semantics. Exactly-once replay suppression is future work. See R5.
  • Tracker. docs/authority-accounting-transfer-design.md.
  • Status. Decided as “current semantics, future tightening optional”.

Q5 – When legacy endpoint identity is replaced and what migrates

  • Current answer. docs/backlog/session-bound-invocation-context.md decomposes the selected migration: one immutable session context per process, privacy-preserving endpoint caller-session metadata, chat/adventure/stdio session-keyed migration, and legacy endpoint-identity cleanup. The old service-object identity plan is superseded.
  • Tracker. docs/proposals/session-bound-invocation-context-proposal.md, docs/backlog/session-bound-invocation-context.md, docs/backlog/stage-6-capability-semantics.md.
  • Status. Selected milestone. See R3.

Q6 – Minimum production TCB target

  • Current answer. docs/proposals/security-and-verification-proposal.md now enumerates the current demo/proof TCB and the target production TCB. Current proofs still trust kernel networking, init/supervisors, broker/session services, harnesses, and QEMU virtio. The target production TCB removes ordinary apps and shell children but still includes minimal init/supervisor, credential/session/broker/key/audit services, production device managers, and ABI/schema/build-signature inputs.
  • Tracker. docs/security/trust-boundaries.md, docs/proposals/userspace-authority-broker-proposal.md, docs/proposals/boot-to-shell-proposal.md.
  • Status. Partially answered. The TCB statement exists; reducing the actual implementation to that target and proving the non-loopback shell gates remains open.

Q7 – Revocation strategy

  • Current answer. Generation/epoch revocation exists for endpoint-backed caps; CapabilityManager.revoke cleans up endpoint-backed service objects by object behavior. Session-bound dispatch now fails closed for stale proof paths, but the target lifecycle splits revocation into session liveness cells, grant leases, and object/facet epochs. Revocation trees, leases, supervisor-owned-cap patterns, and session renewal/close propagation are proposal-shaped.
  • Tracker. docs/proposals/service-architecture-proposal.md, docs/proposals/session-bound-invocation-context-proposal.md, docs/proposals/user-identity-and-policy-proposal.md, docs/capability-model.md.
  • Status. Open. The chosen revocation primitive set (epochs vs trees vs leases vs explicit-revoke methods per object) needs an explicit decision, and interactive session lifecycle needs a concrete liveness-cell plus renewal protocol.

Q8 – Boundary between kernel and service-level resource accounting

  • Current answer. Memory frame grants and cap-table slots are kernel accounting; storage/network buffer accounting is proposed at the service layer. The boundary is not yet implementation-driven.
  • Tracker. docs/proposals/resource-accounting-proposal.md, docs/proposals/storage-and-naming-proposal.md, docs/proposals/networking-proposal.md.
  • Status. Open.

Q9 – CPU accounting and scheduling contexts

  • Current answer. Per-CPU WFQ run queues, per-thread weighted vruntime, SchedulingPolicyCap weight/latency-class authority, and Phase E SchedulingContext bind/revoke, budget, donation/return, and depletion notification are implemented per docs/changelog.md (Phase D closed 2026-05-10) and docs/proposals/scheduler-evolution-proposal.md. Cross-service donation policy, priority inheritance broader than scheduling contexts, explicit scheduling-cap fairness across principals, and full nohz activation remain proposal-shaped.
  • Tracker. docs/proposals/smp-proposal.md, docs/proposals/scheduler-evolution-proposal.md, docs/backlog/scheduler-evolution.md, docs/proposals/resource-accounting-proposal.md, docs/architecture/scheduling.md.
  • Status. Partial. The base CPU accounting and scheduling-context model is implemented through Phase E; the surrounding policy (cross-service donation, full nohz activation, isolation leases, fairness across principals) is the remaining decision.

Q10 – IOMMU requirement for userspace networking

  • Current answer. docs/dma-isolation-design.md selects a runtime fail-closed backend: direct remapping only when capOS can discover and program trusted translation authority, otherwise a labeled brokered bounce-buffer fallback or unsupported. The current GCP/no-IOMMU userspace-driver evidence uses the brokered bounce path.
  • Tracker. docs/dma-isolation-design.md, docs/proposals/networking-proposal.md, docs/proposals/cloud-deployment-proposal.md.
  • Status. Answered for the current no-IOMMU cloud path. Future direct-remapping, vIOMMU, or hostile-hardware isolation claims require their own evidence and remain outside the brokered-bounce production authority closeout.

Q11 – Capability persistence model

  • Current answer. All capabilities are runtime-only today; sealed/stored caps and namespace-mediated reconstitution are storage-proposal scope.
  • Tracker. docs/proposals/storage-and-naming-proposal.md, docs/proposals/volume-encryption-proposal.md, docs/paper/plan.md (paper-scoped persistence Tier-1 prerequisite).
  • Status. Open.

Q12 – Least-privilege shell command invocation

  • Current answer. capos-shell runs commands using broker-issued bundles; the broker, not the shell, is the policy decision point. RestrictedShellLauncher keeps remote shell launches off raw spawn authority.
  • Tracker. docs/proposals/shell-proposal.md, docs/proposals/userspace-authority-broker-proposal.md, docs/proposals/boot-to-shell-proposal.md.
  • Status. Direction agreed, complete migration to broker-only authority for every shell-driven invocation is open.

Q13 – Formal properties to prove

  • Current answer. Existing bounded proofs cover cap-table non-forgery, frame-bitmap invariants, transfer rollback, and ring producer-consumer invariants. seL4-style full functional refinement is explicitly out of scope.
  • Tracker. docs/proposals/security-and-verification-proposal.md, docs/security/verification-workflow.md, docs/proposals/formal-mac-mic-proposal.md.
  • Status. Partially answered. A definitive list of “what we will keep proving” vs “what we will keep testing” should be added when the next Kani/Loom obligation set is concrete.

Q14 – Threat model coverage

  • Current answer. docs/proposals/security-and-verification-proposal.md now contains a threat actor matrix for local physical attackers, malicious DMA devices, malicious boot manifests, compromised init/supervisors, compromised narrow services, hostile network peers, and malicious build dependencies.
  • Tracker. docs/security/trust-boundaries.md, docs/proposals/security-and-verification-proposal.md, docs/dma-isolation-design.md, docs/trusted-build-inputs.md.
  • Status. Answered at design level. Remaining work is implementation/proof through the relevant task records.

Q15 – Language runtimes integration model

  • Current answer. capos-rt is the canonical no_std Rust runtime. Go, Python, Lua, JavaScript/TypeScript, WASI, C/C++, and POSIX-shaped software are future tracks. The current documentation separates native runtime adapters, capability-native bindings, POSIX compatibility adapters, and WASI host adapters instead of treating “compatibility layer” as one shared ABI.
  • Tracker. docs/programming-languages.md, docs/proposals/userspace-binaries-proposal.md, docs/proposals/go-runtime-proposal.md, docs/proposals/lua-scripting-proposal.md.
  • Status. Open. A common ABI layer vs per-runtime generated clients has not been decided; the current default is per-runtime or adapter-specific clients backed by explicit capabilities.

Device Driver Specifications

The pages under docs/devices/ are per-device driver references. Each one captures the authoritative hardware/protocol specification a capOS device driver is built from, the subset of that specification the driver actually implements, and how the device binds onto capOS’s reviewed userspace hardware-authority gate.

A device page is a navigational / provenance document, not a re-spec. It cites the spec (name, version, source), summarizes only the wire-format subset the driver actually implements, and points into the implementation with file + symbol references (the function, type, or constant name – not line numbers, which drift) so the doc maps to the code. Do not copy the full spec or dump exhaustive register tables: if something is in the spec and not specially handled by the driver, link to it rather than transcribing it.

Depth scales to maturity and risk. Transitional or stable in-kernel drivers get a concise provenance map – do not over-document stable code. Actively developed or higher-risk drivers (new DMA paths, cloud NICs/storage behind the userspace-authority gate) get fuller treatment.

These pages are the provenance map of record for device-driver work. Landing or modifying a device driver requires creating or updating the matching docs/devices/<device>.md page as part of the same change; it is part of that change’s acceptance, not an afterthought. Each page reads as a reader-facing capability map – the driver’s currently-implemented subset in present tense and what is future or not yet implemented – not a per-slice development log.

docs/devices/ is distinct from the other device-adjacent doc areas:

  • docs/research/ holds OS-design research deep-dives (capability models, IPC, scheduling, IOMMU prior art). It informs architecture; it does not specify a concrete device.
  • docs/*-design.md and docs/proposals/ describe capOS subsystem designs (the DMA isolation model, the device-manager refactor, the userspace-driver authority gate). They define the framework a driver binds into; a device page maps one device onto that framework.
  • docs/devices/<device>.md is the narrow, per-device contract: which external spec, which wire-format fields, and which capOS grants and fail-closed rules the driver depends on.

Three-part structure

Every device page follows the same three sections. See Device Spec Template for the blank form and virtio-net for a worked example.

  1. Spec basis – the authoritative specification(s) the driver is built from: name, version, and source (URL or ref). For open vendor devices without a freely published register spec (for example AWS ENA or Azure MANA), cite the upstream open-source driver and any published datasheet as the basis of record.
  2. Wire format (relevant subset) – the registers / BARs, queues / rings, descriptor and completion formats, and admin / management commands that the driver actually implements. Document the subset, not the whole spec.
  3. capOS mapping – how the device binds (note transitional in-kernel status and any pending userspace move where applicable); its DeviceMmio / Interrupt / DMAPool usage; the fail-closed and validation rules it relies on (stale-generation rejection, bounds checks, doorbell scoping); and what is QEMU-emulable versus hardware-only. The last point drives whether the driver carries a QEMU proof or a host-side conformance gate plus a deferred live proof.

Pages

  • Device Spec Template – the blank three-part form for a new device page.
  • virtio-net – the in-tree modern virtio-net PCI NIC: the worked first example, sourced from the kernel virtio transport and the public virtio specification.
  • NVMe – the queue-base/PRP register and descriptor subset the conditional kernel on-notify DMA validator scans on the NVMe doorbell path, plus the no-IOMMU brokered-DMA correction (validator mechanism + bounded hostile-scan proof + brokered controller bring-up).
  • AWS Nitro EBS (NVMe storage) – the AWS cloud-shape classification on top of the shared NVMe foundation: EBS exposed as NVMe namespaces, the Nitro IOMMU-availability DMA-backend policy, and the local make run-pci-nvme precursor proof.
  • Azure managed disk (NVMe storage) – the Azure cloud-shape classification on the same shared NVMe foundation: Azure Boost managed disks exposed as NVMe namespaces, the Azure IOMMU-availability DMA-backend policy, why the older-family Hyper-V/SCSI path is out of scope, and the local make run-pci-nvme precursor proof.
  • GCP Persistent Disk (storage) – the GCP cloud-shape classification on the same shared NVMe foundation: PD exposed as NVMe namespaces on current GCE generations, the GCE IOMMU-availability DMA-backend policy, why the older-family virtio-scsi PD path is out of scope, and the production storage-bind proof (cloud-prod-storage-bound-local-proof) that precedes a billable live-GCE storage driver bind.
  • GCE gVNIC – a grounding map for the Google Virtual NIC: spec basis from the public gVNIC docs and the GVE Linux driver, the wire-format subset (BARs, admin queue, MSI-X interrupt classes, GQI/DQO formats, QPL/RDA addressing, reset) a future reusable capOS driver would implement, and the DDF authority mapping. capOS has live-GCE inventory, admin-queue/register, bounded GQI/QPL raw-frame TX/RX, and typed Nic-adaptation proofs for the 1ae0:0042 PCI function, but no reusable gVNIC provider service, QEMU model, DQO/RDA path, or host conformance suite; it is a separate GCE portability lane, not a blocker for the virtio-net Web UI proof.

<Device> Driver Specification

Copy this file to docs/devices/<device>.md, set the front matter (status, description, last_reviewed, topics), add the page to docs/SUMMARY.md, and fill in the three sections below. Document only the subset the driver actually implements; cite, do not transcribe, the full spec.

1. Spec basis

  • Device: name, PCI/MMIO class and IDs (vendor/device), instance shapes.
  • Authoritative spec(s): name, version, and source (URL or ref). For open vendor devices without a published register spec, cite the upstream open-source driver and any datasheet as the basis of record, and say so explicitly.
  • Reference driver(s) (optional): upstream implementations cross-checked for behavior.

2. Wire format (relevant subset)

  • Registers / BARs: BAR layout, register map offsets, doorbell offsets the driver reads or writes.
  • Queues / rings: queue kinds (admin/management vs I/O), ring layout, sizes.
  • Descriptor + completion formats: the descriptor and completion entry fields the driver encodes/decodes, including flags and status codes.
  • Admin / management commands: feature negotiation, identify/configure, and lifecycle commands the driver issues.

3. capOS mapping

  • Authority gate: how the device is enumerated, claimed, and bound through the reviewed userspace-driver hardware-authority gate and the device-manager ownership ledger.
  • DeviceMmio: which BAR ranges are mapped, with what page attributes (device-uncacheable, NX), and how register/doorbell writes are scoped.
  • Interrupt: MSI/MSI-X vector binding, completion-IRQ waiter model.
  • DMAPool: queue/buffer DMA allocation, the selected DMA backend (direct IOMMU vs labeled bounce buffer), quiesce/scrub-before-reuse rules, and the host-physical-address / IOVA non-exposure policy.
  • Fail-closed / validation rules: stale-generation rejection, BAR bounds, doorbell scoping, malformed-descriptor handling, release/reset/driver-death teardown.
  • QEMU-emulable vs hardware-only: which parts are end-to-end provable in QEMU (and the make run-* target) versus hardware-only (host-conformance gate now, deferred live proof when the hardware is provisioned).

virtio-net (modern PCI NIC)

This is a provenance map for the in-tree virtio-net driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged, it links rather than transcribes. The driver is mature and transitional (in-kernel today, slated to move to a userspace network-stack process), so the treatment is a concise map rather than exhaustive register tables.

1. Spec basis

  • Device: virtio network device, modern (virtio 1.x) PCI transport. PCI vendor 0x1af4; device 0x1041 (modern) / 0x1000 (transitional). IDs at kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_NET_MODERN_DEVICE_ID, VIRTIO_NET_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_net).
  • Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.1 (network device).
  • Reference: cross-checked against the Linux virtio_net and virtio_pci_modern drivers for the modern-transport handshake and split-ring layout.

2. Wire format (implemented subset)

Only the modern split-ring subset the driver uses is summarized here; feature bits and structures the spec defines but the driver does not specially handle are linked, not transcribed.

  • PCI capabilities / BAR layout: virtio modern PCI vendor capabilities (common / notify / ISR / device / PCI cfg) parsed from the capability list; type constants VIRTIO_PCI_CAP_COMMON_CFG / ..._NOTIFY_CFG / ..._ISR_CFG / ..._DEVICE_CFG / ..._PCI_CFG and length floors VIRTIO_PCI_CAP_MIN_LEN / VIRTIO_COMMON_CFG_MIN_LEN in kernel/src/virtio.rs; common-config register offsets are the transport::COMMON_* constants (COMMON_DEVICE_FEATURE, COMMON_QUEUE_SELECT, COMMON_QUEUE_NOTIFY_OFF, …). The notify capability carries notify_off_multiplier (ModernTransport::notify_off_multiplier) used to compute per-queue notify addresses.
  • Split-ring layout: 16-byte descriptors (transport::VIRTQ_DESC_SIZE), available and used ring offsets, and the transport::VIRTQ_DESC_F_NEXT / transport::VIRTQ_DESC_F_WRITE flags. Descriptor lifecycle is generation-tracked through a bounded transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS slot array (DescriptorTrackingSlot).
  • Queues: RX queue (VIRTIO_NET_RX_QUEUE), TX queue (VIRTIO_NET_TX_QUEUE), negotiated to a bounded target size (VIRTIO_NET_QUEUE_TARGET_SIZE); the target size must not exceed the tracking slot count (compile-time assert! against transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS).
  • Net header / framing: 12-byte virtio_net header prepended to frames (VIRTIO_NET_HDR_LEN); proof TX buffers carry the header plus a minimum Ethernet frame (TX_PROOF_BUFFER_LEN, TX_PROOF_ETHERNET_OFFSET).
  • Feature negotiation: device/driver feature select/read registers in the common config; the driver negotiates transport::VIRTIO_F_VERSION_1 / transport::VIRTIO_F_ACCESS_PLATFORM (generic, in transport) plus the net-specific VIRTIO_NET_F_MAC (1 << 5) and acknowledges VIRTIO_NET_F_MRG_RXBUF (1 << 15).

3. capOS mapping

  • Binding (transitional): virtio-net is currently driven in the kernel. PCI/MSI-X transport discovery, the split-ring transport, smoltcp, TCP listeners, the line discipline, and the Telnet IAC filter live in kernel/src/virtio.rs and kernel/src/cap/network.rs. This is explicitly transitional: Phase C of the networking proposal (docs/proposals/networking-proposal.md) moves the NIC driver and stack into a userspace network-stack process once the userspace-driver authority gate applies to it. Until then it does not bind through the DeviceMmio/Interrupt/DMAPool provider grants the DDF cloud-NIC drivers use; the sections below describe its kernel-owned equivalents.

  • MMIO: modern-transport common/notify/ISR/device config regions are mapped from the device BARs and accessed through the transport MMIO helpers (kernel/src/virtio.rs transport module). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed from notify_off_multiplier; the DDF DeviceMmio cap (kernel/src/cap/device_mmio.rs) is the userspace successor surface.

  • Interrupt: MSI-X vectors are programmed for config and per-queue interrupts; route records and vector dispatch are tracked by the kernel-owned device-interrupt ledger (kernel/src/device_interrupt.rs). The make run-net smoke asserts MSI-X metadata selection, vector-pool/exhaustion policy, masked route lifecycle, queue vector assignment, descriptor guards, ARP, and ICMP. Device-autonomous delivery proofs live in the dedicated userspace-provider MSI-X gates, not in the retired kernel L4 owner.

  • DMA: ring pages and TX/RX buffers are allocated and accounted through the net-keyed kernel DMA ledger (kernel/src/device_dma.rs). make run-net runs without an emulated IOMMU, so DMA uses the intended bounce-buffer fallback; no host physical address or IOVA is exposed beyond the kernel boundary.

  • Production cloud build cfg surgery (DMA ledger + DDF caps): kernel/src/device_dma.rs and the cap surfaces kernel/src/cap/dma_pool.rs (DmaPoolCap/DmaPoolCapInfo) and kernel/src/cap/dma_buffer.rs (DmaBufferCap/DmaBufferCapInfo) compile in the non-qemu build. The cloud-prod-dmapool-bounce-buffer-grant-proof wires the first production caller through kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs: it stages a parked manager-attached DMAPool record over one DMA-capable PCI function from the inventory (stage_bounce_buffer_dmapool_record in kernel/src/device_manager/stub.rs), builds a DmaPoolCap over the parked handle, allocates one bounded bounce-buffer DMABuffer through device_manager::issue_manager_attached_dmabuffer_handle_with_request (which routes to device_dma::allocate_manager_attached_dmapool_bounce_buffer_page), asserts cap-info labels (userspace_dmapool=manager-issued-bounce-buffer, allocation=single-bounce-buffer-page, real_dma=not-attempted, direct_dma=blocked, host_physical_user_visible=0, iova_export=disabled-future-only), the dma_backend::select_and_report bounce-buffer verdict, quiesce-before- release (release_dmapool_record_for_cap_release returns pending-buffer-release while the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place before the frame returns to the allocator), and stale-handle-after-detach, then emits cloudboot-evidence: dma-pool-grant <token> for the cloudboot harness. The qemu-only surface that stays gated includes the cap::dmapool_grant_source bootstrap source (kernel/src/cap/dmapool_grant_source.rs), the KernelCapSource::DmaPool grant arms in kernel/src/cap/mod.rs and kernel/src/cap/process_spawner.rs, the DmaBufferCompleteDescriptorAdmission::provider_cq_event field that carries cap::interrupt_grant_source::ProviderCompletionCqEventIdentity, and the entire kernel/src/device_manager/qemu_full.rs DDF backend (including device_dma::{begin_virtio_net_pool, allocate_virtio_net_page, ...}). The proof maps no userspace VMA, programs no real DMA, attaches no queue, programs no interrupt, and emits no provider-nic-bound / storage-bound; descendants in docs/backlog/hardware-boot-storage.md#cloud-device-tracks cover those.

  • Fail-closed rules: requested ranges are validated against device-reported geometry and destination buffer length before any device access; descriptor reuse is generation-tracked; the bounded tracking-slot array (transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS, DescriptorTrackingSlot) caps in-flight descriptors. Stale/over-range requests fail closed.

  • QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides virtio-net-pci; make run-net is the end-to-end proof. No hardware-only path – this is the local-binding reference the cloud NIC drivers (ENA, MANA, GCP virtio-net) mirror for their QEMU-provable halves.

  • GCP cloud-shape classification: GCP 1st/2nd-gen x86 non-Confidential machine families (e.g. n1-*, e2-*) present the virtual NIC as exactly this standard virtio-net device (vendor 0x1af4) under a no-IOMMU / SWIOTLB bounce-buffer DMA backend, so the QEMU virtio-net-pci binding is the local precursor for the GCP NIC path. The enumeration path emits a virtio-net: cloud shape classification proof line (kernel/src/pci.rs report_cloud_virtio_net_shape) classifying the enumerated function against that documented GCP surface; both make run-net and make run-ddf-provider-consumer assert it conjunctively with the GCP-mapped bounce-buffer dma: backend selection line (kernel/src/dma_backend.rs select_and_report). The GCP→bounce-buffer mapping itself is the support-policy expectation recorded in docs/research/cloud-dma-provider-evidence.md. The proof carries explicit scope flags (local_qemu_precursor=true, real_gcp_enumeration=not-claimed, gvnic=separate-driver-out-of-scope); live GCP enumeration and cloud used-ring ownership remain cloud-gcp-virtio-net-nic-driver.

  • Production cloud-boot evidence marker (dma-backend): the production boot path (the kernel built without the qemu feature, which is what make capos-cloudboot-image packages) emits the parseable cloudboot-evidence: dma-backend <token> serial marker the tools/cloudboot/ harness reads (serial_marker_tokens; “Serial evidence-marker contract” in tools/cloudboot/README.md). It is emitted by kernel/src/dma_backend.rs select_and_report (always-compiled, so it fires on the production cloud image, not just the qemu smoke build) alongside the human-readable dma: backend selection line. The marker uses the harness token namespace (direct_dma / trusted_domain / bounce_buffer), mapped from the resolved DmaBackend by cloudboot_evidence_token – deliberately not the DmaBackend Display string (direct-remapping / bounce-buffer). The current two-variant resolved backend maps to direct_dma / bounce_buffer; on the probed GCE shapes (IOMMU disabled) the value is bounce_buffer. The trusted_domain slot has no current producer and is reserved. This marker is honest read-side evidence of the boot-time DMA-backend selection; it asserts no device bind and is independent of the bound-through-authority provider-nic-bound marker, which remains the cloud-gcp-virtio-net-nic-driver claim.

  • Production cloud-boot evidence marker (device-class): the production boot path also emits the companion cloudboot-evidence: device-class <token> serial marker (one per distinct enumerated PCI base class, harness-deduped via sort -u; “Serial evidence-marker contract” in tools/cloudboot/README.md).

    • Spec basis: PCI base-class codes from the PCI Code and ID Assignment Specification (PCI-SIG); the base class is the high byte of the class-code/revision dword at config-space offset 0x08 (kernel/src/pci.rs PCI_CLASS_REVISION).
    • Implemented wire-format subset: a genuinely read-only config-space scan over the production source resolved from the boot-time MCFG probe: ECAM when MCFG validates, otherwise legacy I/O. report_cloudboot_device_class_evidence (kernel/src/pci.rs) walks each bus/device/function via for_each_enumerated_function and the read-only functions_to_scan helper, reading only the vendor-id, header-type, and class-code (PCI_CLASS_REVISION) words. The base class is the high byte of PCI_CLASS_REVISION. It deliberately does not call read_device/read_bars, which would perform transient BAR-sizing config writes. Each distinct base class is emitted once in ascending order, formatted {:#04x} (e.g. 0x02). The marker is emitted from kernel/src/main.rs run_init, so it fires on every build configuration (not only the qemu/diagnostics PCI-diagnostics path), including the non-qemu production cloud image.
    • capOS mapping: enumeration evidence only – it allocates no DeviceMmio/Interrupt/DMAPool, claims no device ownership, performs no bus-master enable, BAR mapping, BAR-sizing write, or DMA, and never emits provider-nic-bound.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows base classes 0x01 (storage), 0x02 (network), 0x03 (display), and 0x06 (bridge). No GCE resources are created and no make cloudboot-test run is required.
  • Production cloud-boot evidence marker (device-inventory): the production boot path also emits a per-function PCI claim-identity inventory so later bind children discover the real device identity instead of assuming the QEMU-fixed BDF layout the --features qemu path hard-codes. It emits a human-readable pci-inventory: detail line per enumerated function plus the parseable cloudboot-evidence: device-inventory <token> marker (one per function, harness-deduped via sort -u; “Serial evidence-marker contract” in tools/cloudboot/README.md).

    • Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration header — vendor/device ids at offsets 0x00/0x02, the class-code triple (base class / subclass / prog-if) in the high three bytes of the class-code/revision dword at offset 0x08, header type at offset 0x0e, and interrupt line / pin at offset 0x3c (§6.1 “Configuration Space Organization”). BAR registers are not part of this production marker.
    • Implemented wire-format subset: report_cloudboot_device_inventory_evidence (kernel/src/pci.rs) walks each bus/device/function via for_each_enumerated_function and the read-only functions_to_scan helper. For each present function, read_cloudboot_inventory_record reads only vendor/device, class/subclass/prog-if, revision, header type, interrupt pin, and interrupt line. report_cloudboot_inventory_record formats one identity token: <seg>.<bus>.<dev>.<fn>-<vendor>.<device>-<class>.<subclass>.<progif>-rev.<rev>-hdr.<hdr>-irq.<pin>.<line>. It is emitted from kernel/src/main.rs run_init right after the device-class markers, on every build configuration including the non-qemu production cloud image.
    • capOS mapping: read-only enumeration evidence. The production marker performs no BAR-size probe, config-space write, BAR mapping, bus-master / memory-space / IO-space command-bit enable, doorbell write, DMA, or device ownership claim, and never emits provider-nic-bound. The later cloud-NIC bind children consume this inventory to resolve the real PCI function identity instead of the QEMU-fixed BDF fixtures; BAR/MMIO authority is proven by separate DeviceMmio evidence paths.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the per-function pci-inventory: lines and device-inventory markers for the emulated functions (virtio-net 1af4, storage, display, bridge). No GCE resources are created and no make cloudboot-test run is required.
  • Production-build device_manager / DeviceMmio compile surface: kernel/src/device_manager/mod.rs is now always compiled, but it is a thin orchestrator that re-exports a shared subset (error.rs, handles.rs, mmio.rs, types.rsDeviceManagerError, DeviceMmioHandle / DeviceOwner / PciBdf / DeviceMmioRegion, the MMIO record / map / unmap / read32 / write32 admission types, DeviceMmioCapReleaseOutcome, ProviderNotifyDoorbellWrite) plus a feature-gated implementation: under cfg(feature = "qemu") it routes through qemu_full.rs (the full DDF surface — dma_buffer.rs / dma_pool.rs / interrupt.rs / proofs.rs, NVMe brokered controller registers, IOMMU domain ledgers, virtio TX/RX ring publication); under cfg(not(feature = "qemu")) it routes through stub.rs, which now carries a bounded one-slot parked-region path used by the production bar-readback proof (stage_bar_readback_region, validate_devicemmio_record, read_devicemmio_u32, detach_devicemmio_record_for_cap_release, trigger_*_for_devicemmio). The DMA/write/notify/map shims still report DeviceMmioStaleHandle because no production caller exists yet for those; the descendant slices in docs/backlog/hardware-boot-storage.md un-gate them through the reviewed grant path. kernel/src/cap/device_mmio.rs and its super::hardware_audit / super::hardware_release_log audit hooks are likewise always compiled. The KernelCapSource::DeviceMmio user-facing grant arm in kernel/src/cap/mod.rs stays cfg(feature = "qemu")-gated; the production bar-readback proof builds its DeviceMmioCap from boot (cap::devicemmio_bar_readback, see below) without going through that user-facing grant arm. The crate::iommu module and the real kernel/src/virtio.rs stay cfg(feature = "qemu")-gated. The crate::device_dma module compiles in both builds for the dmapool-grant proof, and the crate::device_interrupt module compiles in both builds for the interrupt route/source allocation proof below; their KernelCapSource::Interrupt user-facing grant arm and interrupt_grant_source bootstrap-grant module in kernel/src/cap/mod.rs stay cfg(feature = "qemu")-gated.

  • Production cloud-boot evidence marker (device-mmio-bar-read): the production boot path also exercises one PCI function’s first memory BAR through the reviewed DeviceMmioCap read32 surface and emits a parseable cloudboot-evidence: device-mmio-bar-read <token> marker.

    • Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR semantics. The marker carries the function’s BDF, the BAR index, the 32-bit value read at offset 0, and the kernel-mapped window length. The kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL + WRITABLE, matching the existing mem::paging::map_kernel_mmio_range contract for MMIO windows.
    • Implemented wire-format subset: cap::devicemmio_bar_readback::report (kernel/src/cap/devicemmio_bar_readback.rs) enumerates PCI functions via pci::enumerate(), picks the first with a memory BAR of at least 4 KiB at a non-zero base, maps the first 4 KiB of that BAR through mem::paging::map_kernel_mmio_range, stages a parked region through device_manager::stage_bar_readback_region (one slot, mapping generation monotonic), constructs a DeviceMmioCap over the resulting DeviceMmioHandle, and calls cap.read32(0). The read goes through the same validate_devicemmio_record → range/alignment check → read_volatile path as the qemu DDF surface; on the production path the parked region’s recorded kernel virtual address backs the read. The marker token shape is <seg>.<bus>.<dev>.<fn>-b<bar>-<value>-len.<len> (value in 32-bit hex with 0x prefix, length in hex bytes), inside the harness grammar [A-Za-z0-9._-]+.
    • Fail-closed assertions: the proof immediately retries read32 at exactly length and asserts range_result != "ok" (out-of-range read is rejected with no MMIO side effect), then detaches the parked record through detach_devicemmio_record_for_cap_release and asserts the next read32(0) fails closed at the device manager (DeviceMmioStaleHandle). Both outcomes are logged on a devicemmio-bar-readback: range_bounding ... / stale_generation ... line so a regression trips the boot log alongside the missing marker.
    • capOS mapping: the mapping is boot-only kernel-half (no userspace VMA is exposed by this proof); revocation drops the parked slot, which invalidates the cap-side identity without removing the kernel mapping itself (the boot-only mapping stays installed for the rest of the boot). The descendant userspace-driver slices in docs/backlog/hardware-boot-storage.md#cloud-device-tracks add the userspace VMA path with TLB shootdown on revoke.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and no make cloudboot-test run is required. The qemu build keeps the existing make run-devicemmio-grant smoke as the end-to-end DDF proof; the bar-readback caller in cap::devicemmio_bar_readback is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own DeviceMmio claim path.
  • Production cloud-boot evidence marker (interrupt-route-allocated): the production boot path also exercises one PCI function’s MSI-X capability through the reviewed device_interrupt vector pool and emits a parseable cloudboot-evidence: interrupt-route-allocated <token> marker.

    • Spec basis: PCI Local Bus Specification (PCI-SIG) 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X capability structure. The capability header dword exposes Control (function-mask, table-size-1, enable), the Table BIR/Offset dword exposes the BAR index in the low 3 bits and the byte offset in the upper bits (each table entry is 16 bytes), and the PBA BIR/Offset dword exposes the Pending Bit Array location. The marker carries the function’s BDF, the selected MSI-X table entry, the kernel-pool MSI vector, and the route/source generation pair allocated for the entry. No live MSI-X table write or device interrupt is performed on this path.
    • Implemented wire-format subset: cap::interrupt_route_alloc::report (kernel/src/cap/interrupt_route_alloc.rs) enumerates PCI functions via pci::enumerate(), walks each function’s capability list through pci::capabilities, parses MSI-X capability fields through pci::interrupt_capabilities / parse_msix_capability (offset, control, table_size, table_bir, table_offset, pba_bir, pba_offset, both validated through the existing MSI-X region BAR checks), picks the first MSI-X capability with table_size >= 1, and allocates a kernel-owned MSI vector + interrupt source/route record over its first table entry (SELECTED_TABLE_ENTRY = 0) through the production device_interrupt::register_pci_msix_route_by_bdf vector pool (kernel/src/device_interrupt.rs, lapic::DEVICE_MSI_VECTOR_BASE = 0x50, 16 device-MSI vectors). It then device_interrupt::claim_routes the route under DeviceInterruptDriver::ManagerGrantSource. The marker token shape is <seg>.<bus>.<dev>.<fn>-entry.<n>-vector.<hex>-src.<id>.gen.<g>-route.gen.<g> (vector in 2-digit hex, source-id and generations decimal), inside the harness grammar [A-Za-z0-9._-]+.
    • Fail-closed assertions: the proof asserts three invariants inline before emitting the marker. (1) Claimed-state visibility: validate_claimed_route succeeds for the correct ManagerGrantSource owner and fails closed with WrongOwner for a distinct KernelIoApicProof owner – the route is owner-scoped. (2) Duplicate-source rejection: a second register_pci_msix_route_by_bdf against the same (bdf, table_entry) while the original route is live is rejected with DuplicateSource – the source identity is unique. (3) Stale-after-release: release_claimed_route clears the slot and a subsequent validate_claimed_route on the same handle fails closed with StaleRoute – no stale handle can re-enter the route table. Each outcome is logged on an interrupt-route-alloc: claimed_state ... / duplicate_source ... / stale_after_release ... line so a regression trips the boot log alongside the missing marker.
    • capOS mapping: route/source-allocation evidence only. The proof parses the MSI-X capability and consumes one slot from the kernel-owned device-MSI vector pool, then returns it on release; it does NOT map the MSI-X table or PBA BAR window, write a table entry, program a LAPIC dispatch slot for live delivery, raise/handle a device interrupt, install a waiter, acknowledge an EOI, or exercise mask/unmask/reset on the live vector. No provider-nic-bound or storage-bound marker. The follow-on live-delivery proof (interrupt-route-delivered) extends this surface; see the next section. The cap::interrupt_route_alloc caller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own Interrupt claim path.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function (QEMU’s modern virtio-pci front-end exposes a per-function MSI-X capability). No GCE resources are created and no make cloudboot-test run is required. The qemu build keeps the existing make run-interrupt-grant and make run-net smokes as the end-to-end DDF and virtio-net MSI-X proofs.
  • Production cloud-boot evidence marker (interrupt-route-delivered): the production boot path then extends the route-allocation proof to live MSI-X delivery: it programs the table entry, attaches the route to device manager, arms the deferred-LAPIC-EOI gate, injects one grant- source dispatch, retires the deferred EOI, masks and re-injects to prove no stale wake, reassigns to bump the route generation, asserts the stale handle + stale pending token both fail closed, then releases. Emits one cloudboot-evidence: interrupt-route-delivered <token> marker.

    • Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte entries: 64-bit message address, 32-bit message data, 32-bit vector control with bit 0 = entry mask) plus the per-spec mask-first write ordering (the entry mask must be asserted before message address/data are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the deferred-EOI write retired by acknowledge_deferred_lapic_eoi_for_route against the LAPIC EOI register (arch::x86_64::lapic::eoi).
    • Implemented wire-format subset: cap::interrupt_delivery_proof::report (kernel/src/cap/interrupt_delivery_proof.rs) reuses pci::map_msix_table to map the MSI-X table BAR window kernel-side (UC + NX + GLOBAL + WRITABLE through mem::paging::map_kernel_mmio_range) and pci::write_msix_table_entry to program entry 0 with the route’s message_address (from arch::x86_64::lapic::current_device_msi_delivery) and message_data (the allocated kernel-pool vector) under per-spec mask-first ordering. It then attaches the route through device_interrupt::attach_claimed_route_to_device_manager, enables the deferred-LAPIC-EOI gate via device_interrupt::enable_deferred_lapic_eoi_for_route, unmasks the route through device_interrupt::unmask_device_manager_attached_route and the table entry through pci::set_msix_table_entry_mask, drives one injected dispatch through device_interrupt::handle_lapic_delivery (the same dispatch slot the qemu make run-interrupt-grant proof and nvme-admin-interrupt-delivery exercise), retires the deferred EOI via device_interrupt::acknowledge_deferred_lapic_eoi_for_route, masks both surfaces and re-injects through device_interrupt::record_lapic_delivery, reassigns via device_interrupt::reassign_claimed_route to bump the route generation, and asserts stale-handle / stale-pending-token rejection through device_interrupt::validate_claimed_route / device_interrupt::check_pending_lapic_token.
    • Fail-closed assertions: five inline assertions gate the marker. (1) Live delivery: handle_lapic_delivery returns a Delivered { .. } outcome bound to the live route’s (source_id, source_generation, route_generation, owner), delivery_count advances by 1, eoi_deferred=true, and pending_deferred_eoi_count >= 1. (2) Ordered acknowledge: acknowledge_deferred_lapic_eoi_for_route reports eoi_written=true, ack_delta=1, and pending_after=0 – each pending unit retires exactly one LAPIC EOI through the counter- based exclusion device_interrupt.rs documents at acknowledge_deferred_lapic_eoi_for_route / close_deferred_eoi_gate_and_drain. (3) Masked no-wake: after mask, record_lapic_delivery returns Masked { state: ClaimedMasked, .. } and delivery_count does not advance. (4) Reassign generation bump + stale handle: the prior handle’s validate_claimed_route returns StaleRoute; the stale pending token’s check_pending_lapic_token reports wake_blocked=true with either Unregistered (the live evidence case: reassign’s first_available_vector runs before clear_dispatch_slot retires the old vector, so the next pool slot is chosen and the stale token names an unregistered vector) or SourceRouteGenerationMismatch (the single-slot-pool degenerate case where reassign reused the same vector); and a fresh injected dispatch under the reassigned route + vector lands on the new generation while leaving the stale token blocked. (5) Release: release_claimed_route clears the slot and validate_claimed_route on the reassigned handle now fails closed with StaleRoute. Each outcome is logged on interrupt-delivery: live_delivery ... / ordered_acknowledge ... / masked_no_wake ... / reassign_stale ... / release ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: route/source allocation + live delivery + ordered acknowledge + mask/unmask + reset/reassignment + stale-route-generation rejection, all on the production cloud kernel. The MSI-X table entry is programmed but the PCI function-level MSIX_CONTROL_ENABLE bit is intentionally NOT toggled (the proof never enables MSI-X on the function, so no real device-autonomous interrupt can fire on the programmed entry); the proof exits with the table entry re-masked. There is no userspace Interrupt waiter on the production cloud kernel yet, so the proof’s “waiter wake” boundary is the kernel-side dispatch slot a real provider waiter would consume — the marker reports waiter_wake=kernel-side-proxy rather than overclaiming a provider-cap-side wake. No provider-nic-bound or storage-bound marker. The cap::interrupt_delivery_proof caller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s own Interrupt claim path; the qemu build keeps make run-interrupt-grant as the broader end-to-end exercise of the interrupt grant surface with the full DDF backend.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a QEMU boot of target/disk.raw (the make capos-cloudboot-image production image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and no make cloudboot-test run is required. PBA handling is recorded by including the pba_bir and pba_offset from MsixCapabilityInfo in the proof’s ok line; the kernel does not read or clear PBA bits (devices set them, and this proof never enables the function so no PBA bit can be set in practice).
  • Production cloud-boot evidence marker (provider-nic-bound): the gate the billable GCE run consumes through tools/cloudboot/ NIC_PROOF_MARKER / --require-provider-nic-proof. It is sourced from real userspace driver progress: the marker fires only after the always-built polled virtio-net provider cap::virtio_net_polled_provider has completed a TX+RX over the live function and observed the RX completion by polling the latched used ring (zero kernel-injected interrupts). The predecessor staged its own DeviceMmio

    • DMAPool/DMABuffer + MSI-X Interrupt grant surfaces at boot and proved the “queue-completion handoff” by calling device_interrupt::handle_lapic_delivery — a kernel-side dispatch-slot proxy (the inject_real_lapic_int_for_proof precedent). That proxy is removed as the source of the gate: cap::provider_nic_bind_proof::report now runs once at boot and emits no marker (it records the deferral to the real provider’s completion); the marker is emitted later from cap::provider_nic_bind_proof::report_real_completion, called from the provider’s release-time completion path.
    • Spec basis: virtio 1.2 §2.7 split-ring used-ring semantics (the device writes a used element; the driver observes used.idx advance) — the completion the provider polls; virtio 1.2 §5.1.6 virtio-net receiveq frame layout (12-byte modern header + ethernet frame) for the EtherType read-back; inherited MSI-X table layout / mask-first ordering (PCI 3.0 §6.8.2) only for the release-time route assertion chain, which never delivers an interrupt on the completion path.
    • Implemented wire-format subset: cap::virtio_net_polled_provider (staged when the booted manifest declares the cloud-provider-nic-bound-real-polled-driver-smoke binary) drives the modern virtio status sequence to DRIVER_OK, materializes the RX virtqueue (queue 0) + TX stimulus virtqueue (queue 1), holds the PCI function-level MSI-X enable mask-first, maps the notify region, and programs the RX MSI-X route over table entry 0 (used only by the release-time assertion chain). Its attempt_rx_submit (admitted from the userspace DMABuffer.submitDescriptor(queue=0)) publishes the RX descriptor (VIRTQ_DESC_F_WRITE), drives the ARP TX stimulus, polls the latched used ring for the one real device->host RX DMA, and resets the device; its invoke_wait reads the latched PublishedRx with delivery_count unchanged. report_real_completion then sources the provider-nic-bound token from that PublishedRx (used.idx, used[0].id, used[0].len, observed EtherType) plus the picked function identity.
    • Fail-closed assertions: report_real_completion re-asserts the real RX completion facts independently of the provider’s own gate before the marker is emitted. (1) Real device->host RX DMA: the latched used ring advanced exactly once (used.idx == 1), the completion is the posted descriptor (used[0].id == 0), the device wrote a non-empty frame (used[0].len > 0), and the provider read back a non-zero EtherType. (2) Polled, not injected: the provider’s Interrupt.wait advanced no kernel dispatch (provider_observed_dispatch == 0) and retired no deferred LAPIC EOI (provider_observed_ack == 0). On any regression a provider-nic-bind: real-completion regression (no marker): ... line trips the boot log and no marker is emitted, so provider.json’s provider_nic_proof stays null and --require-provider-nic-proof fails closed.
    • capOS mapping: the marker is now backed by real userspace driver progress on the production cloud kernel. It carries the real-provider labels waiter_wake=polled-used-ring, rx_completion=polled-used-ring, int_injected=0, userspace_driver_authority=present-real-polled-provider, virtio_common_config_write=performed, provider_tx_rx=completed, device_autonomous_raise=not-claimed, host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only, and live_cloud=not-attempted — never the predecessor’s waiter_wake=kernel-side-proxy / userspace_driver_authority=absent-on-non-qemu. The RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the PCI function mask stays held, so the device cannot autonomously raise the MSI either; the completion stays polled. The literal system.cue fold (so a plain default cloudboot also emits provider-nic-bound from real progress without a focused manifest) is not yet implemented, to avoid perturbing the make run interactive shell/login boot; device-autonomous MSI-X is the parallel future work cloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-nic-bound-real-polled-driver on the default non-qemu kernel with no cloud_*_proof feature, on the make run-net device shape. No GCE resources are created; live_cloud=not-attempted.
  • Production cloud-boot evidence marker (virtio-net-device-bringup): the production boot path, under the focused-proof Cargo feature cloud_virtio_net_device_bringup_proof, drives a bounded virtio-net device bringup sequence kernel-side over the same virtio function the provider-nic-bound proof maps – but writes the virtio common-configuration status register (which provider-nic-bound never does). It is the first device-activation step toward the still-blocked cloud-gcp-virtio-net-nic-driver track. Emits one cloudboot-evidence: virtio-net-device-bringup <token> marker on the tools/cloudboot/ harness’s serial-port-1 path through make run-cloud-provider-virtio-net-bringup.

    • Spec basis: virtio 1.2 §3.1.1 device initialization (reset, ACKNOWLEDGE, DRIVER, feature discovery + driver-feature select, FEATURES_OK re-read, DRIVER_OK), §4.1 (modern virtio over PCI: common / notify / ISR / device / PCI-cfg capabilities, common-config register layout from Table 4.1).
    • Implemented wire-format subset: cap::virtio_net_device_bringup_proof::report (kernel/src/cap/virtio_net_device_bringup_proof.rs) picks the virtio-net PCI function (vendor VIRTIO_VENDOR_ID = 0x1af4, device VIRTIO_NET_TRANSITIONAL_DEVICE_ID = 0x1000 / VIRTIO_NET_MODERN_DEVICE_ID = 0x1041) from pci::enumerate, walks the modern virtio PCI vendor-capability chain through virtio_transport::parse_modern_pci_transport_capabilities, maps the resolved common-configuration region through pci::map_bar_region (UC + NX + GLOBAL + WRITABLE – same flags as the BAR-readback path), and drives the bringup using the shared MmioRegion accessors plus virtio_transport::{read_device_features, write_driver_features, STATUS_ACKNOWLEDGE, STATUS_DRIVER, STATUS_FEATURES_OK, STATUS_DRIVER_OK, STATUS_FAILED, VIRTIO_F_VERSION_1, COMMON_NUM_QUEUES}. The selected driver feature word is exactly VIRTIO_F_VERSION_1; no other device- or net-specific bit is accepted, so the proof never crosses into the queue-setup or descriptor surface the userspace virtio-net provider will own.
    • Fail-closed assertions: four inline assertions gate the marker. (1) Negotiated feature set: the device’s offered 64-bit feature word advertises VIRTIO_F_VERSION_1, the written driver feature word equals exactly device_features & VIRTIO_F_VERSION_1. (2) Queue count visibility: the live COMMON_NUM_QUEUES read returns >= 2 (virtio-net always exposes RX + TX virtqueues, which this proof does not publish). (3) DRIVER_OK observation: the post-DRIVER_OK status read carries STATUS_ACKNOWLEDGE | STATUS_DRIVER | STATUS_FEATURES_OK | STATUS_DRIVER_OK set with STATUS_FAILED clear. (4) Final reset: a write of 0 to device_status reads back as 0. The proof wraps the status sequence so every exit path (success or any intermediate failure) writes 0 to device_status before returning, leaving the device in its post-reset state regardless of outcome. Per-stage outcomes log on virtio-net-device-bringup: ok ... / virtio-net-device-bringup: ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: focused-proof child of provider-nic-bound that extends the proven bind composition with virtio’s status sequence, kernel-side, over the same mapped BAR. The PCI function-level MSIX_CONTROL_ENABLE bit stays untoggled, no queue is published, no descriptor is written, no doorbell is rung, and no userspace virtio-net provider cap is issued. The marker’s trailing labels (queue_setup=not-attempted, tx_descriptor=not-published, userspace_cap=not-issued, msix_function_enable=not-toggled, device_autonomous_raise=not-attempted, live_cloud=not-attempted) re-anchor those bounds. Queue setup, descriptor publication, doorbell writes, and a userspace virtio-net provider on the production cloud boot manifest stay deferred to the still-blocked cloud-gcp-virtio-net-nic-driver track. The cap::virtio_net_device_bringup_proof caller is gated to cfg(all(not(feature = "qemu"), not(feature = "cloud_provider_cap_waiter_proof"), feature = "cloud_virtio_net_device_bringup_proof")); the qemu build keeps make run-net / make run-ddf-provider-consumer as the end-to-end exercise of the same surface with the full driver.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-bringup, which boots the focused-proof cloudboot kernel + manifest under QEMU and asserts the marker on serial. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-features-ok): the userspace virtio handshake step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_features_ok_proof, the cap::devicemmio_grant_source_prod source stages the picked virtio-net function’s modern virtio common-configuration window (resolved through virtio_transport::parse_modern_pci_transport_capabilities, mapped at the region’s first byte) as a writable selected-write DeviceMmio grant (stage_virtio_net_common_config). The userspace cloud-prod-nic-driver-userspace-features-ok-smoke service then drives the virtio device handshake from userspace – the authority delta from the kernel-side virtio-net-device-bringup proof, which drives the same sequence in the kernel.

    • Authority delta: the handshake registers move from kernel-internal MMIO to a userspace driver over the existing DeviceMmio.read32/write32 path. The write admission (device_manager::stub::write_devicemmio_u32 under the feature) admits exactly four common-config registers – device_feature_select (0x00), driver_feature_select (0x08), driver_feature (0x0C), and device_status (0x14, written as a single byte) – each range-checked against the decoded BAR and kernel read-back-asserted (feature-register read-backs must echo the written value; device_status is left to the driver’s own re-read since the device may legitimately diverge). This is the same selected-write + range-check + read-back discipline the notify doorbell (notifyDoorbell @5) and the NVMe CC reset write (cloud_nvme_controller_reset_proof) already enforce – not a new write primitive.
    • Fail-closed assertions: the shim drives reset -> ACKNOWLEDGE -> DRIVER -> read device features -> write the negotiated driver features (VIRTIO_F_VERSION_1 only) -> FEATURES_OK, re-reading device_status to confirm FEATURES_OK stuck, then proves a queue_desc (0x20) write fails closed (result=write-blocked register_write=blocked). The released cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-features-ok <token> marker lands only after every assertion passes.
    • capOS mapping: the handshake step of the Phase C userspace NIC driver relocation. Queue/vring and IRQ ownership stay kernel-owned: queue-address registers fail closed, so no buffer address is ever programmed (the userspace-ownable vring over the DMA-isolation track is the next capability below). The marker’s trailing labels (handshake=features-ok, queue_setup=not-attempted, queue_address_write=blocked, vring=not-owned, irq=not-owned, driver_ok=not-attempted, live_cloud=not-attempted) re-anchor those bounds. The feature is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, and cloud_virtio_net_device_bringup_proof.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-features-ok. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-ownable-vring): the userspace-owned vring step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_ownable_vring_proof (which implies the handshake feature cloud_virtio_net_userspace_features_ok_proof), the cap::devicemmio_grant_source_prod source stages the writable common-config window and cap::dmapool_grant_source_prod stages a bounce-buffer DMAPool on the same virtio-net function. The userspace cloud-prod-nic-driver-userspace-ownable-vring-smoke service drives the handshake to FEATURES_OK, then allocates and owns its own virtqueue rings.

    • Authority delta: the queue-address-class registers move from kernel-internal MMIO (the virtio-net-tx-queue-materialization proof programs them in the kernel) to a userspace driver over the same DeviceMmio.write32 path. The write admission (device_manager::stub::write_devicemmio_u32 under the feature, admit_virtio_queue_address_write) admits queue_select (0x16) and queue_size (0x18) as range-checked pass-through selected writes, and the 64-bit queue_desc (0x20) / queue_driver (0x28) / queue_device (0x30) base registers via a token-resolve selected write: the driver writes the opaque per-buffer device-usable handle it learned from DMABuffer.info (deviceIova, scope bounce-handle), and the kernel resolves it against the live DMAPool grant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle, never an address the driver authored), and read-back-asserts. Reads of the queue-address base registers (0x20..0x38) are refused in read_devicemmio_u32, so the resolved host-physical address is never exposed to userspace (host_physical_user_visible stays 0). queue_enable stays fail-closed (it is armed by the queue-enable/DRIVER_OK capability below).
    • Reuses landed DMA isolation: the ring pages are manager-owned DMAPool bounce buffers under the landed scrub-before-free / owner+slot generation / quiesce-before-release discipline (kernel/src/device_dma.rs); the no-host-physical-exposure posture (host_physical_user_visible=0, iova_export=disabled-future-only) is unchanged. This capability is wiring, not a new isolation backend. The opaque device-usable handle is a deterministic, non-address encoding of the buffer’s manager-owned identity under a fixed tag, so it can never collide with a page-aligned host-physical address and carries no host-physical information.
    • Fail-closed assertions: the shim allocates its descriptor / available / used ring pages, programs each handle, then proves a queue-address read is refused and that an out-of-grant handle, a raw host-physical-looking value (0x40000000), and a stale (freed-buffer) handle each fail closed (result=write-blocked register_write=blocked) before any MMIO write. The released DeviceMmio cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-ownable-vring <token> marker lands only after every assertion passes.
    • capOS mapping: the userspace-owned vring step of the Phase C userspace NIC driver relocation. The marker’s trailing labels (vring=userspace-owned, queue_address_programming=token-resolved, host_physical_user_visible=0, provider_visible_queue_address=hidden, iova_export=disabled-future-only, out_of_grant=blocked, host_physical=blocked, stale_generation=blocked, queue_enable=not-attempted, driver_ok=not-attempted, irq=not-owned, live_cloud=not-attempted) re-anchor those bounds.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is the probe-selected default without a guest IOMMU). Proved locally by make run-cloud-prod-nic-driver-userspace-ownable-vring. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-queue-enable-driver-ok): the userspace queue-enable / DRIVER_OK step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo feature cloud_virtio_net_userspace_queue_enable_driver_ok_proof (which implies the ownable-vring feature cloud_virtio_net_userspace_ownable_vring_proof), the userspace cloud-prod-nic-driver-userspace-queue-enable-driver-ok-smoke service drives the handshake to FEATURES_OK and programs its owned vring exactly as the ownable-vring capability does, then completes device bring-up from userspace: it arms its programmed TX queue and writes DRIVER_OK.

    • Authority delta: two more writes join the handshake/ownable-vring selected-write admission, both under the same range-check + read-back discipline. (1) queue_enable (0x1c, u16): a range-checked pass-through selected write, admitted by device_manager::stub::write_devicemmio_u32 (admit_virtio_queue_address_write) only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): the kernel reads the active queue_desc/queue_driver/queue_device back kernel-side and requires each to currently hold the host-physical address of a live granted DMABuffer on this device (a freed buffer’s stale address no longer matches a live buffer, so it cannot arm a use-after-free DMA target), and requires the active queue_size to fit every split-ring structure (16*size desc table, 6+8*size used ring, 6+2*size avail ring) inside one granted bounce page. An enable of an unprogrammed, freed, or oversized queue fails closed before any MMIO side effect; the enable is read-back-asserted. Once a queue is enabled its vring base registers are immutable – a queue-address repoint (even with an otherwise-valid live token) is refused (devicemmio-queue-address-immutable-after-enable) so the driver cannot mutate the vring under a running device. (2) DRIVER_OK (a bit in device-status 0x14): the device-status register is already writable (from the handshake capability), but setting DRIVER_OK is kernel-asserted – the kernel re-reads device-status and fails closed (devicemmio-driver-ok-not-observed) unless the device latched the full ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK byte exactly (rejecting FAILED 0x80, DEVICE_NEEDS_RESET 0x40, and any reserved bit), so a userspace driver cannot claim a brought-up device the hardware did not accept.
    • Reuses landed DMA isolation: this capability adds no new register write primitive, no new isolation backend, and no host-physical exposure. It reuses the ownable-vring bounce / DMAPool / DeviceMmio grants and writable window unchanged; queue-address reads (0x20..0x38) stay refused (host_physical_user_visible=0). The enable binds to live, page-fitting, post-enable-immutable queue memory, so the device is never armed at a freed, oversized, or mutated vring.
    • Bounded residual (handled by the RX bring-up capability below): the enable’s live + page-fit check is point-in-time and matches by host-physical address, not buffer identity; it does not pin the ring buffers against freeBuffer / process-teardown release while the queue is enabled. Both are use-after-free-DMA hazards only once a descriptor is posted and the doorbell rung – which this capability never does (frame_tx=not-attempted; the RX queue is never enabled; the TX queue is kick-driven), so no device DMA is reachable here and DMA stays confined to the granted bounce pool. Buffer-identity binding and pinning are the data path’s responsibility (vring_buffer_pinning=deferred-slice-4); tracked by the userspace RX/DMA task records.
    • Fail-closed assertions: the shim proves the ownable-vring out-of-grant / host-physical / stale (freed-throwaway-buffer) queue-address writes fail closed and an enable of the unprogrammed RX queue (index 0) fails closed, then arms the programmed TX queue (queue_enable=1, register_write=performed), sets DRIVER_OK and re-reads device-status to confirm the full brought-up byte, and proves a post-enable queue-address repoint (with an otherwise-valid live token) fails closed. The released DeviceMmio cap fails closed on the next call. The headline cloudboot-evidence: nic-driver-userspace-queue-enable-driver-ok <token> marker lands only after every assertion passes.
    • capOS mapping: the userspace queue-enable / DRIVER_OK step of the Phase C userspace NIC driver relocation. The marker’s trailing labels (vring=userspace-owned, queue_enable=performed, unprogrammed_queue_enable=blocked, device_brought_up=driver-ok, status_full=0f, driver_ok=observed, vring_live_bound=enforced, queue_size_fits_grant=enforced, post_enable_immutable=blocked, host_physical_user_visible=0, provider_visible_queue_address=hidden, frame_tx=not-attempted, nic_cap=not-implemented, irq=not-owned, live_cloud=not-attempted) re-anchor those bounds. The Nic-cap TX/RX round-trip (no frame crosses the wire here) and userspace IRQ ownership are later capabilities below.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is the probe-selected default without a guest IOMMU). Proved locally by make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-rx-bringup): the userspace RX-queue bring-up step of the Phase C NIC-driver relocation track. Under cloud_virtio_net_userspace_rx_bringup_proof (implies the queue-enable feature) the cloud-prod-nic-driver-userspace-rx-bringup-smoke service brings up the RX virtqueue (index 0) over its own vring – the handshake/vring/enable capabilities above brought up only the TX queue (index 1); the queue_enable admission is queue-agnostic, so RX bring-up reuses it.

    • capOS mapping: the kernel (device_manager::stub) retains each programmed queue’s vring physes + originating DMABuffer handle identity on ProductionDeviceRecord (admit_virtio_queue_address_write), binds queue_enable to that identity (selected_queue_identity_bound; a freed/realloc’d handle fails closed with devicemmio-queue-enable-identity-mismatch), and pins the ring buffers against freeBuffer / process-teardown release while the queue is enabled (blocked_pinned_enabled_vringdmabuffer-pinned-enabled-vring), released only on queue disable/reset with device quiesce. This closes the queue-enable capability’s pre-migration buffer-lifetime/identity residual at the bring-up boundary. Marker labels: rx_queue_brought_up=driver-ok, buffer_identity_bound=enforced, vring_buffer_pinning=enforced, pinning_free_while_enabled=blocked, int_injected=0, nic_cap=not-implemented, irq=not-owned, live_cloud=not-attempted.
    • First real RX DMA: the same feature also drives the first real RX DMA from the shim-owned vring. The shim also brings up TX queue 1 over its own vring, posts one device-writable RX receive buffer on queue 0 (DMABuffer.submitDescriptor), and rings the production DeviceMmio.notifyDoorbell @5. capOS mapping: the kernel maps the notify region kernel-side and captures the per-queue notify slot offsets (cap::devicemmio_grant_source_prod), and provider_notify_doorbell_write_for_cap (was Err(stale_handle)) is now a live drive; the RX-DMA flow (cap::virtio_net_userspace_rx_dma_proof, byte-level vring helpers duplicated from cap::virtio_net_polled_provider to protect run-net) writes the RX descriptor + avail over the shim’s retained RX vring physes, rings the RX doorbell, submits a kernel-half ARP request over the shim’s retained TX vring physes, polls one real device->host completion (used_len > 0, observed EtherType 0x0806), resets the device (quiescing both queues and releasing the ring-buffer pins via mark_retained_vring_queue_disabled), and latches the used-ring index. Completion stays kernel-latched used-ring polled (int_injected=0, no Interrupt cap). Marker labels add tx_queue_brought_up=driver-ok, frame_rx=performed, rx_used_ring=kernel-latched. The kernel emits one virtio-net-userspace-rx-dma: rx_dma=performed ... used_len=<n> ethertype=0x0806 device_reset=ok queues_cleared=ok int_injected=0 evidence line.
    • Not yet implemented: the deterministic freed-then-reallocated-frame identity negative (identity_realloc_negative=deferred-needs-allocator-reuse-seam). The capos-lib FrameBitmap is next-fit and free_frame does not rewind next_hint, so the allocation after a free never returns the just-freed frame; a deterministic same-phys realloc (needed to reach the buffer-identity gate rather than the host-physical gate) requires an allocator reuse seam. Tracked by cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.
    • Future work: the Nic-cap round-trip (the next capability below, unblocked by this data path) and userspace IRQ ownership.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-rx-bringup. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-nic-cap-roundtrip): the Nic-cap round-trip step of the Phase C NIC-driver relocation track. It implements the handshake-step Nic interface stub as a live CapObject. Under the cloud_virtio_net_userspace_nic_cap_roundtrip_proof feature (implies the RX-bring-up feature) the cloud-prod-nic-driver-userspace-nic-cap-roundtrip-smoke service brings the device fully up from userspace (RX queue 0 + TX queue 1 enabled, DRIVER_OK), then holds a typed Nic cap and round-trips two sequential frames. capOS mapping:

    • The new nic KernelCapSource (registered in capos-config manifest.rs
      • lib.rs::NIC_INTERFACE_ID + the nic @49 schema/capos.capnp KernelCapSource enum value; client NicClient in capos-rt) is granted from cap::nic_grant_source_prod, which maps the picked virtio-net function’s device-config window kernel-side for macAddress/linkStatus and binds the Nic cap to that BDF.
    • transmit/receive drive the shim’s retained vring physes through cap::virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive} (reusing the RX-bring-up byte-level vring helpers) with manager-owned kernel bounce payloads – not a shim-submitted DMABuffer – so a frame crosses the cap boundary as inline Data with host_physical_user_visible = 0 and no device-usable handle exported. receive drives the coupled ARP-TX-stimulus + RX-poll and returns the frame inline + observed EtherType; transmit stages a frame into a manager-owned TX page and rings notify_doorbell @5.
    • The device is left live for the cap’s lifetime (a monotonic per-queue avail cursor lets transmit and receive compose without re-enabling) and quiesced once on cap release (nic_quiesce: device reset + queues-cleared assertion + mark_retained_vring_queue_disabled to release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, no Interrupt cap); no new selected-write register beyond the landed handshake / ownable-vring / queue-enable set. The kernel emits two virtio-net-userspace-nic-cap: receive ... used_len=<n> ethertype=0x0806 int_injected=0 host_physical_user_visible=0 evidence lines and a virtio-net-userspace-nic-cap: quiesce ... device_reset=ok queues_cleared=ok line on release. The proof also covers lifecycle ordering: a DMAPool cap release while ring buffers are still live records pending-buffer-release, an early release of one pinned ring DMABuffer records dmabuffer-pinned-enabled-vring, Nic quiesce replays that buffer detach after the queues are reset, and the pending parent pool release completes only after the remaining ring buffers are freed.
    • Future work: the clean independent TX/RX split and userspace IRQ ownership (both later capabilities below).
    • QEMU-emulable vs hardware-only: fully QEMU-emulable (the RX reply is QEMU SLIRP’s ARP answer to the kernel-half stimulus). Proved locally by make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip. No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-irq-ownership): the userspace RX-interrupt-lifecycle ownership step of the Phase C NIC-driver relocation track. It gives the userspace NIC driver real RX-interrupt-lifecycle ownership. The Nic-cap round-trip capability above has int_injected = 0 and no Interrupt cap on the data path; this capability adds a real Interrupt cap whose wait/acknowledge/mask/unmask drive the route’s MSI-X vector-control + deferred LAPIC EOI (the frame bytes still arrive via Nic.receive’s used-ring read). Under the cloud_virtio_net_userspace_irq_ownership_proof feature (implies the nic-cap-roundtrip feature) the cloud-prod-nic-driver-userspace-irq-ownership-smoke service holds a DeviceMmio + DMAPool + Nic + Interrupt cap on the same virtio-net function. capOS mapping:

    • A new Interrupt grant source (cap::virtio_net_userspace_irq_ownership_proof) replaces the admission-only interrupt_grant_source_prod source via the KernelCapSource::Interrupt arm under this feature. At boot it programs the staged virtio-net function’s RX MSI-X route (table entry 0) mask-first through the landed always-built cap::interrupt_programmed::program_attach_arm_unmask (route register / claim / MSI-X table map+write / device-manager attach / deferred-LAPIC-EOI arm / unmask) and tears it down (teardown) on cap release.
    • The Interrupt cap’s methods are real for this device RX route: wait blocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch; delivery_count advances, so int_injected flips from 0 – the Nic-cap round-trip capability had no Interrupt cap on the data path at all). The wake is a bounded kernel-injected dispatch (not yet a device-autonomous raise causally tied to a frame), and Nic.receive still reads the frame bytes from the used ring, so the delta is IRQ-lifecycle ownership (real wait/acknowledge/mask/unmask), not interrupt-coalesced RX completion. acknowledge retires exactly one deferred LAPIC EOI through device_interrupt::acknowledge_deferred_lapic_eoi_for_route (hardwareDispatchAckDelta = 1, the one-ack-per-delivery / hardware_eoi_delta invariant); mask/unmask toggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2: table-mask then route-state on mask; route-state then table-unmask on unmask) through pci::set_msix_table_entry_mask + device_interrupt::{mask,unmask}_device_manager_attached_route (driver-unmasked <-> claimed-masked).
    • The driver brings the device up from userspace (the nic-cap-roundtrip bring-up verbatim), drives the owned RX-interrupt lifecycle (info/wait/acknowledge/mask/ unmask/release), and reads the completed frame back through Nic.receive (inline Data, host_physical_user_visible = 0). The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted, waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the owned-vring grants, and buffer-identity / ring-buffer pinning are reused unchanged (queue-address reads still refused). No new Interrupt interface or method.
    • Future work: the clean independent TX/RX split (the next capability below); the device-autonomous MSI-X raise (program the device RX queue_msix_vector + clear the PCI function mask) and the smoltcp network-stack relocation.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-irq-ownership (one cloudboot-evidence: nic-driver-userspace-irq-ownership <token> marker). No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-clean-tx-rx-split): the independent-TX/RX step of the Phase C NIC-driver relocation track. It decouples the last data-path coupling – the userspace NIC driver’s Nic.transmit and Nic.receive become truly independent. In the nic-cap-roundtrip / IRQ-ownership capabilities, Nic.receive (virtio_net_userspace_rx_dma_proof::nic_receive) self-stimulated by submitting a kernel-half ARP TX over the retained TX vring inside the same call. Under the cloud_virtio_net_userspace_clean_tx_rx_split_proof feature (implies the irq-ownership feature) the Nic cap’s receive @1 dispatches instead to nic_receive_independent. capOS mapping:

    • nic_receive_independent posts a manager-owned device-writable RX buffer on the retained RX vring, rings the RX doorbell, waits on the driver’s OWNED RX interrupt route (the IRQ-ownership device_interrupt::wait_kernel_injected_dispatch dispatch slot, resolved through virtio_net_userspace_irq_ownership_proof::owned_rx_route; int_injected flips from 0), retires the deferred LAPIC EOI (acknowledge_deferred_lapic_eoi_for_route), then polls the RX used ring and reads the completed frame – with no internal ARP-TX self-stimulus (it never submits to the TX vring; the kernel diagnostic reports tx_submissions=0 self_stimulus=removed).
    • The RX frame is driven by an external stimulus: the consumer’s preceding independent Nic.transmit of a real broadcast ARP request (who-has the QEMU SLIRP gateway 10.0.2.2). SLIRP answers; the inbound reply is held in the host net queue until receive posts the RX buffer + kicks the RX queue.
    • Nic.transmit stays independent: it submits the caller’s frame to the TX vring and rings the TX doorbell with no RX involvement (the kernel diagnostic reports rx_polls=0 rx_submissions=0). Neither call performs the other’s submission.
    • The wake stays the bounded kernel-injected dispatch the IRQ-ownership capability owns (waiter_wake=kernel-injected-dispatch, device_autonomous_raise=not-attempted). The landed owned-vring / owned-IRQ / DMA-isolation, the writable common-config window, and the buffer-identity / ring-buffer pinning are reused unchanged: no new selected-write register, no new MSI-X surface, no new Nic/Interrupt method, no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused).
    • Follow-up work: DHCP/IPv4 configuration, legacy kernel socket-path retirement, kernel smoltcp / virtio-net hot-path removal, and the device-autonomous MSI-X raise. The 7c-ii(b) serve-from-userspace local proof is now landed.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split (one cloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token> marker). No GCE resources are created.
  • Production cloud-boot evidence marker (nic-driver-userspace-sustained-receive-pool): Phase C slice 7d (DONE 2026-06-04) adds the sustained-receive Nic ABI the multi-frame TCP path (7c-iii) needs. The landed receive @1 is single-frame + reset-on-empty-poll; this adds a non-resetting poll over a kernel-owned bounce RX pool. Under the cloud_virtio_net_userspace_sustained_receive_pool_proof feature (implies the clean-split feature) the Nic cap serves receivePoll @4 (cap::nic_grant_source_prod -> virtio_net_userspace_rx_dma_proof::nic_receive_poll). capOS mapping:

    • Arm. On first receivePoll the kernel allocates NIC_RX_POOL_SIZE manager-owned bounce RX frames (frame::alloc_frame_zeroed), posts one device-writable descriptor + avail entry per frame on the retained RX vring, publishes avail.idx, and rings the RX doorbell. The device masters only into these kernel-private pages; no host-physical or device-usable address is exported (host_physical_user_visible = 0).
    • Drain one per poll. Each receivePoll re-kicks the RX doorbell (so QEMU flushes a queued inbound frame into an armed buffer during the MMIO VM exit) and reads the RX used ring. If it advanced, the kernel copies the frame out into the inline Data reply (bounded by the posted buffer length) and recycles that bounce slot.
    • The per-buffer invariant replacing reset-before-reclaim. A bounce slot is re-exposed to the device only after its copy-out completes and its slot generation is bumped, with the slot scrubbed before the re-post – the production handle-epoch slot identity (docs/dma-isolation-design.md) applied at recycle granularity instead of device-reset granularity. The device is not reset per frame (device_reset=none); teardown (on_release via nic_quiesce, or an unprovable in-flight-DMA error) still quiesces (reset + queues cleared) and scrubs + frees the pool.
    • No frame yet. If the used ring did not advance, receivePoll returns framePresent = false with no reset and the device stays armed (device_armed=true) – the cheap speculative poll a smoltcp phy::Device RX token needs. receive @1 semantics are unchanged.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-nic-driver-userspace-sustained-receive-pool (one cloudboot-evidence: nic-driver-userspace-sustained-receive-pool <token> marker after draining more than one frame with at least one non-resetting empty poll). No GCE resources are created.
    • Follow-up work: DHCP/IPv4 configuration consumes the served socket path; later cleanup removes or fixture-gates the legacy kernel socket path and kernel smoltcp / virtio-net hot path. The 7c-ii(b) production manifest proof now consumes the userspace-served TcpListenAuthority, TcpListener, and TcpSocket substrate locally.
  • Production cloud-boot evidence marker (network-stack-process-smoltcp-skeleton): Phase C slice 7a (first increment, DONE 2026-06-03) is the first time a real TCP/IP stack runs outside the kernel over the relocated NIC authority. A userspace network-stack process builds an smoltcp Interface (Ethernet medium, MAC from Nic.macAddress, static IPv4 10.0.2.15/24) over a phy::Device adapter whose RX/TX is the slice-6 independent Nic.receive/Nic.transmit, clocked by the Timer cap monotonic source (monotonic_ns). capOS mapping:

    • The phy::Device adapter is buffered: outbound frames smoltcp produces queue in a process-local Vec that the poll loop drains and submits via Nic.transmit; one inbound frame fetched via Nic.receive is handed back for smoltcp to consume. The adapter holds no vring, DMA handle, or host-physical address – every frame is a process-local byte buffer crossing the cap boundary as inline Data through the manager-owned bounce page (host_physical_user_visible = 0).
    • The proof is that smoltcp – not hand-rolled frame code – drives the exchange: a UDP datagram queued to the on-link gateway makes smoltcp emit an ARP request (out through Nic.transmit), the SLIRP ARP reply is consumed (in through Nic.receive, EtherType 0x0806), and – with the neighbour now resolved – smoltcp emits the queued IPv4/UDP datagram, so the neighbour cache observably advances (smoltcp_tx_arp>=1, smoltcp_rx_consumed>=1, smoltcp_tx_ipv4>=1). The internal smoltcp UDP socket is only an egress stimulus; no socket capability is exposed.
    • Implementation note: the landed Nic cap rides on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself before running smoltcp. Relocating the bring-up to a separate long-lived NIC-driver service is folded into the slice-7c contract relocation.
    • Out of scope: the socket caps (slice 7b), the cap/network.rs contract relocation (slice 7c, virtio_stub.rs stays fail-closed), and the kernel smoltcp / virtio-net removal (slice 8).
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-prod-network-stack-process-smoltcp-skeleton (one cloudboot-evidence: network-stack-process-smoltcp-skeleton <token> marker). No GCE resources are created.
  • Production cloud-boot evidence marker (network-stack-smoltcp-socket-caps): Phase C slice 7b (DONE 2026-06-03) adds a userspace UdpSocket cap layer on top of the slice-7a substrate: the userspace network-stack process now implements the UdpSocket schema’s sendTo/recvFrom semantics (UdpSocketCapLayer) over the same smoltcp Interface and proves one bounded UDP request/response through it. capOS mapping:

    • The socket layer drives the slice-7a phy::Device/Nic pump: sendTo resolves the destination’s on-link ARP (one Nic.receive for the guaranteed ARP reply, EtherType 0x0806) and transmits the datagram through Nic.transmit; recvFrom fetches the single solicited reply datagram through Nic.receive (EtherType 0x0800) and returns it. Frames stay process-local byte buffers (host_physical_user_visible = 0); a queue-address read stays refused.
    • The request/response is a DNS A query for example.com to SLIRP’s built-in resolver at 10.0.2.3:53 (the same resolver the C posix-dns-resolver smoke uses); the decoded response is returned through recvFrom and the proof asserts source 10.0.2.3:53, the transaction-id/QR/RCODE correlation, and a decoded A record. The landed Nic.receive resets the device on an empty poll, so the proof only receives when a reply is guaranteed pending and spaces a Timer pre-delay before the datagram receive.
    • Honest boundary: the socket layer is in-process – it implements the socket interface semantics over the userspace stack but does not yet serve them as inter-process transferable capabilities, and it does not touch the production kernel/src/cap/network.rs contract (virtio_stub.rs stays fail-closed). Preserving that contract behind a userspace network-stack service is slice 7c.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS forwarder). Proved locally by make run-cloud-prod-network-stack-smoltcp-socket-caps (one cloudboot-evidence: network-stack-smoltcp-socket-caps <token> marker). No GCE resources are created.
  • Production cloud-boot evidence marker (userspace-network-stack-smoltcp): Phase C slice 7c, first increment (DONE 2026-06-03) serves the slice-7b UdpSocketCapLayer as a real inter-process transferable capability. capOS mapping:

    • A network-stack server process holds the bring-up caps plus an exported Endpoint; after bring-up it serves the UdpSocket schema (sendTo/recvFrom/close) over that endpoint, driving the same UdpSocketCapLayer on its own ring (decoding/encoding the capnp params and results). A separate client process holds only Console and the served cap; it re-interprets the imported Endpoint as a UdpSocket and drives one bounded DNS A query/response through the production UdpSocketClient.
    • smoltcp still moves every frame through the Nic cap (ARP reply EtherType 0x0806 + DNS reply 0x0800 through Nic.receive); host_physical_user_visible = 0 is preserved and a queue-address read stays refused. On close the server releases its owned RX Interrupt (route_torn_down=ok).
    • Honest boundary: the UdpSocket contract lives behind a userspace network-stack service. Later Phase C increments added the TcpListener / TcpSocket substrate, inter-process serving, and the local 7c-ii(b) serve-from-userspace manifest proof for TcpListenAuthority. DHCP/IPv4, Web UI L4, private GCE reachability, public ingress, and legacy kernel-socket cleanup remain separate work.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS forwarder). Proved locally by make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc (one cloudboot-evidence: userspace-network-stack-smoltcp <token> marker). No GCE resources are created.
  • Production cloud-boot evidence marker (virtio-net-tx-authority-bundle): under the focused-proof Cargo feature cloud_virtio_net_tx_authority_bundle_proof, the cloudboot kernel layers a bundle observer (cap::virtio_net_tx_authority_bundle_proof) on top of the three existing production grant sources (devicemmio_grant_source_prod, dmapool_grant_source_prod, interrupt_grant_source_prod). Under the feature, the DeviceMmio source filters its PCI candidate to the same virtio/NVMe-class function the DMAPool and Interrupt sources already match, so all three grants land on the same virtio-net function. Exposed through make run-cloud-provider-virtio-net-tx-authority-bundle.

    • Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The bundle reuses the existing prod sources’ grant + per-cap on-release surfaces and asserts the bundle identity over their issue and release notifications via the record_devicemmio_grant/record_dmapool_grant/ record_interrupt_grant/record_devicemmio_release/ record_dmapool_release/record_interrupt_release hooks called from the existing build_cap_for_grant / on_release / release_cap paths.
    • Fail-closed assertions: the userspace cloud-provider-virtio-net-tx-authority-bundle-smoke service calls info on each of the three caps and asserts they all report the same BDF. The kernel-side bundle observer records each grant’s (bdf, generation) identity at issue and at release; the headline cloudboot-evidence: virtio-net-tx-authority-bundle <token> marker is emitted only after all three caps have been issued and released and same_dm/same_dp/same_ir/same_bdf all evaluate true. A BDF mismatch logs virtio-net-tx-authority-bundle: assertion regression: ... and leaves the marker unprinted. Per-cap stale-handle fail-closed is inherited from the existing prod sources’ validate_*_record paths; the smoke re-tests it explicitly after each release.
    • capOS mapping: bundle authority composition over the DeviceMmio + DMAPool + Interrupt grant arms; first child of the blocked cloud-prod-virtio-net-userspace-provider-tx-local-proof parent. No virtio queue setup, no descriptor publication, no notify doorbell, no PCI function-level MSI-X enable, no Interrupt.wait, no TX completion claim, no live cloud traffic. The marker’s trailing labels (same_bdf=true, queue_setup=not-attempted, tx_descriptor=not-published, notify=not-rung, msix_function_enable=not-toggled, tx_completion=not-claimed, live_cloud=not-attempted) re-anchor those bounds. The bundle feature is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, and cloud_virtio_net_device_bringup_proof at the cap::mod.rs activation site.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-authority-bundle. No GCE resources are created.
  • Production cloud-boot evidence marker (virtio-net-tx-queue-materialization): under the focused-proof Cargo feature cloud_virtio_net_tx_queue_materialization_proof, the cloudboot kernel runs cap::virtio_net_tx_queue_materialization_proof (kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. The proof materializes one manager-owned TX virtqueue: it allocates three zeroed physical frames from the kernel frame allocator, programs the TX queue’s common-configuration QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE + QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 to device_status and asserts every TX queue-state register has cleared to 0. Exposed through make run-cloud-provider-virtio-net-tx-queue-materialization.

    • Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
    • Implemented wire-format subset: the proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1 only) / FEATURES_OK, asserts COMMON_NUM_QUEUES >= 2, writes COMMON_QUEUE_SELECT = 1 (TX), reads COMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames through mem::frame::alloc_frame_zeroed, programs COMMON_QUEUE_DESC / COMMON_QUEUE_DRIVER / COMMON_QUEUE_DEVICE with the resulting host-physical addresses + COMMON_QUEUE_ENABLE = 1, reads every queue register back through the MmioRegion accessors (the proof grew a read_u64 companion to the existing write_u64) and asserts the values match, sets DRIVER_OK, then writes 0 to device_status and asserts post-reset COMMON_QUEUE_ENABLE / ..._DESC / ..._DRIVER / ..._DEVICE are all 0 after re-selecting queue 1. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex>.
    • Fail-closed assertions: five inline assertions gate the marker. (1) Initial reset reads back as 0. (2) Negotiated feature set matches exactly VIRTIO_F_VERSION_1. (3) Post-DRIVER_OK status reads back with ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear. (4) Programmed queue addresses + enable read back exactly as written. (5) Post-reset re-read of the TX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 to device_status and frees every allocated frame back to the bitmap before returning. Per- stage outcomes log on the virtio-net-tx-queue-materialization: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: focused-proof child of the TX authority bundle that extends the proven bundle composition with one round of real common-configuration queue setup + reset cleanup. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in (the picker filter is_bundle_candidate_class fires under either feature) so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature because its queue_setup=not-attempted claim would be inaccurate now. The queue-materialization marker’s trailing labels (tx_descriptor=not-published, notify=not-rung, msix_function_enable=not-toggled, tx_completion=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (descriptor publication, notify doorbell, MSI-X function enable, userspace submit, used-ring polling, live cloud) carry. The cap::virtio_net_tx_queue_materialization_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-queue-materialization. No GCE resources are created.
  • Production cloud-boot evidence marker (virtio-net-rx-queue-materialization): under the focused-proof Cargo feature cloud_virtio_net_rx_queue_materialization_proof, the cloudboot kernel runs cap::virtio_net_rx_queue_materialization_proof (kernel/src/cap/virtio_net_rx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. It is the structural mirror of the TX queue-materialization proof, one virtqueue index over: it materializes one manager-owned RX virtqueue (queue index 0) instead of the TX virtqueue (queue index 1). The proof allocates three zeroed physical frames from the kernel frame allocator, programs the RX queue’s common-configuration QUEUE_DESC / QUEUE_DRIVER / QUEUE_DEVICE + QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 to device_status and asserts every RX queue-state register has cleared to 0. Exposed through make run-cloud-provider-virtio-net-rx-queue-materialization.

    • Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
    • Implemented wire-format subset: identical to the TX queue-materialization proof except it writes COMMON_QUEUE_SELECT = 0 (RX) instead of 1 (TX) and re-selects queue 0 for the post-reset read-back. The proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1 only) / FEATURES_OK, asserts COMMON_NUM_QUEUES >= 2, reads COMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames through mem::frame::alloc_frame_zeroed, programs COMMON_QUEUE_DESC / COMMON_QUEUE_DRIVER / COMMON_QUEUE_DEVICE + COMMON_QUEUE_ENABLE = 1, reads every queue register back through the MmioRegion accessors and asserts the values match, sets DRIVER_OK, then writes 0 to device_status and asserts post-reset COMMON_QUEUE_ENABLE / ..._DESC / ..._DRIVER / ..._DEVICE are all 0 after re-selecting queue 0. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex> (with q.0 for RX).
    • Fail-closed assertions: the same five inline assertions gate the marker as in the TX proof — initial reset reads back 0; negotiated feature set is exactly VIRTIO_F_VERSION_1; post-DRIVER_OK status has ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear; programmed queue addresses + enable read back exactly as written; post-reset re-read of the RX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 to device_status and frees every allocated frame back to the bitmap before returning. Per-stage outcomes log on the virtio-net-rx-queue-materialization: ok ... / ... failed closed: ... lines.
    • capOS mapping: focused-proof sibling of the TX queue-materialization proof that drives the same kernel-side queue setup + reset cleanup against the receive virtqueue. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in through its shared cfg gate so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature (it is gated on cloud_virtio_net_tx_authority_bundle_proof, which this feature does not enable) because its queue_setup=not-attempted claim would be inaccurate now. The RX-queue-materialization marker’s trailing labels (rx_buffer=not-posted, avail=not-published, notify=not-rung, rx_completion=not-claimed, msix_function_enable=not-toggled, provider_visible_queue_address=hidden, iova_export=disabled-future-only, device_autonomous_raise=not-claimed, live_cloud=not-attempted) re-anchor the bounds the descendant slices (receive-buffer post, avail publication, notify doorbell, used-ring consumption, MSI-X function enable, live cloud) carry. The cap::virtio_net_rx_queue_materialization_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at the cap::mod.rs activation site.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-queue-materialization. No GCE resources are created. Live-GCE RX stays under cloud-gcp-virtio-net-nic-driver.
  • Production cloud-boot evidence marker (virtio-net-rx-buffer-post): under the focused-proof Cargo feature cloud_virtio_net_rx_buffer_post_polled_completion_proof (which implies cloud_virtio_net_rx_queue_materialization_proof so the bundle observer

    • production grant-source pickers + userspace bundle smoke keep their plumbing), the cloudboot kernel runs cap::virtio_net_rx_buffer_post_polled_completion_proof (kernel/src/cap/virtio_net_rx_buffer_post_polled_completion_proof.rs) over the same virtio-net function the authority bundle picks. It is the RX analogue of the TX submit-doorbell -> polled-completion progression: it materializes the RX virtqueue (queue index 0) AND the TX virtqueue (queue index 1, the SLIRP stimulus path), sets DRIVER_OK, posts ONE manager-owned device-writable receive buffer to the RX avail ring, rings the RX notify doorbell once, fills and TX-submits ONE broadcast ARP request as the SLIRP stimulus, rings the TX notify doorbell once, then polls the manager-owned RX used ring with a bounded spin budget until used.idx == 1 and asserts used[0].id == 0 and used[0].len > 0 — ONE real device->host RX DMA landed in the manager-owned bounce page. Exposed through make run-cloud-provider-virtio-net-rx-buffer-post.
    • Spec basis: virtio 1.2 §2.1.2 (reset clears virtqueue state), §2.7 (split-ring queue layout), §2.7.6 (available ring: flags @0, idx @2, ring @4), §2.7.7 (VIRTQ_AVAIL_F_NO_INTERRUPT), §2.7.8 (used ring layout), §4.1.4.3 (common configuration queue registers), §4.1.5.2 (notify doorbell), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1), §5.1.6 (12-byte modern virtio-net header).
    • Implemented wire-format subset: stages 1-9 materialize the RX queue (index 0) and the TX queue (index 1) identically to the queue- materialization proof (modern status sequence to FEATURES_OK, VIRTIO_F_VERSION_1 only, COMMON_NUM_QUEUES >= 2, clamp COMMON_QUEUE_SIZE to a power of two <= MAX_QUEUE_SIZE = 256, allocate desc/avail/used frames per queue, program the per-queue registers + QUEUE_ENABLE = 1, read-back assert), then DRIVER_OK. The RX-DMA delta authors RX descriptor slot 0 over the HHDM (addr = rx_payload_phys, len = 2048, flags = VIRTQ_DESC_F_WRITE, next = 0), sets the RX avail ring (flags = VIRTQ_AVAIL_F_NO_INTERRUPT, ring[0] = 0, release fence, idx = 1), maps the modern notify region bounded to the smallest page covering both per-queue notify slots, and rings the RX-queue notify doorbell. The stimulus mirrors virtio.rs::write_arp_request_frame: it fills one TX payload frame with a broadcast ARP request for the SLIRP gateway IP (10.0.2.2), authors TX descriptor slot 0 (flags = 0, device-readable), sets the TX avail ring (also VIRTQ_AVAIL_F_NO_INTERRUPT), and rings the TX-queue notify doorbell. The proof then polls the RX used.idx with a bounded spin budget (POLL_USED_RING_BUDGET = 50_000_000, an order of magnitude above the in-kernel ARP_RX_POLL_LIMIT = 500_000) and reads the device-authored used[0].(id, len) plus the observed EtherType. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-rxnotify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-rxdesc.<hex>-rxdrv.<hex>-rxdev.<hex>-rxpay.<hex>-rxlen.<u>-availidx.<u>-usedidx.<u>-usedid.<u>-usedlen.<u>-ethertype.<hex>.
    • Fail-closed assertions: the queue-materialization assertions gate both queue setups (initial reset reads 0; negotiated features exactly VIRTIO_F_VERSION_1; post-DRIVER_OK status has ACK|DRIVER|FEATURES_OK|DRIVER_OK set and FAILED clear; programmed queue addresses + enable read back exactly as written). The RX-DMA delta adds: the polled used.idx reaches 1 within the spin budget (else fail closed, no marker); the post-completion RX avail.idx reads back 1; used[0].id == 0 (the published descriptor head); used[0].len > 0 (a real device->host frame); and the post-reset re-read of both the RX and TX queue-state registers reports every register cleared to 0. The proof resets the device on every exit path (success or any intermediate failure) and frees the eight manager-owned frames (RX desc/avail/used, TX desc/avail/used, RX payload, TX payload) only after a confirmed reset read-back of 0; if reset cannot be confirmed the frames stay retained so the device cannot DMA into a freed page. Per-stage outcomes log on the virtio-net-rx-buffer-post: ok ... / ... failed closed: ... lines.
    • capOS mapping: focused-proof child of the RX queue-materialization proof that drives the first real RX DMA on the production cloud kernel. Completion is polled only: MSI-X stays disabled, no MSI-X table entry is programmed, no Interrupt waiter is installed, no dispatch slot is claimed, and both avail rings carry VIRTQ_AVAIL_F_NO_INTERRUPT so the device does not raise a queue-completion interrupt. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundle and virtio-net-rx-queue-materialization) are intentionally suppressed under this feature because this proof is the new headline owner. The marker’s trailing labels (rx_buffer=posted, avail=published, notify=rung-once, rx_completion=polled-used-ring, msix_rx_function_enable=not-toggled, msix_table_write=not-performed, device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_rx_submit=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (RX MSI-X wait/ack, provider-driven RX submit, live cloud) carry. The cap::virtio_net_rx_buffer_post_polled_completion_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at the cap::mod.rs activation site (inherited through the implied RX-materialization feature’s compile_error!s).
    • QEMU-emulable vs hardware-only: fully QEMU-emulable; the SLIRP -netdev user backend delivers the ARP reply that drives the RX DMA. Proved locally by make run-cloud-provider-virtio-net-rx-buffer-post. No GCE resources are created. Live-GCE RX stays under cloud-gcp-virtio-net-nic-driver.
  • Production cloud-boot evidence marker (virtio-net-msix-function-enable): under the focused-proof Cargo feature cloud_virtio_net_msix_function_enable_proof (which implies cloud_virtio_net_tx_queue_materialization_proof so the bundle observer

    • production grant-source pickers + userspace bundle smoke keep their plumbing), the cloudboot kernel runs cap::virtio_net_msix_function_enable_proof (kernel/src/cap/virtio_net_msix_function_enable_proof.rs) over the same virtio-net function the authority bundle and queue-materialization proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue (identical to the queue-materialization proof), then walks the PCI MSI-X capability mask-first: it reads the Message Control register, writes FUNCTION_MASK = 1 first, reads back, writes ENABLE = 1 while keeping the function mask set, reads back, then cleans up by clearing both bits and reads back to assert PCI config-space MSI-X state is restored. Exposed through make run-cloud-provider-virtio-net-msix-function-enable.
    • Spec basis: PCI SIG MSI-X §6.8.2 Message Control Register (bits 14 = Function Mask, 15 = MSI-X Enable); virtio 1.2 §2.1.2 (reset clears virtqueue state), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
    • Implemented wire-format subset: stages 1-11 mirror the queue- materialization proof. Stages 12-15 are this proof’s delta: pci::interrupt_capabilities + MsixCapabilityInfo.offset resolve the capability header; pci::try_read_config_u16 reads the Message Control register at capability_offset + 0x02; the proof asserts the pre-state has MSIX_CONTROL_ENABLE clear, performs the mask-first write through pci::try_write_config_u16, reads back and asserts FUNCTION_MASK = 1, ENABLE = 0, performs the enable write keeping the mask, reads back and asserts both bits are set, then performs the cleanup write that clears both bits, reads back and asserts both are clear. The proof never programs an MSI-X table entry, never claims an interrupt-dispatch slot, and never raises a device-autonomous interrupt. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>.
    • Fail-closed assertions: stages 1-11 inherit the queue- materialization proof’s five inline assertions. The MSI-X delta adds four more. (6) Pre-state read-back has MSIX_CONTROL_ENABLE clear. (7) Post-mask-write read-back has FUNCTION_MASK = 1 and ENABLE = 0. (8) Post-enable-write read-back has both bits set. (9) Post-cleanup-write read-back has both bits clear. Every exit path (success or any intermediate failure) runs a best-effort pci::try_write_config_u16 that clears MSIX_CONTROL_ENABLE and MSIX_CONTROL_FUNCTION_MASK regardless of the result chain, then writes 0 to device_status and frees every allocated queue frame. Per-stage outcomes log on the virtio-net-msix-function-enable: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: focused-proof child of the TX queue materialization proof that extends the same kernel-side activation surface with one round of canonical mask-first MSI-X function-level enable + cleanup. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundle and virtio-net-tx-queue-materialization) are intentionally suppressed under this feature because their queue_setup=not-attempted / msix_function_enable=not-toggled claims would be inaccurate now, so the MSI-X function-enable marker is the sole headline. The marker’s trailing labels (tx_descriptor=not-published, notify=not-rung, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot, descriptor publication, used-ring polling, live cloud) carry. The cap::virtio_net_msix_function_enable_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-msix-function-enable. No GCE resources are created.
  • Production cloud-boot evidence marker (virtio-net-tx-submit-doorbell): under the focused-proof Cargo feature cloud_virtio_net_tx_submit_doorbell_proof (which implies cloud_virtio_net_msix_function_enable_proof and transitively cloud_virtio_net_tx_queue_materialization_proof so the bundle observer + production grant-source pickers + userspace bundle smoke + mask-first MSI-X plumbing all keep firing), the cloudboot kernel runs cap::virtio_net_tx_submit_doorbell_proof (kernel/src/cap/virtio_net_tx_submit_doorbell_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, and MSI-X function-enable proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, then allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, writes one TX descriptor at slot 0 of the descriptor ring, publishes one avail-ring entry and advances avail.idx to 1, maps the modern virtio notify region, rings the notify doorbell exactly once for the selected TX queue, reads the post-doorbell avail.idx and device-used.idx for visibility, then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. Exposed through make run-cloud-provider-virtio-net-tx-submit-doorbell.

    • Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout including idx at +2 and ring slots at +4), §2.7.8 (device-area / used ring layout including idx at +2), §4.1.4.4 (notify-cfg capability and per-queue notify address resolution as notify_bar_base + cap.bar_offset + queue_notify_off * notify_off_multiplier), §4.1.5.2 (modern virtio doorbell: u16 write of the queue index to the per-queue notify address), §5.1.6.2 (virtio-net TX descriptor layout). The submit ordering follows virtio 1.2 §2.7.13 (driver writes the descriptor head index to avail.ring[avail.idx % size], then bumps avail.idx after a suitable memory barrier).
    • Implemented wire-format subset: stages 1-14 mirror the MSI-X function-enable proof (status sequence, queue materialization, mask-first MSI-X enable). Stages 15-21 are this proof’s submit /doorbell delta: frame::alloc_frame_zeroed allocates one payload frame, frame::hhdm_offset translates the manager-owned host- physical to a kernel virtual address for the kernel-proxy fill (a minimal 12-byte modern virtio-net header followed by an 8-byte b"CAPOSTX1" body, total payload length 20 bytes); slot 0 of the descriptor ring receives addr = payload_phys, len = 20, flags = 0, next = 0 over the HHDM write; avail.ring[0] = 0 and avail.idx = 1 over the HHDM with a compiler fence between them; the notify region is mapped through pci::map_bar_region and the kernel writes the queue index 1 as a u16 to notify_vaddr + queue_notify_off * notify_off_multiplier; avail.idx is read back and asserted as 1, and the device-written used.idx is read for visibility only. The proof never polls the used ring beyond the single visibility read, never claims a TX completion, never programs an MSI-X table entry, never raises a device-autonomous interrupt, never registers an Interrupt waiter, never performs direct DMA, never programs the IOMMU, and never exports a host-physical address or IOVA. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>.
    • Fail-closed assertions: stages 1-14 inherit the MSI-X function- enable proof’s nine inline assertions. The submit/doorbell delta adds three more. (10) Notify region length must be large enough to contain queue_notify_off * notify_off_multiplier + 2. (11) Notify-region map length must cover that minimum. (12) Post- doorbell avail.idx round-trip must read back as 1. Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 to device_status, asserts every TX queue- state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) regardless of the result chain. The device-used.idx read is deliberately NOT asserted: QEMU may or may not have drained the descriptor by the time the kernel reads it, and the proof’s discipline says tx_completion=not- claimed regardless of the observed value. Per-stage outcomes log on the virtio-net-tx-submit-doorbell: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: focused-proof child of the MSI-X function-enable proof that extends the same kernel-side activation surface with one round of single-slot descriptor publish + single avail-ring entry + single notify doorbell ring, with no used-ring polling or completion claim. The same boot still spawns the cloud-provider-virtio-net-tx-authority-bundle-smoke userspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; all three companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, and virtio-net-msix-function-enable) are intentionally suppressed under this feature because their tx_descriptor=not-published / notify=not-rung / queue_setup=not-attempted / msix_function_enable=not-toggled claims would all be inaccurate now, so the submit/doorbell marker is the sole headline. The marker’s trailing labels (tx_descriptor=published, notify=rung-once, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=not-claimed, provider_visible_queue_address=hidden, provider_fill=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (used-ring polling, provider waiter/ack, interrupt-dispatch slot claim, live cloud) carry. The cap::virtio_net_tx_submit_doorbell_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-submit-doorbell. No GCE resources are created.
  • Kernel-half TX polled-completion proof (predecessor of virtio-net-userspace-provider): under the focused-proof Cargo feature cloud_virtio_net_tx_polled_completion_proof (which implies cloud_virtio_net_tx_submit_doorbell_proof and transitively cloud_virtio_net_msix_function_enable_proof / cloud_virtio_net_tx_queue_materialization_proof so every shared plumbing gate keeps firing), the cloudboot kernel runs cap::virtio_net_tx_polled_completion_proof (kernel/src/cap/virtio_net_tx_polled_completion_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, MSI-X function-enable, and submit/doorbell proofs pick. The proof re-drives the modern virtio status sequence to DRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, publishes one TX descriptor + one avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once for the selected TX queue, polls the device-authored used.idx from the manager-owned used-ring frame with a bounded retry budget until it reaches 1 (one consumed TX descriptor), reads the post-completion avail.idx and the device-authored used[0].(id, len), then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. The module is the predecessor of the userspace-submit polled-completion proof and is dropped from the compile set when the live-publish feature is on (the live-publish proof is the new headline owner of virtio-net-userspace-provider and exercises the same polled completion path through the userspace cap method instead of the kernel-half proxy).

    • Spec basis: inherits the submit/doorbell proof’s basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13). The polled-completion delta uses §2.7.8 (used-ring layout: idx at +2 and 8-byte (id, len) slots at +4) for both the bounded used.idx poll and the device-authored used[0] slot read.
    • Implemented wire-format subset: stages 1-20 mirror the submit/doorbell proof (status sequence, queue materialization, mask-first MSI-X enable, payload kernel-proxy fill, descriptor publish, avail bump, notify doorbell ring). Stages 21-23 are this proof’s polled-completion delta: the manager-owned used-ring used.idx HHDM read is wrapped in a bounded retry loop (with core::hint::spin_loop() between iterations) that converges on the target completion count 1, the post-completion avail.idx HHDM round-trip is asserted as 1, and the device-authored used[0].id / used[0].len are read with an Acquire compiler fence on the success path so the slot data is observed consistently with the used.idx bump. Token grammar: <seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>-polled.iter.<u>-usedid.<u>-usedlen.<u>.
    • Fail-closed assertions: stages 1-20 inherit the submit/doorbell proof’s twelve inline assertions. The polled-completion delta adds three more. (13) The bounded used.idx poll must converge on the target 1 within the retry budget; budget exhaustion fails closed and reports the last observed value. (14) The post-completion avail.idx HHDM round-trip must still read back as 1. (15) The device-authored used[0].id must equal the published descriptor head index 0; used[0].len is recorded for visibility but is deliberately NOT asserted (virtio-net leaves len at 0 for TX device-readable chains, but the kernel does not gate the proof on that). Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 to device_status, asserts every TX queue-state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) only after the final reset read-back is confirmed; if reset cannot be confirmed the frames stay retained rather than being returned while the device may still DMA them. Per-stage outcomes log on the virtio-net-userspace-provider: ok ... / ... failed closed: ... lines so a regression trips the boot log alongside the missing marker.
    • capOS mapping: focused-proof child of the submit/doorbell proof that extends the same kernel-side activation surface with one round of bounded used.idx polling + one accounted completion + one device-authored used[0] slot read, paired with the userspace bundle smoke’s Interrupt cap handle-lifecycle discipline on the same MSI-X BDF (Interrupt.info round-trip identity assertion + release + post-release Interrupt.info fail-closed). That cap-side pairing covers cap-handle identity and post-release stale-handle rejection on the production Interrupt cap; it deliberately does NOT exercise Interrupt.wait/acknowledge, because the production InterruptCapProd::wait and InterruptCapProd::acknowledge paths are unimplemented in the non-qemu cloud kernel and fail closed (kernel/src/cap/interrupt_prod.rs). Real waiter/ack pairing on the virtio-net TX MSI-X route is deferred to a future child that either ports the cap::provider_cap_waiter_proof kernel-injected-dispatch + deferred-EOI discipline onto this route or programs an actual MSI-X table entry + dispatch slot. All four companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, virtio-net-msix-function-enable, and virtio-net-tx-submit-doorbell) are intentionally suppressed under this feature because their tx_descriptor=not-published / notify=not-rung / queue_setup=not-attempted / msix_function_enable=not-toggled / tx_completion=not-claimed claims would all be inaccurate now, so the polled-completion marker is the sole headline. The marker’s trailing labels (tx_descriptor=published, notify=rung-once, msix_function_enable=toggled-mask-first, msix_function_enable_cleanup=cleared, msix_table_write=not-performed, device_autonomous_raise=not-claimed, tx_completion=polled-used-ring, provider_visible_queue_address=hidden, provider_fill=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot claim, real Interrupt.wait waiter, live cloud) carry. The cap::virtio_net_tx_polled_completion_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site. The marker keeps device_autonomous_raise=not-claimed because the proof never enables the per-vector MSI-X table entry, never registers an Interrupt.wait waiter, and observes the completion strictly through the device-authored used-ring update.
    • QEMU-emulable vs hardware-only: predecessor-only. The active make run-cloud-provider-virtio-net headline target switched to the userspace-submit polled-completion proof below, which exercises the same polled-completion path through the userspace cap method and supersedes this kernel-half proxy.
  • Production cloud-boot evidence marker (virtio-net-userspace-provider): under the focused-proof Cargo feature cloud_virtio_net_tx_dmabuffer_live_publish_proof (which implies cloud_virtio_net_tx_polled_completion_proof and transitively the submit/doorbell, MSI-X function-enable, queue-materialization, and authority-bundle proofs so every shared plumbing gate stays compiled in), the cloudboot kernel runs cap::virtio_net_tx_dmabuffer_live_publish_proof (kernel/src/cap/virtio_net_tx_dmabuffer_live_publish_proof.rs) over the same virtio-net function the predecessor picks. Unlike the kernel-half polled-completion predecessor, the kernel-side proof here splits the work into two phases: at-boot init() stages the modern virtio status sequence + TX queue materialization + MSI-X mask-first enable + notify mapping, leaving the device in DRIVER_OK with MSI-X enabled-but-globally-masked; the per-call attempt_live_publish runs from the non-qemu device-manager stub’s validate_dmabuffer_submit_descriptor_admission when the userspace cloud-provider-virtio-net-tx-dmabuffer-live-publish-smoke service’s DMABuffer.submitDescriptor is admitted (queue == 1, descriptor_id == 0, length <= PAGE_SIZE, no user mapping live, no in-flight submit, kernel-known DmaBufferHandle). The cap method resolves the buffer’s host-physical bounce-buffer page, authors one TX descriptor + avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once, polls the device-authored used.idx with the same bounded budget the polled-completion predecessor uses, reads used[0].(id, len), tears the device down (MSI-X mask-first cleanup + device reset + queue-state register read-back asserted to zero, three manager-owned queue frames freed), and emits one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker with provider_fill=userspace-brokered-buffer anchoring the userspace-driven submit boundary. Exposed through make run-cloud-provider-virtio-net – the terminal local harness for the virtio-net userspace-provider chain.

    • Spec basis: inherits the polled-completion proof’s basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13). The live-publish delta drives the same descriptor/avail/notify write sequence and the same bounded used.idx poll, but the descriptor’s addr field is the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame.
    • Implemented wire-format subset: at-boot init() covers stages 1-14 of the polled-completion sequence (status sequence + queue materialization + mask-first MSI-X enable + notify mapping) and stashes the staged state. Per-call attempt_live_publish covers stages 15-24: descriptor publish (with desc[0].addr = payload_phys from the userspace DMABuffer), avail-ring entry + avail.idx bump with a release compiler fence, notify doorbell ring, used-ring used.idx bounded poll, used[0] slot read with an acquire compiler fence, MSI-X cleanup, device reset, queue-state register read-back, and queue-frame release. Token grammar adds pool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-paylen.<u> so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, and the resolved payload host-physical address are observable from the marker; the polled-completion marker’s desc.<hex> field is intentionally not in the live-publish marker because the per-call descriptor write happens after the marker emission window’s boundaries.
    • Fail-closed assertions: stages 1-14 inherit the polled-completion proof’s assertions for status/queue/MSI-X bring-up. Stages 15-24 inherit its assertions for descriptor publish, doorbell, polled completion, MSI-X cleanup, and reset. The per-call admission gate adds five more, surfaced through the cap-side DmaBufferSubmitDescriptorAdmission shape: (1) queue != 1 fails closed with dmabuffer-tx-queue-required / non-tx-queue-rejected (RX is rejected explicitly; queue >= 2 trips the standard queue-out-of-range request gate). (2) descriptor_id != 0 fails closed with descriptor-id-out-of-range. (3) length > PAGE_SIZE fails closed with length-exceeds-buffer. (4) A live userspace VMA fails closed with dmabuffer-mapping-live (the cap-side block_submit_for_live_mapping short-circuit handles this before the device-manager runs; the stub defends in depth). (5) A second submitDescriptor on the same buffer without an intervening freeBuffer fails closed with dmabuffer-descriptor-already-inflight; the in-flight slot is dropped only when the parked-buffer record drops on freeBuffer. A post-freeBuffer submitDescriptor fails closed with the standard stale-handle error.
    • capOS mapping: terminal local headline that flips the descriptor addr source from a manager-allocated kernel-proxy payload frame to the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger. The marker’s trailing label provider_fill=userspace-brokered-buffer replaces the kernel-half polled-completion predecessor’s provider_fill=kernel-proxy-bounded to reflect the change. All four companion headline markers (virtio-net-tx-authority-bundle, virtio-net-tx-queue-materialization, virtio-net-msix-function-enable, and virtio-net-tx-submit-doorbell) are suppressed because the userspace-submit polled-completion proof is the new headline owner; the predecessor cap::virtio_net_tx_polled_completion_proof module is dropped from the compile set under this feature so its competing emission of the same virtio-net-userspace-provider marker cannot fire. The cap::virtio_net_tx_dmabuffer_live_publish_proof caller is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof at the cap::mod.rs activation site. The marker keeps device_autonomous_raise=not-claimed, msix_table_write=not-performed, and live_cloud=not-attempted because the proof never enables the per-vector MSI-X table entry, never registers an Interrupt.wait waiter, and observes the completion strictly through the device-authored used-ring update.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net. No GCE resources are created.
  • MSI-X wait/ack (cap::virtio_net_tx_msix_wait_ack_proof): Carries the userspace-submit polled-completion delta one authority step further: same brokered TX submit boundary, but the userspace-observed completion event is the provider-cap-side wake from a kernel-injected dispatch on the bound virtio-net TX MSI-X route. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_tx_msix_wait_ack_proof (which implies cloud_virtio_net_tx_dmabuffer_live_publish_proof and its predecessors). The wait/ack proof’s at-boot init runs after the live-publish proof’s init: it registers + claims an MSI-X route on the same virtio-net BDF under the ManagerGrantSource owner, maps the MSI-X table BAR kernel-side, writes table entry PROOF_TABLE_ENTRY = 1 mask-first per PCI 3.0 §6.8.2, attaches the route to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks the route + entry. The PCI function-level MSI-X enable bit stays set with the function mask still asserted (held by the live-publish proof’s mask-first toggle), so the virtio-net device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns the cloud-provider-virtio-net-tx-msix-wait-ack-smoke userspace service, which receives one Console + DeviceMmio + DMAPool + Interrupt bundle (the Interrupt source resolves through the wait/ack proof’s grant source, replacing interrupt_grant_source_prod under this feature), asserts the Interrupt.info identity + labels (bootstrap_grant=virtio-net-tx-msix-wait-ack-proof, wait=kernel-injected-dispatch-wait, acknowledge=kernel-injected-deferred-eoi-acknowledge), drives the same brokered DMABuffer.submitDescriptor chain the predecessor exercises, then calls Interrupt.wait (the cap’s invoke_wait runs device_interrupt::handle_lapic_delivery and returns one delivery with delivery_count_after == delivery_count_before + 1 plus one armed deferred LAPIC EOI), calls Interrupt.acknowledge (the cap retires the deferred LAPIC EOI through acknowledge_deferred_lapic_eoi_for_route, ack_delta == 1, pending_after == 0), frees the DMABuffer, and releases the Interrupt cap. The kernel-side on_release then runs the masked-no-wake + reassign + stale-handle assertion chain on the bound route (mirroring cap::provider_cap_waiter_proof’s discipline) and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the publish outcome (recorded in PUBLISH_OUTCOME by the predecessor’s attempt_live_publish when the feature is on) with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the polled-completion predecessor in two places: tx_completion=msix-wait-ack-injected replaces tx_completion=polled-used-ring (the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), and msix_table_write=performed-masked-first replaces msix_table_write=not-performed (the wait/ack proof’s init programmed one MSI-X table entry). All other discipline labels are preserved: device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_fill=userspace-brokered-buffer, iova_export=disabled-future-only, live_cloud=not-attempted. The predecessor live-publish proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. The cap::virtio_net_tx_msix_wait_ack_proof activation site is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, and cloud_virtio_net_tx_authority_bundle_proof. Device-autonomous MSI-X delivery (programming the virtio queue’s queue_msix_vector for a hardware-raised TX completion interrupt and the broader production dispatch-slot proof), RX path, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.

    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-tx-msix-wait-ack. No GCE resources are created.
  • RX MSI-X wait/ack (cap::virtio_net_rx_msix_wait_ack_proof): the RX analogue of the TX MSI-X wait/ack proof above. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_msix_wait_ack_proof (which implies cloud_virtio_net_rx_buffer_post_polled_completion_proof and its predecessors). The RX completion is staged entirely kernel-side at boot: the RX buffer-post proof’s report() stages the device through DRIVER_OK, posts one manager-owned device-writable RX buffer, drives the ARP TX SLIRP stimulus, polls one real device->host RX DMA (used.idx == 1, used[0].len > 0), and – under this feature – additionally holds the PCI function-level MSI-X enable mask-first (hold_msix_function_enable_mask_first: FUNCTION_MASK = 1 then ENABLE = 1, held, not cleaned up) and records the publish outcome into the wait/ack proof’s PUBLISH_OUTCOME slot instead of emitting its standalone virtio-net-rx-buffer-post headline. The wait/ack proof’s at-boot init then drives the graduated always-built cap::interrupt_programmed::program_attach_arm_unmask over MSI-X table entry 0 (the RX queue’s per-queue config vector, virtio-pci §4.1.5.1.2) on the same virtio-net BDF under the ManagerGrantSource owner: register + claim + write table entry 0 mask-first per PCI 3.0 §6.8.2 + manager attach + deferred-LAPIC-EOI arm + route + entry unmask, tearing the route back down via teardown on any error or lost-init race. The device’s RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the function mask stays asserted, so the device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns the cloud-provider-virtio-net-rx-msix-wait-ack-smoke userspace service, which receives one Console + Interrupt bundle (the RX completion is staged kernel-side, so – unlike the TX wait/ack provider – it needs no DMAPool/DeviceMmio cap; the Interrupt source resolves through the wait/ack proof’s grant source, replacing interrupt_grant_source_prod under this feature), asserts the Interrupt.info identity + labels (bootstrap_grant=virtio-net-rx-msix-wait-ack-proof, wait=kernel-injected-dispatch-wait, acknowledge=kernel-injected-deferred-eoi-acknowledge), calls Interrupt.wait (the cap’s invoke_wait runs the graduated device_interrupt::wait_kernel_injected_dispatch and returns one delivery with delivery_count_after == delivery_count_before + 1 plus one armed deferred LAPIC EOI), calls Interrupt.acknowledge (the cap retires the deferred LAPIC EOI through acknowledge_deferred_lapic_eoi_for_route, ack_delta == 1, pending_after == 0), and releases the Interrupt cap. The kernel-side on_release then runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the RX publish outcome with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the RX buffer-post predecessor in three places: rx_completion=msix-wait-ack-injected replaces rx_completion=polled-used-ring (the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), msix_rx_function_enable=toggled-mask-first replaces msix_rx_function_enable=not-toggled (the staging now holds the function-level MSI-X enable mask-first), and msix_table_write=performed-masked-first replaces msix_table_write=not-performed (the wait/ack proof’s init programmed one MSI-X table entry). All other discipline labels are preserved: device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, provider_rx_submit=kernel-proxy-bounded, iova_export=disabled-future-only, live_cloud=not-attempted. The predecessor RX buffer-post proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. The cap::virtio_net_rx_msix_wait_ack_proof activation site is mutually exclusive with qemu, cloud_provider_cap_waiter_proof, cloud_virtio_net_device_bringup_proof, cloud_virtio_net_tx_authority_bundle_proof, and every TX/NVMe Interrupt-source proof feature. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RX queue_msix_vector), provider-driven RX submit, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.

    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-msix-wait-ack. No GCE resources are created.
  • RX provider-driven buffer submit (cap::virtio_net_rx_userspace_submit_proof): the RX analogue of the TX DMABuffer live-publish proof, carried one authority step past the RX MSI-X wait/ack proof above. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_userspace_submit_proof (which implies cloud_virtio_net_rx_buffer_post_polled_completion_proof and its predecessors). Unlike the RX MSI-X wait/ack proof, the RX receive buffer is no longer a manager-owned bounce page filled kernel-side: it is the userspace provider’s brokered DMABuffer, posted to the RX avail ring through DMABuffer.submitDescriptor(queue=0). The feature drops the RX buffer-post module’s at-boot kernel-proxy report(); instead this proof’s self-contained init stages the device (status sequence + RX queue 0 + TX queue 1 materialization + held mask-first MSI-X function enable + notify map), allocates NO RX payload frame, and programs the kernel-injected RX MSI-X route over table entry 0 through the graduated cap::interrupt_programmed::program_attach_arm_unmask surface (same as the wait/ack proof). The cloudboot manifest spawns the cloud-provider-virtio-net-rx-userspace-submit-smoke userspace service, which receives one Console + DeviceMmio + DMAPool + Interrupt bundle (the RX provider, unlike the kernel-proxy RX wait/ack provider, needs the DMAPool/DeviceMmio caps to allocate and submit its brokered DMABuffer). The provider asserts Interrupt.info identity + labels (bootstrap_grant=virtio-net-rx-userspace-submit-proof), allocates one brokered bounce-buffer DMABuffer (NOT mapped or written before submit – the device is the RX writer), and calls DMABuffer.submitDescriptor(queue=0, descriptor_id=0, length=2048). The non-qemu device-manager admission gate matches the parked bounce-buffer handle, validates the request shape (queue == 0, descriptor_id == 0, length <= PAGE_SIZE, no live user mapping, no in-flight submit), resolves the buffer’s kernel-known host-physical bounce-buffer page, and drives attempt_rx_submit: it authors the RX desc[0] = (provider_buffer_phys, length, flags=VIRTQ_DESC_F_WRITE, next=0) + avail-ring entry over the manager-owned RX ring frames, rings the RX notify doorbell once, drives the ARP TX SLIRP stimulus (kernel-half), polls one real device->host RX DMA (used.idx == 1, used[0].len > 0), reads the observed EtherType, resets the device, and frees the manager-owned RX/TX ring + TX payload frames. The provider then observes the completion through Interrupt.wait (kernel-injected dispatch, delivery_count_after == delivery_count_before + 1) and Interrupt.acknowledge (deferred LAPIC EOI retired, ack_delta == 1), re-maps its DMABuffer R/O and reads a non-zero received EtherType through its own mapping, unmaps, frees the buffer, and releases the Interrupt cap. The kernel-side on_release runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly one cloudboot-evidence: virtio-net-userspace-provider <token> headline marker combining the RX publish outcome with the wait/ack delivery counts.

    • Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX descriptor/avail/notify write sequence and the same bounded used.idx poll, but the descriptor’s addr field is the userspace-allocated DMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame.
    • Implemented wire-format subset: at-boot init() covers the status sequence + RX/TX queue materialization + mask-first MSI-X function enable + notify mapping + RX MSI-X route program. Per-call attempt_rx_submit covers the RX descriptor publish (desc[0].addr = payload_phys from the userspace DMABuffer, flags = VIRTQ_DESC_F_WRITE), avail-ring entry + avail.idx bump with a release compiler fence, RX notify doorbell ring, the ARP TX stimulus, the used-ring used.idx bounded poll, used[0] slot read with an acquire compiler fence, the observed EtherType read, device reset, queue-state register read-back, and queue-frame release. Token grammar replaces the wait/ack marker’s rxpay.<hex> field with pool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-rxlen.<u> so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, the resolved payload host-physical address, and the requested receive length are observable from the marker.
    • Fail-closed assertions: inherits the RX buffer-post proof’s assertions for status/queue/MSI-X bring-up and the RX descriptor publish + ARP stimulus + polled completion + reset, and the wait/ack proof’s masked-no-wake + reassign + stale-handle / stale-token assertion chain. The per-call admission gate adds the DMABuffer.submitDescriptor request-shape checks surfaced through the cap-side DmaBufferSubmitDescriptorAdmission shape: queue != 0 fails closed with dmabuffer-rx-queue-required / non-rx-queue-rejected (TX is rejected explicitly; queue >= 2 trips the standard queue-out-of-range request gate), descriptor_id != 0 fails closed with descriptor-id-out-of-range, length > PAGE_SIZE fails closed with length-exceeds-buffer, a live userspace VMA fails closed with dmabuffer-mapping-live, and a second submitDescriptor without an intervening freeBuffer fails closed with dmabuffer-descriptor-already-inflight. The marker’s trailing labels flip provider_rx_submit from kernel-proxy-bounded to userspace-brokered-buffer and add host_physical_user_visible=0 / direct_dma=blocked; the RX device-write DMA discipline (device_autonomous_raise=not-claimed, provider_visible_queue_address=hidden, iova_export=disabled-future-only, live_cloud=not-attempted) is preserved. Teardown confirms a device reset BEFORE the provider’s freeBuffer scrubs/frees the brokered buffer page.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-userspace-submit. No GCE resources are created. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RX queue_msix_vector), multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
  • RX production-IDT-dispatch waiter wake (cap::virtio_net_rx_production_idt_dispatch_proof): carries the RX userspace-submit proof one authority step further. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_production_idt_dispatch_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polled used.idx – is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routes attempt_rx_submit here). The load-bearing change is the production IDT dispatch wiring: the non-qemu arch::x86_64::lapic::handle_device_interrupt arm previously discarded real device-MSI vectors with a bare eoi(), so a real interrupt-gate entry could never reach a deferred-EOI dispatch slot or wake an Interrupt.wait. This proof wires that arm to record an IDT handler entry and route the vector through device_interrupt::handle_lapic_delivery, honoring eoi_deferred (the deferred-EOI path owns the EOI write, retired by acknowledge) and keeping the bare eoi() fallback for unregistered/out-of-pool vectors. The Interrupt.wait cap method then fires ONE real INT $vector on the bound RX route’s vector (IF cleared – the syscall context runs IF-cleared by SFMASK design and INT n ignores IF; see the Fail-closed assertions bullet below) – graduating the qemu-only arch::lapic::inject_real_lapic_int_for_proof mechanic to this proof feature – so the waiter wakes through a real CPU interrupt-gate entry, not the synchronous device_interrupt::wait_kernel_injected_dispatch call every prior RX/cap-waiter proof used.

    • Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate clears EFLAGS.IF on entry) and Vol. 2 INT n description (which ignores EFLAGS.IF); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half.
    • Implemented wire-format subset: identical to the userspace-submit proof for the publish half. The new surface is kernel interrupt-path wiring (no new device wire-format): the production handle_device_interrupt non-qemu arm (kernel/src/arch/x86_64/lapic.rs), a per-vector IDT handler-entry counter (device_interrupt::record_idt_handler_entry / idt_handler_entry_count), and the graduated inject_real_lapic_int_for_proof. The device’s RX queue_msix_vector stays VIRTIO_MSI_NO_VECTOR and the PCI function mask stays held; the INT is fired by this proof, NOT by the device.
    • Fail-closed assertions: inherits the userspace-submit proof’s publish + masked-no-wake + reassign + stale-handle / stale-token chain, and adds: wait asserts delivery_count_after == delivery_count_before + 1, the per-vector IDT handler-entry count advanced by exactly one (idt_handler_observed), the real-delivery delta equals the IDT-entry delta (direct_dispatch_call_count_unchanged, i.e. no fallback synchronous dispatch was used), and one deferred LAPIC EOI is pending; the masked-route assertion now fires a real INT through the masked route and asserts NO delivery_count advance and NO deferred-EOI pending/ack change. The cap-dispatch syscall context runs with EFLAGS.IF cleared by SFMASK design (arch::x86_64::syscall) and INT n ignores IF, so int_fired_with_if is recorded as observed evidence only (false in this build) and is NOT a gating condition. The headline marker flips rx_completion to real-idt-interrupt-gate-wake and adds waiter_wake=real-idt-interrupt-gate, idt_dispatch=production-wired, plus the trailing -idthandler.1-directcall.1-iffired.<0|1>-maskedint.1 token booleans; device_autonomous_raise=not-claimed and live_cloud=not-attempted are preserved.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-production-idt-dispatch. No GCE resources are created. Flipping the device’s RX queue_msix_vector + clearing the function mask so the DEVICE raises the MSI – reusing this proof’s now-proven production dispatch path – is now covered by the device-autonomous MSI-X proof below. Live-GCE RX evidence remains future work.
  • RX device-autonomous MSI-X delivery proof (cap::virtio_net_rx_device_autonomous_msix_proof): carries the production-IDT-dispatch proof one authority step further and proves a device-raised virtio-net RX MSI-X reaches the production IDT path under local QEMU/KVM. Active under the non-qemu cloud kernel built with cloud_virtio_net_rx_device_autonomous_msix_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and reuses the same brokered RX publish path). The module enables PCI memory-space decoding and bus mastering, stages RX queue 0 and TX queue 1, enables MSI-X function-level control mask-first, programs RX queue 0 COMMON_QUEUE_MSIX_VECTOR = 0, programs MSI-X table entry 0 through cap::interrupt_programmed::program_attach_arm_unmask, clears the PCI function mask, and then submits one userspace-owned RX bounce buffer plus the ARP TX stimulus. The RX DMA succeeds (used[0].len > 0, observed EtherType 0x0806), proving the data path remains the same brokered provider path.

    • Spec basis: virtio 1.2 §4.1.5.1.2 (modern per-queue MSI-X vector is the MSI-X table entry index), virtio 1.2 §2.7.6 / §2.7.8 / §5.1.6 for the RX descriptor/avail/used path, and PCI 3.0 §6.8.2 for MSI-X table entry and Message Control masking semantics.
    • Implemented wire-format subset: the proof writes only the PCI COMMAND memory-space/bus-master bits, the RX queue config-vector selector, the same split-ring RX/TX descriptors and avail entries as the userspace-submit proof, one RX and one TX notify, one MSI-X table entry, and the PCI MSI-X function mask bit. It does not expose host-physical or IOVA addresses to userspace, does not program an IOMMU, and does not add multi-queue or full-NIC readiness.
    • Proof assertions: make run-cloud-provider-virtio-net-rx-device-autonomous-msix now asserts pci_command=0x0107, one device-raised Interrupt.wait delivery on vector 0x50 with int_injected=0, delivery_count_before=0, delivery_count_after=1, idt_handler_observed=true, eoi_deferred=true, and one deferred-EOI Interrupt.acknowledge (ack_delta=1). The final cloudboot-evidence: virtio-net-userspace-provider marker includes pcicmd.0107, idthandler.1, directcall.1, devraise.1, intinjected.0, and rx_completion=device-autonomous-msix. Closeout validation also keeps the RX production-IDT dispatch, RX userspace-submit, provider cap-waiter, run-net, and default boot-smoke gates green under local QEMU.
    • QEMU/KVM diagnosis: earlier bpftrace evidence showed QEMU reached msix_notify(vector=0) with an unmasked MSI-X entry and prepared 0xfee00000/0x50, but KVM did not accept vector 0x50. The missing precondition was explicit PCI COMMAND bus-master enablement in this proof path; after the proof enables memory-space decoding + bus mastering, local QEMU/KVM delivers the MSI-X to the guest IDT path.
  • RX polled used-ring completion (no injected dispatch) (cap::virtio_net_rx_polled_completion_proof): the first virtio-net proof whose RX completion signal is real driver progress, not an injected proxy. Active under the non-qemu cloud kernel built with the Cargo feature cloud_virtio_net_rx_polled_completion_proof (which implies cloud_virtio_net_rx_userspace_submit_proof and its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polled used.idx – is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routes attempt_rx_submit here). The load-bearing change is on the completion path: every prior virtio-net/cap-waiter proof signalled the Interrupt.wait completion through device_interrupt::wait_kernel_injected_dispatch (a kernel-side dispatch-slot proxy) or, in the IDT-dispatch proof, a fired INT $vector – neither produced by real driver progress. Here virtio_net_rx_polled_completion_proof::invoke_wait instead reports the completion from the already-latched polled used-ring state captured during attempt_rx_submit (the PublishedRx used_id == 0 / used_len > 0 / polled_used_idx >= POLL_TARGET_USED_IDX, latched from the predecessor’s reused poll_used_idx under its Acquire fence): there is NO wait_kernel_injected_dispatch call and NO inject_real_lapic_int_for_proof anywhere in the wait/ack path, and zero kernel-injected interrupts. invoke_acknowledge is a poll-confirmation no-op (no deferred LAPIC EOI to retire, since no interrupt was taken). The bound RX MSI-X route is still programmed at boot but is used ONLY by the release-time masked-no-wake/stale-handle assertion chain.

    • Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the device advances) and §2.7.10 (the VIRTQ_AVAIL_F_NO_INTERRUPT driver flag the predecessor already sets, so the device performs no MSI either); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half.
    • Implemented wire-format subset: identical to the userspace-submit proof for the publish half; no new device wire-format and no new kernel interrupt-path wiring. The completion is a pure memory read of the latched used-ring state plus a device_interrupt::snapshot_dispatch_slot before/after delivery_count comparison.
    • Fail-closed assertions: inherits the userspace-submit proof’s publish + masked-no-wake + reassign + stale-handle / stale-token chain, and replaces the wait/ack injection assertions with their polled inverse: wait asserts the latched used_id == 0 / used_len > 0 / polled_used_idx >= POLL_TARGET_USED_IDX (completion_observed) AND delivery_count_after == delivery_count_before (int_injected=0, no kernel dispatch advanced); acknowledge asserts no deferred LAPIC EOI was pending and none was retired (hardware_dispatch_ack_delta == 0, eoi_written=false); and on_release requires provider_observed_dispatch == 0 and provider_observed_ack == 0 (the inverse of the injected predecessor’s >= 1). The headline marker flips rx_completion to polled-used-ring and adds waiter_wake=polled-used-ring, int_injected=0, with the trailing -deliv.0-ack.0 token booleans; device_autonomous_raise=not-claimed and live_cloud=not-attempted are preserved.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-rx-polled-completion. No GCE resources are created. Graduating this polled provider off the per-proof feature onto the default system.cue cloudboot manifest, programming the device’s RX queue_msix_vector for device-autonomous delivery, and any live-GCE RX evidence are future work.
  • Polled RX+TX provider, always-built off the per-proof feature (cap::virtio_net_polled_provider): graduates the polled provider above into the production compile set. The module is always-built in the default non-qemu cloud kernel (cfg(not(feature = "qemu")), no cloud_*_proof feature), derived from cap::virtio_net_rx_polled_completion_proof with the proof gate removed and the feature-gated virtio_net_tx_authority_bundle_proof bundle-observer calls dropped (the per-grant identity is still recorded through the always-built hardware_audit cap-audit). The polled completion behaviour (read the latched poll_used_idx used-ring state in invoke_wait, no wait_kernel_injected_dispatch, no inject_real_lapic_int_for_proof, no-op invoke_acknowledge) is identical; only the activation switch changes from a Cargo feature to a manifest-observable condition. kernel::run_init calls virtio_net_polled_provider::init only when the booted manifest declares the cloud-provider-virtio-net-polled-provider-default-smoke binary, so on the literal system.cue, run-cloud-interrupt-grant, and every other default cloudboot manifest the provider is never staged (is_staged()==false) and is inert. The interrupt cap is granted through the unchanged production interrupt_grant_source_prod (no new KernelCapSource arm, no proof-only grant-source replacement); that source delegates its cap to virtio_net_polled_provider::build_cap_for_grant while the provider is staged, otherwise it keeps its admission-check-only skeleton. The always-built device_manager::stub submit-admission preview and accepted path admit RX queue 0 and route DMABuffer.submitDescriptor to virtio_net_polled_provider::attempt_rx_submit only while staged.

    • Marker: emits a DISTINCT cloudboot-evidence: virtio-net-polled-provider <token> headline (vs the proof’s virtio-net-userspace-provider) so the two manifests are distinguishable, adding the labels provider_build=always-built-default-kernel, provider_feature_gate=none, and grant_source=production-despecialized to the polled-completion label set. The cap::provider_nic_bind_proof::report re-point landed (the provider-nic-bound marker above now fires from this provider’s real polled TX+RX completion via report_real_completion on the cloud-provider-nic-bound-real-polled-driver-smoke manifest); the literal system.cue fold is not yet implemented.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-provider-default on the default non-qemu kernel with no cloud_*_proof feature. No GCE resources are created; live_cloud=not-attempted.
  • Polled provider teardown / stale-authority (clean cap-op-release) – the always-built polled provider now carries an asserted S.11.2 teardown/stale-authority chain over its DMA + MMIO + IRQ authority on the clean cap-op-release path, not only the IRQ route. When the dedicated teardown manifest is booted (run_init calls virtio_net_polled_provider::arm_teardown_report because the cloud-provider-virtio-net-polled-teardown-smoke binary is declared), the provider’s complete_after_release (kernel/src/cap/virtio_net_polled_provider.rs, run_teardown_assertions + emit_teardown_evidence) re-validates the brokered DMA + DeviceMmio authority the smoke released before the Interrupt cap and emits one combined cloudboot-evidence: virtio-net-polled-teardown <token> headline.

    • Mechanisms reused: device_manager::validate_dmabuffer_record / validate_dmapool_record (stale DMA handle / stale pool-allocate rejected fail-closed), device_manager::last_bounce_page_release_evidence (the scrub-before-free / ledger-removed ordering stamped by detach_dmabuffer_record_for_cap_release), device_manager::validate_devicemmio_record over devicemmio_grant_source_prod::last_issued_handle_and_owner (the granted DeviceMmio cap’s record is detached on release, so a stale access fails closed), and the inherited device_interrupt masked-no-wake / reassign / stale-handle chain folded into the same marker.
    • Marker labels: stale_dma_buffer_blocked=true, dma_page_scrubbed_before_free=true, dma_ledger_removed_after_scrub=true, stale_dma_pool_alloc_blocked=true, stale_mmio_blocked=true, mmio_handle_invalidated=true, masked_no_wake=true, reassign_generation_bumped=true, stale_token_wake_blocked=true, stale_route_handle_blocked=true, int_injected=0, host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only. The marker is suppressed fail-closed if any leg regresses. The MMIO leg rides the granted DeviceMmio cap; the provider’s own pci::map_bar_region BAR mapping stays boot-only with no kernel invalidation API by design.
    • Scope boundary: the clean cap-op-release marker carries driver_death_teardown=not-attempted-this-slice. The process-exit-under-active-authority teardown trigger is its own proof (see the next entry); it crosses the process-lifecycle authority boundary, and the single-shot provider (one Interrupt cap, single-shot attempt_rx_submit) cannot drive both teardown paths in one boot, so it has its own manifest/boot.
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-teardown on the default non-qemu kernel with no cloud_*_proof feature. live_cloud=not-attempted.
  • Polled provider DRIVER-DEATH / process-exit teardown – the always-built polled provider’s release-time teardown chain now also covers the process-exit-under-active-authority trigger. When the dedicated driver-death manifest is booted (run_init calls virtio_net_polled_provider::arm_driver_death_report because the cloud-provider-virtio-net-polled-driver-death-smoke binary is declared), the smoke drives the same real polled RX submit/wait/ack + RX read-back, scrubs+frees its DMABuffer, and then exits while still holding its DMAPool, DeviceMmio, and Interrupt caps. The kernel’s CapReleaseReason::ProcessExit cap-teardown reclaims all three in cap-table slot order (device_mmio then dmapool before the interrupt cap), so the provider’s complete_after_release process-exit arm (run_teardown_assertions + emit_driver_death_evidence) re-validates the now-stale DMA + DeviceMmio authority and runs the IRQ masked-no-wake / reassign / stale-handle chain over the route, emitting one cloudboot-evidence: virtio-net-polled-driver-death <token> headline.

    • Mechanisms reused: identical to the clean cap-op-release entry above, but the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s ProcessExit cap-teardown rather than explicit Interrupt.release / DMAPool.release / DeviceMmio.release calls. The runtime-allocated DMABuffer cap is torn down AFTER the manifest-granted Interrupt cap in slot order, so its page is scrubbed+freed by the smoke’s freeBuffer before exit (the normal DMABuffer lifecycle); the buffer is then re-validated stale with its scrub-before-free ordering intact.
    • Marker labels: same DMA / MMIO / IRQ / no-export discipline labels as the clean-release marker, plus driver_death_teardown=no-live-authority and release_path=process-exit. The marker is suppressed fail-closed if any leg regresses (polled_completion_clean, the DMA/MMIO stale re-validation, or the IRQ chain). The clean-release virtio-net-polled-teardown headline and the cap-op-release-gated virtio-net-polled-provider headline both stay absent on this manifest (the Interrupt cap is reclaimed via ProcessExit, not an explicit release).
    • QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by make run-cloud-provider-virtio-net-polled-driver-death on the default non-qemu kernel with no cloud_*_proof feature. live_cloud=not-attempted.
  • Provider-chain closeout: the parent cloud-prod-virtio-net-userspace-provider-local-proof is closed by the decomposed child chain above and the legacy/transitional bind below. The local non-qemu cloudboot/QEMU evidence now includes modern TX/RX userspace-provider proofs, the always-built polled provider, real-polled-driver provider-nic-bound, clean-release/process-exit stale-authority proofs, and the legacy-polled path that later passed the real-GCE provider-nic-bound gate. This closeout does not claim L4 socket/smoltcp relocation, literal system.cue provider fold, reusable full-NIC/multiqueue readiness, or device-autonomous MSI-X delivery; those remain separate lanes.

4. Legacy / transitional virtio 0.9 PIO transport (cloud bind)

Everything above is the modern (virtio 1.x) transport: vendor capability windows in MMIO BARs + MSI-X. Real GCE presents the NIC as a legacy / transitional virtio 0.9 device instead (run 1780377997-281b, 2026-06-02): PCI 1af4:1000, no modern vendor capability windows, no usable MMIO memory BAR, legacy INTx, no MSI-X. The whole legacy virtio config block lives in a PIO (I/O space) BAR0 register window, which the modern transport discovery cannot represent, so the modern polled provider selects no candidate. This section maps the legacy PIO transport subset the kernel implements to bind that device. It has two parts: selection + brokered PIO config (kernel/src/cap/virtio_net_legacy_select_proof.rs) and the legacy single-PFN contiguous-queue polled TX/RX data path + provider-nic-bound (kernel/src/cap/virtio_net_legacy_datapath_proof.rs); both are implemented and proved locally.

  • Spec basis: Virtual I/O Device (VIRTIO) legacy interface — the pre-1.0 “Virtio PCI” I/O-BAR register layout (virtio 0.9.5 / the legacy appendix of the OASIS 1.x spec, §4.1.4.8 “Legacy Interfaces”). Cross-checked against QEMU hw/virtio/virtio-pci.c (virtio_pci_config_* legacy I/O ops) and the Linux virtio_pci_legacy driver.
  • Implemented wire-format subset (no-MSI-X legacy I/O register block, kernel/src/cap/virtio_net_legacy_select_proof.rs LEGACY_* offset constants): device features (0x00, u32 RO), guest/driver features (0x04, u32 RW), queue PFN (0x08, used by the data path), queue size (0x0c, u16 RO), queue select (0x0e, u16 RW), queue notify (0x10, u16 RW), device status (0x12, u8 RW), ISR status (0x13, u8 RO). The device-specific config (MAC, …) follows at 0x14 in the no-MSI-X layout (offsets shift by 4 only when MSI-X is enabled, which this polled path never does). Feature negotiation is the 32-bit legacy feature word: VIRTIO_F_VERSION_1 (a high feature bit) is unrepresentable and absent, and VIRTIO_NET_F_MAC (1 << 5) is required and acknowledged. The legacy ring uses the single-PFN contiguous virtqueue layout (descriptor table + avail ring + padding to VIRTIO_PCI_VRING_ALIGN (4096) + used ring, addressed by one page-frame number written to LEGACY_QUEUE_PFN as physical_address >> 12) — materialized by the data path (virtio_net_legacy_datapath_proof::materialize_queue).
  • Legacy single-PFN data path (kernel/src/cap/virtio_net_legacy_datapath_proof.rs): after the status + feature handshake, both virtqueues are materialized as single physically contiguous, page-aligned regions (frame::alloc_contiguous) whose descriptor/avail/used sub-addresses are computed from the contiguous base, so the in-ring desc/avail/used manipulation reuses the modern provider’s helpers (write_desc_slot_0, write_avail_*, poll_used_idx, read_used_ring_slot_0) unchanged — only the transport differs. The doorbell is a PIO write of the queue index to LEGACY_QUEUE_NOTIFY (no modern MMIO notify region). The virtio-net header is the 10-byte legacy header (no VIRTIO_NET_F_MRG_RXBUF; the modern path uses 12). The reset is polled with a bounded settle (real legacy hardware acknowledges reset asynchronously; QEMU clears synchronously) rather than a single-shot == 0. A real polled TX (queue 1) + RX (queue 0) completes by reading the used rings (no MSI-X route programmed, no interrupt taken or injected); the device is then reset clean and all DMA frames are freed.
  • Device-fixed queue size + contiguous-allocation bound (materialize_queue, MAX_LEGACY_QUEUE_SIZE, vring_layout): legacy virtio queue size is device-fixed and read-only (LEGACY_QUEUE_NUM, 0x0c) — the driver cannot shrink it, so it must materialize whatever single-PFN vring the device advertises. materialize_queue reads that size and rejects a zero, non-power-of-two, or over-bound value cleanly (the vring layout requires a power-of-two size; the bound caps the contiguous allocation), then sizes the contiguous region via vring_layout: 256 → 3 pages, 1024 → 8 pages, and the live GCE Andromeda virtio-net’s 4096-entry queue → ~28 pages per queue. MAX_LEGACY_QUEUE_SIZE is the virtio spec maximum (32768, a power of two), so the bound admits any spec-legal device-fixed size — including GCE’s 4096 — while still failing closed above it; an alloc_contiguous that cannot satisfy the request fails closed (no panic) on the existing alloc … contiguous frames … failed arm. QEMU’s legacy virtio-net advertises 256 by default and caps queue size at 1024 (VIRTQUEUE_MAX_SIZE), and locks tx_queue_size at 256 for the non-vhost SLIRP device, so the largest local shape is rx_queue_size=1024 (an 8-page RX vring, exercised by make run-cloud-provider-nic-bound-legacy-large-queue); the exact 4096-entry materialization is only verifiable on real GCE (the billable live-GCE run).
  • GCE-viable RX stimulus / completion (fill_dhcp_discover_legacy, read_device_mac, poll_rx_used_wall_clock): the TX stimulus is a broadcast DHCP DISCOVER sourced from the device’s real MAC (read from legacy device-config space at 0x14), not the modern path’s ARP “who-has 10.0.2.2” from a hardcoded spoofed source. This is required for a real cloud NIC: GCE’s Andromeda SDN enforces MAC/IP anti-spoofing (egress from a non-assigned MAC/IP is dropped), 10.0.2.2 does not exist on the VPC, and no responder answers an ARP-for-the-gateway. A legitimately-sourced DHCP DISCOVER is answered by both QEMU SLIRP’s built-in DHCP server and the GCE SDN DHCP responder, giving a real device->host RX frame. The completion model is accept-any inbound frame (any non-empty frame with a readable EtherType satisfies RX, so an ambient gateway ARP/RA on GCE counts too), polled against a wall-clock budget (monotonic_ns deadline, 5 s) rather than a fixed spin count sized for SLIRP’s instantaneous reply. Interrupts are masked during this boot-time proof, so the wall-clock budget relies on the TSC-calibrated clocksource (the QEMU and GCE case); a tick-derived clock is frozen here and a fixed iteration ceiling is the fail-closed backstop. The egress MAC is re-asserted non-zero / non-broadcast before the marker is emitted, and the marker token carries it (-srcmac.<12hex>).
  • Persistent legacy Nic-cap runtime (virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel feature cloud_gce_legacy_virtio_webui_serving_proof): unlike the one-shot proofs, this runtime brings the legacy device up once at boot and keeps it DRIVER_OK for the whole boot, backing the same typed Nic cap methods the modern shim path serves (transmit @0, macAddress @2, linkStatus @3, receivePoll @4; receive @1 fails closed). RX keeps a small posted buffer pool (RX_POOL_SIZE descriptors, recycled in place after each copy-out); receivePoll is non-blocking and compares the device-written used.idx against a consumed cursor (read_used_idx), so a frame burst that advances the index past cursor + 1 is drained one completion per call instead of being missed by an equality poll. TX publishes one frame at descriptor slot 0 and drains its completion with the same bounded advanced-past-cursor check; an unresolved or divergent completion (and any other ring-integrity violation on either queue) is a fatal error that tears the runtime down through a reset-confirmed fail-stop, after which every cap call fails closed. Frames cross the cap boundary as inline Data with the 10-byte legacy header added/stripped kernel-side; PIO, vring, and DMA-frame ownership stay kernel-side, and release quiesces the device (reset-confirmed before frames are freed). VIRTIO_NET_F_STATUS is not negotiated, so linkStatus reports assumed-up while the runtime is staged. This is the serving bridge the Phase C userspace network stack uses on the GCE NIC shape; proof make run-cloud-gce-legacy-virtio-webui-serving (host HTTP peer fetches the remote-session Web UI bundle through QEMU hostfwd over this datapath).
  • capOS mapping (brokered PIO config access): capOS device authority (DeviceMmio/DDF) is MMIO/memory-BAR based and there is no I/O-port capability. The legacy config window stays kernel-owned: the only sanctioned path to a device’s legacy I/O BAR is the bounds-checked pci::LegacyIoBar accessor (kernel/src/pci.rs), reached through pci::io_bar(device, bar). Every access is range-checked against the device’s claimed I/O BAR window and the 16-bit x86 port space, so a caller cannot reach a port outside the BAR; there is no ambient in/out authority and no port-I/O surface exposed to userspace. PCI I/O decoding is enabled per device via pci::enable_io_space_and_bus_master before any access.
  • Candidate gate (MSI-X not required): the legacy candidate selection (virtio_net_legacy_select_proof::pick_legacy_candidate) accepts a transitional virtio-net function (1af4:1000, network class) whose modern common-config window does not resolve and which exposes a usable I/O BAR0 — without requiring an MSI-X capability, because the polled data path does not depend on interrupt delivery. This is the deliberate relaxation of the modern gate (virtio_net_polled_provider::candidate_from_device, which requires both the modern transport and MSI-X).
  • Fail-closed rules: the brokered status handshake fails closed on any out-of-window access, any device-status regression, a missing VIRTIO_NET_F_MAC, a guest-feature write-back mismatch, or a zero queue-0 size. The data path additionally fails closed on a device-MAC read failure (out-of-window, all-zero, or broadcast), an out-of-window queue PFN/notify access, a PFN read-back mismatch, an advertised queue size that is zero or exceeds the materialization bound, a TX used-ring poll-budget exhaustion, an RX wall-clock-budget exhaustion, a used-ring id/len regression, a zero EtherType, or a final reset that does not settle to 0x00. On any such failure no provider-nic-bound marker is emitted (the gate stays null/fail-closed), and the device is reset and its DMA frames freed regardless of outcome. The completion is observed by polling the legacy used ring with no MSI-X route programmed and no interrupt taken or injected; the provider-nic-bound marker (provider_nic_bind_proof::report_real_completion_legacy) carries honest transport=legacy-pio-virtio-0.9, interrupt_model=polled-no-msix, and userspace_driver_authority=kernel-brokered-legacy-polled labels, and nothing is exported to userspace (host_physical_user_visible=0, direct_dma=blocked).
  • QEMU-emulable vs hardware-only: the legacy shape is QEMU-emulable via qemu-system-x86_64 -device virtio-net-pci,disable-modern=on,vectors=0 (legacy I/O BAR0, INTx, no MSI-X — the faithful GCE shape). Proved locally by make run-cloud-provider-virtio-net-legacy-select on the default non-qemu kernel: the kernel selects the legacy NIC over its I/O BAR0, runs the brokered device-status handshake + 32-bit feature read (observed device_features=0x79bf8064, VIRTIO_NET_F_MAC set) + queue-0 size read (256), and emits cloudboot-evidence: virtio-net-legacy-candidate-selected <token>. The data path is proved by make run-cloud-provider-nic-bound-legacy on the same device shape: the kernel reads the device’s real MAC (52:54:00:12:34:56 under QEMU), materializes the legacy single-PFN virtqueues, TX-submits a broadcast DHCP DISCOVER from that MAC, and completes a real polled RX against the wall-clock budget (observed src_mac=52:54:00:12:34:56, rx_used_len=600, ethertype=0x0800 — the SLIRP DHCP OFFER, an IPv4 frame — tx_used_idx=1, rx_used_idx=1, rx_clock_usable=true, final_status=0x00), then emits exactly one cloudboot-evidence: provider-nic-bound <token> sourced from that completion (token carries -ethertype.0800 and -srcmac.525400123456). The DHCP-discover-from-real-MAC stimulus is the GCE-viable path: GCE’s Andromeda SDN drops egress from a spoofed source MAC/IP and has no 10.0.2.2 ARP responder, so the modern path’s spoofed ARP-to-SLIRP-gateway stimulus would time out on a real NIC; a real-MAC DHCP DISCOVER is answered by both SLIRP and the GCE SDN. The follow-up billable real-GCE run cloud-prod-gce-billable-boot-real-polled-nic-bound passed on 2026-06-02 15:03 UTC through this legacy path: the live 1af4:1000 NIC bound at 00:04.0, materialized the 4096-entry RX/TX vrings, transmitted DHCP DISCOVER from the device MAC, received a real IPv4 frame (rx_used_len=532, ethertype=0x0800), and emitted provider-nic-bound from report_real_completion_legacy. This remains a bounded raw-frame bind proof, not L4 networking or a reusable userspace provider service.
  • kernel/src/virtio.rs – PCI transport discovery, split-ring transport, feature negotiation, framing.
  • kernel/src/cap/network.rs – accepted-socket cap state and the network capability surface.
  • docs/proposals/networking-proposal.md – the userspace network-stack move (Phase C) and the transitional-kernel status.
  • docs/dma-isolation-design.md – the DMA backend and isolation model the userspace successor binds into.

virtio-blk (modern PCI block device)

This is a provenance map for the in-tree virtio-blk driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged it links rather than transcribes. The driver was the first real BlockDevice CapObject, so the treatment is a concise map rather than exhaustive register tables. It reuses the modern split-ring transport seam introduced for virtio-net (virtio-net); this page covers only the block-specific additions.

Status: QEMU fixture, not the production storage route. The kernel-owned virtio-blk driver, its BlockDevice cap arm, and its PCI discovery are all gated behind the qemu cargo feature (diagnose_qemu_virtio_blk in kernel/src/pci.rs; the BlockDeviceBackend::Virtio arm in kernel/src/cap/block_device.rs). The default non-qemu production kernel never enumerates, claims, or binds virtio-blk, and its block_device grant source resolves to the userspace-brokered NVMe BlockDevice arm (BlockDeviceBackend::NvmeBrokered) instead, failing closed when no verified NVMe controller and live device_mmio grant are present. virtio-blk remains as a named local fixture / regression test only – a fully QEMU-emulable end-to-end BlockDevice proof and the substrate the storage-layer (read-only / persistent / writable filesystem) QEMU proofs read through. It is not an ambiguous forward production driver. The kernel broker responsibilities it exercises (PCI claim arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation rejection, and revocation) are the same ones the production userspace storage driver binds into; see §3 capOS mapping.

The driver lives in the virtio-blk section of kernel/src/virtio.rs (VirtioBlkDriver) and the cap surface in kernel/src/cap/block_device.rs (BlockDeviceCap).

1. Spec basis

  • Device: virtio block device, modern (virtio 1.x) PCI transport. PCI vendor 0x1af4; device 0x1042 (modern) / 0x1001 (transitional). IDs at kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_BLK_MODERN_DEVICE_ID, VIRTIO_BLK_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_blk). Up to device_dma::MAX_VIRTIO_BLK_DEVICES functions are bound, each in its own const-generic driver slot (VIRTIO_BLK_DRIVER_0 / VIRTIO_BLK_DRIVER_1) so the two devices cannot alias DMA or queue state. The target disk is selected by manifest PCI identity; the ordinary boot/storage disk resolves to the non-target disk when both are present.
  • Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.2 (block device).
  • Reference: cross-checked against the Linux virtio_blk driver for the request framing and the virtio_pci_modern modern-transport handshake.

2. Wire format (implemented subset)

The modern PCI capability parsing, common-config register map, split-ring descriptor layout, and feature-negotiation handshake are the shared transport seam documented in virtio-net §2 (kernel/src/virtio.rs transport module, ModernTransport, Virtqueue, DescriptorTrackingSlot). Only the block-specific subset is summarized here.

  • Feature negotiation: the driver requires and selects only VIRTIO_F_VERSION_1 (read_device_features / write_driver_features in VirtioBlkDriver::initialize); a device that does not offer it fails closed with BlkInitError::MissingRequiredFeatures. No block feature bits (read-only, multi-queue, discard, …) are negotiated, so the device is driven as a single read/write request queue.
  • Device config (capacity): the block device config space carries the capacity in 512-byte sectors as a little-endian u64 (VIRTIO_BLK_CONFIG_CAPACITY_LEN = 8 bytes, read low/high in initialize). A config region shorter than that, or a zero capacity, fails closed (BlkInitError::DeviceConfigTooSmall / ZeroCapacity).
  • Request queue: a single request virtqueue (queue 0). The negotiated size is clamped to the largest power of two not exceeding both the device-advertised COMMON_QUEUE_SIZE and VIRTIO_BLK_REQUEST_QUEUE_SIZE (8); a usable size below 4 (one request chain needs 3 descriptors) fails closed. Per-queue notify address is computed from notify_off_multiplier like any modern virtio queue.
  • Request framing (VirtioBlkDriver::issue_request): each request is a 3-descriptor chain over one bounce-buffer page (ChainSegment):
    1. headerVIRTIO_BLK_REQ_HEADER_LEN (16) bytes, device-readable: type (u32VIRTIO_BLK_T_IN = 0 read / VIRTIO_BLK_T_OUT = 1 write), a reserved u32, and the sector (u64 LBA), at VIRTIO_BLK_HEADER_OFFSET (0).
    2. data512 * count bytes at VIRTIO_BLK_DATA_OFFSET (512), device-writable for reads, device-readable for writes.
    3. status – 1 byte at VIRTIO_BLK_STATUS_OFFSET (16), device-writable; pre-seeded with VIRTIO_BLK_STATUS_SENTINEL (0xff) and checked for VIRTIO_BLK_S_OK (0) after completion (BlockDeviceRequestError::DeviceStatus otherwise).
  • Completion: QEMU completes virtio-blk requests synchronously, so the driver notifies the queue and polls the used ring (poll_used_within_ns, bounded by the real-time VIRTIO_BLK_COMPLETION_BUDGET_NS budget with the VIRTIO_BLK_COMPLETION_FALLBACK_SPIN_LIMIT spin-count backstop when the monotonic clocksource is tick-derived) rather than waiting on the request MSI-X interrupt, which is claimed but left masked (see §3). The bound is time-based because the device side includes QEMU’s host file I/O, whose latency a raw spin count does not track.

3. capOS mapping

  • Binding (qemu fixture, in-kernel): virtio-blk is driven in the kernel and only under the qemu feature. Unlike the userspace storage driver, it does not receive DeviceMmio/Interrupt/DMAPool caps; instead VirtioBlkDriver::initialize binds authority through the kernel device_manager transactions – claim_pci_function(.., DeviceOwner::VirtioBlk) then attach_dmapool_record_with_remapping / attach_devicemmio_record / attach_interrupt_source. The BlockDevice cap is the userspace-facing surface; the hardware authority stays kernel-owned. This in-kernel ownership is why the driver is kept as a qemu fixture rather than a production route: the production BlockDevice is served by the userspace-brokered NVMe provider chain (BlockDeviceBackend::NvmeBrokered, gated on a verified controller and a live device_mmio grant), where the device-specific protocol logic runs in userspace over DeviceMmio/DMAPool/Interrupt caps and the kernel retains only broker/admission/isolation/revocation.
  • MMIO: the modern-transport common/notify/ISR/device-config regions are mapped from the device BARs (map_blk_region over pci::map_bar_region) and recorded with device_manager::attach_devicemmio_record against the first decoded memory BAR. Doorbell (queue-notify) writes are scoped to the per-queue notify address computed from notify_off_multiplier. The DDF DeviceMmio cap (kernel/src/cap/device_mmio.rs) is the userspace successor surface.
  • Interrupt: one MSI-X route is registered for the request queue (VIRTIO_BLK_REQUEST_MSIX_ENTRY = 0, PciMsixInterruptRole::BlockRequestQueue), claimed (DeviceInterruptDriver::VirtioBlk) and attached to the device handle for authority binding, but left masked: completion is by polled used ring, not interrupt delivery. Route records are tracked by the kernel-owned device-interrupt ledger (kernel/src/device_interrupt.rs).
  • DMA: each bound device gets its own DMA pool (device_dma::begin_virtio_blk_pool, keyed by the const-generic DEV index via VirtioBlkDma<DEV>). Ring pages and the request bounce buffer are allocated and accounted through the blk-keyed ledger (allocate_virtio_blk_page / register_virtio_blk_queue / record_virtio_blk_submission/..._completion_for_allocation in kernel/src/device_dma.rs). DMA uses the manager-owned bounce-buffer backend; no host physical address or IOVA is exposed to userspace – the request MSI-X route is kept masked specifically so no raw address leaves the kernel boundary.
  • BlockDevice cap surface: BlockDeviceCap (kernel/src/cap/block_device.rs) is scoped to one device_index and routes the schema’s readBlocks/writeBlocks/info/flush methods (schema/capos.capnp interface BlockDevice) to that device only, failing closed when it is not bound. Under the qemu feature the block_device KernelCapSource reaches the resolved boot/storage virtio-blk disk, and the block_device_target source requires SystemConfig.blockDeviceTarget.pci (schema/capos.capnp) and resolves that PCI segment:bus:device.function selector to a bound non-boot virtio-blk device; absent, mismatched, or boot-disk selectors fail closed. In the production (non-qemu) kernel the same block_device source instead mints the NvmeBrokered arm, and block_device_target fails closed (requires the qemu feature). The read-only/ persistent/writable filesystem and store caps (readonly_fs, persistent_store, writable_fs) layer their on-disk formats over whichever BlockDevice backs the boot/storage cap – the virtio-blk fixture under qemu, the brokered NVMe arm in production.
  • Fail-closed / validation rules: VirtioBlkDriver::validate_range rejects a zero count, a count over VIRTIO_BLK_MAX_SECTORS_PER_REQUEST (7 – bounded so header + status + 512 * count fit one 4 KiB page), start_lba + count arithmetic overflow, and any range past the reported capacity_sectors, all before device access. The cap layer additionally enforces that writeBlocks data length equals count * 512 (BlockDeviceRequestError::DataLengthMismatch). A non-OK device status, a used-ring poll timeout, or a DMA accounting failure each fail closed (DeviceStatus / Completion / Accounting). Descriptor reuse is generation-tracked through the shared bounded tracking-slot array.
  • QEMU-emulable vs hardware-only: fully QEMU-emulable, and these are the fixture gates. QEMU provides virtio-blk-pci; make run-virtio-blk is the single-device end-to-end BlockDevice fixture, make run-multi-virtio-blk proves the two-device (boot + target) binding with independent per-device DMA pools, make run-blockdevice-target-identity proves manifest identity selection when PCI/BDF order would otherwise bind the intended target first, and make run-virtio-blk-failover exercises the multi-device failover path. All are --features qemu fixtures over dedicated system-virtio-blk.cue / system-multi-virtio-blk.cue / system-blockdevice-target-identity.cue manifests, not production-storage evidence. No hardware-only path. The production-storage gate is the userspace-brokered NVMe BlockDevice chain (make run-cloud-provider-nvme-blockdevice-read-graduated and the other run-cloud-provider-nvme-blockdevice-* proofs).
  • kernel/src/virtio.rs – the virtio-blk driver (VirtioBlkDriver), request framing, queue setup, and the shared modern split-ring transport.
  • kernel/src/cap/block_device.rs – the BlockDevice cap surface (BlockDeviceCap) routing schema methods to a single bound device.
  • kernel/src/device_dma.rs – the per-device virtio-blk DMA pool/queue ledger.
  • kernel/src/device_interrupt.rs – the request-queue MSI-X route record.
  • schema/capos.capnp (interface BlockDevice) – the readBlocks/writeBlocks/info/flush contract.
  • docs/dma-isolation-design.md – the DMA backend and isolation model the userspace successor binds into.

FAT32 (read-only filesystem backer)

This is a provenance map for the read-only FAT32 Directory/File backer, part of the real-filesystem role-split (docs/proposals/real-filesystem-decision.md). It is a filesystem-format reader layered over a block device, not a hardware device page; like atapi-iso9660.md it documents an on-disk format and the capOS cap surface over it, citing the spec and the vendored parser rather than re-specifying FAT.

The backer lives in kernel/src/cap/fat_fs.rs and reads through the vendored fatfs no_std crate (vendor/fatfs-no_std/). Its sector reads go through a BlockSource seam with two mutually-exclusive variants (mirroring readonly_fs.rs): a Virtio arm (compiled under storage_fat_read, reading the kernel-owned virtio-blk device) and an Nvme arm (compiled under cloud_fat_read_over_nvme_proof, reading a cloud-attached NVMe namespace through the always-built brokered read window op). The module compiles under either feature.

1. Spec basis

  • Format: FAT32, the File Allocation Table filesystem as standardized by Microsoft’s FAT: General Overview of On-Disk Format (the FAT32 File System Specification, v1.03) and the EFI FAT specification. The relevant structures are the BIOS Parameter Block (BPB) in the boot sector, the FAT32 FSInfo sector, the File Allocation Table itself (the cluster chain), and the directory-entry records (8.3 short entries plus VFAT long-file-name entries).
  • Parser provenance: full FAT parsing is delegated to the vendored fatfs crate (vendor/fatfs-no_std/rust-fatfs-0.4.0, upstream rafalh/rust-fatfs, commit pinned in vendor/fatfs-no_std/VENDORED_FROM.md, MIT). capOS supplies the block-backed storage adapter, the cap surface above it, and an independent bounded validation subset over the BPB, root chain, root entries, and root-level file FAT chains before exposing a root cap.
  • Already a boot-path format: FAT32 is the EFI System Partition format Limine reads, so it is structurally part of the boot path already (docs/backlog/hardware-boot-storage.md); this backer is the first capOS reader of a host-authored FAT32 image.

2. Wire format (implemented subset)

Only the read path is exercised; FAT write (cluster allocation, FSInfo/FAT mutation, directory-entry creation) is out of scope and fails closed.

  • Mount (fat_fs::mount_fatfs -> fatfs::FileSystem::new): capOS first performs a bounded FAT32 preflight over the boot sector / BPB and primary FAT: 512-byte sectors, FAT32 geometry, root cluster in range, bounded root directory chain, and bounded root-level file chains. It then lets fatfs mount and asserts FileSystem::fat_type() == FatType::Fat32, failing the grant closed for FAT12/FAT16 or malformed images. The mount performs no writes (it reads the boot sector + FSInfo only); a clean mkfs.fat image keeps the BPB dirty flag clear, so the fatfs Drop/unmount path performs no writes either. The virtio arm (fat_fs::mount_root) mounts eagerly at grant time; the NVMe arm (fat_fs::mount_root_nvme) defers the mount to the first Directory.list/open (the FatMount::Deferred -> Ready transition in FatMount::ensure), because the brokered NVMe controller is brought up by the userspace provider after the grant resolves – mirroring readonly_fs::mount_root_nvme.
  • Directory listing (Directory.list @1): fatfs::Dir::iter() walks the root directory entries; capOS copies each entry’s file_name() (LFN or case-normalized 8.3), len(), and is_dir() into the DirEntry reply (FatFsDirectoryCap::collect_entries). The volume-label entry is skipped by the iterator. capOS bounds the exposed root to MAX_DIRECTORY_ENTRIES (64) visible entries.
  • Open (Directory.open @0): resolves a root-level file name to its size and raw FAT directory-entry timestamp metadata (FatFsDirectoryCap::lookup_file_metadata, rejecting directories and missing names) and mints a File cap recording the name, size, and bounded timestamp metadata. The write-implying CREATE/TRUNCATE flag bits, nested (/-bearing) paths, and files larger than MAX_FILE_BYTES (64 KiB) are rejected.
  • Read (File.read @0): re-opens the file by name through the shared locked mount, seeks to the requested offset, and reads up to length bytes via fatfs::Read, walking the cluster chain. The covering read is clamped to end-of-file. fatfs resolves the FAT cluster chain; a multi-cluster file exercises the chain walk, not just the root entry. capOS preflight bounds the root-level file chain length and rejects cycles/bad/out-of-range cluster values before exposing the root cap.
  • Stat (File.stat @2): reports the file size plus FAT directory-entry created/modified timestamps when the FAT fields are valid. capOS converts the FAT date/time fields to Unix epoch nanoseconds by interpreting the timezone-free FAT local-time value as UTC for this bounded local proof. Missing, zero, or invalid FAT date/time fields map to 0, the schema’s unstamped/unsupported value, rather than inventing trusted time. The File.stat ABI remains schema-stable and carries timestamp values only; proof logs label the source as metadata_provenance=fat-directory-entry, clock_provenance=none, and trusted_clock=false. FAT modification time has two-second granularity; FAT creation time has an optional high-resolution byte (0..=199 in 10ms units), and out-of-range high-resolution values are rejected before timestamp conversion.
  • Not implemented: every mutation (Directory.mkdir/remove/sub/ create/rename, File.write/truncate/sync) returns a typed error at the cap layer; the FAT/FSInfo/long-name write paths in fatfs are never reached.

3. capOS mapping

  • Binding (kernel-owned read; behind read_only_fs_root): the FAT32 root Directory is granted through the existing KernelCapSource::ReadOnlyFsRoot source. Under a plain qemu build that source resolves to the capOS-authored CAPOSRO1 backer (cap::readonly_fs); under storage_fat_read it resolves to this FAT backer’s virtio arm (cap::fat_fs::mount_root); under cloud_fat_read_over_nvme_proof it resolves to the FAT backer’s NVMe arm (cap::fat_fs::mount_root_nvme, bound to the live device_mmio handle the production grant source staged – the device_mmio grant must precede read_only_fs_root in the manifest). All three are wired in the KernelCapSource::ReadOnlyFsRoot arms of kernel/src/cap/mod.rs (boot PID 1) and kernel/src/cap/process_spawner.rs (spawned services). Selecting the backer by feature mirrors how the same source already selects its Virtio vs NVMe backend, so no new KernelCapSource and no schema/capos.capnp change is needed – the Directory/File contract already carries every field.
  • Directory/File cap surface: FatFsDirectoryCap / FatFsFileCap implement Directory.list/open + File.read/stat/close; every mutating method fails closed. Read-only is structural (distinct CapObject types that expose no mutation), not a rights flag. open mints the File result cap Copy/SameSession, so a holder can forward a read view to a same-session spawn child without conferring write authority (the same posture readonly_fs/installable_image use).
  • MMIO / Interrupt: none authored by the backer. It holds no device registers and binds no interrupt. The virtio arm reads sectors through the kernel-owned virtio-blk BlockDevice free functions (crate::virtio::block_device_info / block_device_read_blocks / block_device_max_sectors_per_request); the NVMe arm reads through the always-built brokered read window op (device_manager::nvme_brokered_io_sync_read_window_op_for_cap) bound to the granted device_mmio handle/owner – the same read arm the NVMe BlockDevice graduation and readonly_fs.rs’s NVMe arm use. The NVMe controller bring-up (reset/enable/IDENTIFY/CREATE I/O queue) is driven by the userspace provider and the shared cap-waiter Interrupt route, not by this backer.
  • DMA: no new DMA surface. The storage adapter (fat_fs::BlockStorage) translates the fatfs byte cursor to whole sectors and reads through the active BlockSource arm’s bounce-read path: the virtio bounce path, or the NVMe brokered window op (bounded fail-closed to one 4 KiB PRP1 page per read, with a manager-owned bounce page whose PRP1 never reaches userspace). There is no host-physical/IOVA export on either arm.
  • Fail-closed / validation rules: capOS does not treat fatfs as a hostile FAT validator. The wrapper performs its own bounded preflight before granting or lazily completing the NVMe mount: BPB/device geometry must fit the active medium; the root directory FAT chain must end within the root byte budget and be cycle-free; the visible root entry count is capped at 64; each root-level regular file must be at most MAX_FILE_BYTES (64 KiB), and its FAT chain must end within that bounded file budget without cycles, bad clusters, or out-of-range cluster values. The storage adapter also clamps every read to the device byte capacity (BlockStorage::read); the virtio arm’s block_device_read_blocks and the NVMe arm’s window op each range-validate the LBA against the device/namespace geometry before issuing the request. The virtio arm queries the live virtio-blk geometry at construction; the NVMe arm’s capacity comes from the IDENTIFY Namespace claim (device_manager::nvme_namespace_geometry_for_cap, NSZE + active-LBA-format size) – the same IDENTIFY-derived geometry the readonly_fs/ persistent_store/writable_fs NVMe arms consult – and the deferred mount fails closed while that claim is unavailable or reports a non-512-byte active LBA format. The adapter’s Write impl always errors.
  • QEMU-emulable vs hardware-only: fully QEMU-emulable on both arms. The host image is built with real mkfs.fat + mcopy (tools/mkstorage-fat-read-image.py, a 64 MiB FAT32 image, >= 2 files, one multi-cluster, with deterministic FAT directory-entry timestamps on the known files). The virtio arm attaches it as a virtio-blk disk (make run-storage-fat-read); the combined timestamp/provenance proof runs that virtio arm plus the NVMe arm through make run-storage-fat32-timestamp-provenance. The NVMe arm attaches the same image as a pre-populated -device nvme namespace (make run-cloud-provider-fat-read-over-nvme), reading the multi-cluster file back through Directory.open -> File.read over the NVMe BlockSource and asserting the round-tripped bytes, File.stat timestamp values and provenance proof lines, plus the fail-closed mutations. No hardware-only path.
  • kernel/src/cap/fat_fs.rs – the FAT32 Directory/File backer, the BlockSource seam (Virtio / Nvme), the BlockStorage adapter, the deferred FatMount, and the mount_root / mount_root_nvme grant entry points.
  • kernel/src/cap/fat_read_over_nvme_proof.rs – the NVMe-arm cap-waiter Interrupt route + headline marker (provider-fat-read-over-nvme) for the NVMe proof.
  • vendor/fatfs-no_std/ – the vendored fatfs no_std read parser and its VENDORED_FROM.md provenance.
  • kernel/src/cap/readonly_fs.rs – the CAPOSRO1 backer the read_only_fs_root source resolves to under a plain qemu build, and the BlockSource pattern the FAT NVMe arm mirrors.
  • docs/proposals/real-filesystem-decision.md – the role-split decision and the phased plan this read-only FAT32 backer is part of.

virtio-rng (modern PCI entropy device)

This is a provenance map for the in-tree virtio-rng path: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged it links rather than transcribes.

Unlike virtio-net and virtio-blk, the virtio-rng device does not back a userspace-facing capability. It is a QEMU-only proof fixture, not a production driver, and not forward DDF production evidence: the entropy the device produces is consumed only by in-kernel proofs, never handed to a process. The capOS EntropySource capability is a separate, RDRAND-backed path (kernel/src/cap/entropy_source.rs, fill_random / rdrand64 / has_rdrand; per-call bound MAX_ENTROPY_FILL_BYTES) and does not touch this device. This classification is asserted, not just documented: on every cfg(qemu) boot diagnose_qemu_virtio_rng emits a deterministic marker (virtio-rng: classification=qemu-only-proof-fixture userspace_capability=none production_driver=no ...) that make run-iommu-remapping (tools/qemu-iommu-remapping-smoke.sh) requires, so a regression that promoted this path into a production-driver claim would fail the smoke. virtio-rng exists in the tree for two reasons:

  1. A DDF metadata-diagnostics path that exercises modern-transport discovery, MSI-X metadata selection, and the device-manager ownership/teardown/grant-source hooks against a real PCI function on every cfg(qemu) boot (kernel/src/virtio.rs diagnose_virtio_rng_metadata, driven from kernel/src/pci.rs diagnose_qemu_virtio_rng).
  2. An IOMMU VT-d second-level remapping hardware-DMA proof vehicle (the Slice A2/B/C proofs in kernel/src/iommu.rs, driven through kernel/src/virtio.rs prove_iommu_rng_mapped_dma / prove_iommu_rng_unmapped_dma / prove_iommu_rng_stale_dma). This is the minimal real virtqueue driver QEMU’s entropy device lets us stand up to prove a device DMA actually walks the programmed translation tables.

It reuses the modern split-ring transport seam introduced for virtio-net (virtio-net); this page covers only the rng-specific usage.

1. Spec basis

  • Device: virtio entropy device, modern (virtio 1.x) PCI transport. PCI vendor 0x1af4; device 0x1044 (modern) / 0x1005 (transitional). IDs at kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_RNG_MODERN_DEVICE_ID, VIRTIO_RNG_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_rng). QEMU exposes it as virtio-rng-pci-non-transitional (see §3).
  • Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.4 (entropy device).
  • Reference: cross-checked against the Linux virtio_rng driver for the single-request-queue model and the virtio_pci_modern modern-transport handshake.

2. Wire format (implemented subset)

The modern PCI capability parsing, common-config register map, split-ring descriptor layout, and feature-negotiation handshake are the shared transport seam documented in virtio-net §2 (kernel/src/virtio.rs transport module, ModernTransport, the COMMON_* register offsets, VIRTQ_DESC_F_WRITE). The rng path discovers that transport through discover_virtio_rng_metadata_transport and maps regions with map_region. Only the rng-specific subset is summarized here.

  • Device shape: the transport discovery reports whether the function is a modern device id or a transitional id that still exposes modern capabilities (DeviceShape::Modern / DeviceShape::TransitionalWithModernCaps); both are driven through the modern path. The entropy device has no device-specific config space and no device-specific feature bits.
  • Single request queue: the entropy device exposes one virtqueue, the requestq (VIRTIO_RNG_REQUEST_QUEUE = queue 0). The IOMMU proof drives it at a deliberately small VIRTIO_RNG_PROOF_QUEUE_SIZE (2) – a single in-flight descriptor is enough to prove a DMA through translation, and a power of two keeps the ring layout legal. The per-queue notify address is computed from notify_off_multiplier like any modern virtio queue.
  • Request framing: each request is a single device-writable descriptor (VIRTQ_DESC_F_WRITE) pointing at a buffer the device fills with entropy. The proof requests VIRTIO_RNG_PROOF_REQUEST_LEN (64) bytes (rng_publish_descriptor_and_notify writes the 16-byte descriptor and bumps the available ring; completion is read from the used ring’s { id:u32, len:u32 } entry). There is no request header or status byte – the entropy device just writes bytes into the supplied buffer.
  • Feature negotiation: virtio-rng offers no device-specific features. The metadata path negotiates nothing; the IOMMU hardware-DMA proof requires both VIRTIO_F_VERSION_1 (modern transport) and VIRTIO_F_ACCESS_PLATFORM – the latter is what makes QEMU route the device’s DMA through the platform IOMMU and consume the IOVAs the driver programs into the ring registers, rather than treating them as host-physical addresses. A device that does not offer both fails the proof closed (rng-missing-access-platform-feature).
  • Completion: the proof polls the used ring (hhdm_read_u16 of used.idx, bounded by VIRTIO_RNG_USED_POLL_LIMIT) rather than waiting on the request interrupt; the MSI-X path is exercised at the metadata level only (see §3).

3. capOS mapping

  • Binding (transitional, in-kernel, no userspace cap): virtio-rng is driven entirely in the kernel and is not exposed to userspace at all – there is no RandomNumberGenerator/EntropySource-style cap routed to this device. The metadata-diagnostics path runs on every cfg(qemu) boot from kernel/src/pci.rs diagnose_qemu_virtio_rng; the hardware-DMA proofs run under the run-iommu-remapping target only.
  • Device-manager authority (metadata path): diagnose_virtio_rng_metadata binds authority through the kernel device_manager against DeviceOwner::VirtioRng – it proves QEMU ownership (prove_qemu_ownership), teardown triggers, and the DeviceMmio/DMAPool/DMABuffer cap release / driver-crash / reset-disable hooks, then logs the devicemmio / dmapool / interrupt grant-source status (devicemmio_grant_source::log_status and the dmapool / interrupt equivalents). This is the same DDF ledger the cloud-NIC and block drivers bind through; virtio-rng is the function the bring-up hooks are proved against.
  • MMIO: the modern-transport common/notify/ISR/device-config regions are mapped from the device BARs (map_region over pci::map_bar_region) into the device-uncacheable (NO_CACHE) window; the metadata path additionally logs each decoded region (log_device_region). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed from notify_off_multiplier.
  • Interrupt: MSI-X is handled at the metadata level – the request queue uses VIRTIO_RNG_MSIX_METADATA_ENTRY (0) and requires VIRTIO_RNG_MSIX_REQUIRED_ENTRIES (1) usable table entries; the plan is selected by select_virtio_rng_msix_plan and the route programming is proved by prove_virtio_rng_msix_metadata_route. The hardware-DMA proof completes by polling the used ring, so it does not arm a completion-IRQ waiter.
  • DMA: the IOMMU proof’s descriptor table, available ring, used ring, and request buffer are placed at programmed IOVAs carried in the iommu::IommuRngDmaVehicle, never at host-physical addresses; once GCMD.TE is set every DMA the device issues must walk the second-level table the IOMMU module installed. The ring pages are zeroed through the HHDM before their IOVAs are handed to the device so a stale reading can never be mistaken for a completion. No host physical address or IOVA leaves the kernel boundary.
  • Fail-closed / validation rules: the proof fails closed at every step – transport discovery, bus-master enable, MMIO map, reset handshake, the required-feature check, notify-offset/​map-length overflow, queue-size floor, and queue-enable rejection each return a distinct failed(...) reason rather than proceeding. A page whose invalidation never completes is not freed (a page freed before invalidation completes would be a stale-DMA hole). The unmapped-IOVA and stale-IOVA re-drives must fault in the IOMMU (FSTS.PPF / FRCD[0].F) instead of reaching memory.
  • QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides virtio-rng-pci-non-transitional (the shared QEMU_SECOND_DEVICE default); make run-iommu-remapping overrides it with iommu_platform=on behind an intel-iommu device and is the end-to-end proof of the mapped-IOVA hardware DMA, the unmapped-IOVA fault, and the Slice C two-phase revocation / stale-DMA fault. The DDF metadata diagnostics emit on every cfg(qemu) boot. No hardware-only path.
  • kernel/src/virtio.rs – the rng metadata diagnostics (diagnose_virtio_rng_metadata), the IOMMU hardware-DMA proof driver (prove_iommu_rng_mapped_dma / prove_iommu_rng_unmapped_dma / prove_iommu_rng_stale_dma), and the shared modern split-ring transport.
  • kernel/src/iommu.rs – the VT-d Slice A2/B/C remapping, fault, and revocation proofs that drive this device.
  • kernel/src/cap/entropy_source.rs – the separate RDRAND-backed EntropySource capability (this device backs no capability).
  • docs/dma-isolation-design.md – the DMA backend and isolation model the IOMMU remapping proofs validate.

NVMe (NVM Express controller)

This is a provenance map for the NVMe controller wire subset the kernel Model B on-notify DMA validator scans on the doorbell/queue-arm path. It cites the spec basis, summarizes only the register and descriptor fields the validator actually reads, and points into the implementation by symbol name. It is not a re-spec.

Maturity caveat. This page documents the DMA validator mechanism, a brokered no-IOMMU bring-up through one bounded I/O read on the local QEMU make run-pci-nvme gate, and one bounded live-GCE Persistent Disk proof for the production provider-nvme-io-read path. It is still not a general production NVMe driver, not broad GCP/AWS/Azure storage readiness, and not a provider-visible address or direct-DMA claim. It also records the 2026-05-27 correction: on the current no-IOMMU gate, provider-written queue-base or PRP addresses would be host physical addresses, so the live no-IOMMU path must be brokered by the kernel/device manager unless a verified IOMMU/vIOMMU or synthetic address namespace is added. The capabilities implemented against make run-pci-nvme and the later production cloudboot gates:

  • nvme-doorbell-dma-validator is the kernel on-notify scan (kernel/src/cap/nvme_doorbell_validator.rs); it proves its invariants with a cfg(qemu) self-test (prove_qemu_on_notify_scan_contract) using synthetic owner windows in place of a live grant ledger.

  • nvme-bind-claimed-mmio-read adds the read-only userspace bind (§4): the kernel claims the enumerated controller, preseeds its BAR0 controller-register page, and stages the DMAPool/DeviceMmio/Interrupt bootstrap grant sources against it, and the userspace nvme-bringup-smoke provider reads CAP/VS/CC/CSTS through the brokered claim, proving the claim reaches a coherent NVMe BAR0 (live CAP, valid VS version). The controller is firmware-initialized under SeaBIOS NVMe boot-probe (CC.EN=1, CSTS.RDY=1), so the provider reports the observed enable/ready state rather than asserting reset.

  • nvme-controller-reset-selected-write adds the userspace controller reset (§5): the DeviceMmio grant now carries a reset-only NVMe controller-register selected-write claim scoped to CC, and the provider drives the firmware-enabled controller to a known reset state (CC.EN=0CSTS.RDY=0). This is the first genuine userspace NVMe controller-register write.

  • nvme-no-iommu-brokered-controller-enable adds the brokered no-IOMMU enable (§6): the kernel authors AQA/ASQ/ACQ from the live DMA ledger and performs the CC.EN write, reaching CSTS.RDY=1 without exposing a queue-base address.

  • nvme-admin-queue-identify + nvme-admin-interrupt-delivery add the brokered admin SQ/CQ doorbell and one interrupt-driven IDENTIFY (§7).

  • nvme-io-queue-and-read adds the brokered I/O queue pair and one bounded READ (§8) – the last piece of the userspace storage-provider foundation.

  • cloud-prod-nvme-userspace-provider-readonly-bind-local-proof ports the same-BDF read-only bind shape onto the non-qemu cloudboot kernel under the cloud_nvme_readonly_bind_proof Cargo feature. The feature constrains the three production grant sources (devicemmio_grant_source_prod, dmapool_grant_source_prod, interrupt_grant_source_prod) to the NVMe function (class 0x01 subclass 0x08); the userspace cloud-nvme-readonly-bind-smoke provider receives a same-BDF DeviceMmio/DMAPool/Interrupt bundle, reads CAP_LO/CAP_HI/VS/CC/ CSTS via brokered DeviceMmio.read32, releases the three caps, and asserts stale-handle rejection on each. Proof: make run-cloud-provider-nvme-readonly-bind. No CC.EN write, no admin or I/O queue, no IDENTIFY, no Interrupt.wait, no DMA, no live cloud.

  • cloud-prod-nvme-controller-reset-selected-write-local-proof layers the reset-only CC selected-write authority on the same-BDF bundle under the cloud_nvme_controller_reset_proof Cargo feature (which implies cloud_nvme_readonly_bind_proof, so the picker constraints are inherited). The kernel admits exactly one brokered DeviceMmio.write32 shape through kernel::device_manager::stub::write_devicemmio_u32: a CC write (offset 0x14) whose CC.EN (bit 0) is cleared. A CC write that sets CC.EN fails closed with devicemmio-nvme-cc-enable-deferred, a write to any non-CC offset fails closed with devicemmio-write32-register-unclaimed, and an out-of-range or unaligned offset fails closed at the range validator, all before any volatile MMIO write touches the BAR. The userspace cloud-nvme-controller-reset-smoke provider receives the same-BDF bundle, reads CAP_LO/CAP_HI/VS/CC/CSTS, exercises the two fail-closed write probes (CC.EN=1, non-CC offset 0x18), performs the admitted CC.EN=0 reset write, polls CSTS until CSTS.RDY=0, re-reads CC to assert CC.EN=0, releases the three caps, and confirms stale-handle rejection. Proof: make run-cloud-provider-nvme-controller-reset. No CC.EN=1 write, no admin or I/O queue, no IDENTIFY, no Interrupt.wait, no DMA, no live cloud.

  • cloud-prod-nvme-admin-queue-materialization-local-proof materializes the admin SQ and admin CQ backing buffers on the same-BDF bundle under the cloud_nvme_admin_queue_materialization_proof Cargo feature (which implies cloud_nvme_controller_reset_proof, which implies cloud_nvme_readonly_bind_proof, so the picker constraints and the reset-only CC.EN=0 claim are inherited). No new kernel admission surface is added: the production device_manager::stub already supports manager-owned bounce-buffer allocation through stage_bounce_buffer_dmapool_record + issue_manager_attached_dmabuffer_handle_with_request (a fresh zeroed-on-alloc kernel frame per buffer), scrub-before-frame-free on detach_dmabuffer_record_for_cap_release, and stale-handle rejection on the parked-slot ledger. The userspace cloud-nvme-admin-queue-materialization-smoke provider receives the same-BDF bundle, sequentially materializes the admin SQ backing buffer and the admin CQ backing buffer through the brokered DMAPool.allocateBuffer + DMABuffer.{info,map,unmap,freeBuffer} path (asserting userspace_dma_buffer=manager-issued-bounce-buffer, iova_export=disabled-future-only, host_physical_user_visible=false, and device_iova=0 on each), writes and reads back a deterministic 256-byte template through the userspace VMA, asserts the freshly-allocated admin CQ frame reads back as zero before the write (scrub-before-reuse, paired with the admin SQ’s scrub-before-frame-free), confirms post-free DMABuffer.map fail-closed on the stale handle, emits one cloudboot-evidence: provider-nvme-admin-queue-materialization <token> marker recording both manager-owned pool/buffer slot/generation identities and the discipline labels, releases the three bundle caps, and confirms stale-handle rejection on each. Proof: make run-cloud-provider-nvme-admin-queue-materialization. No NVMe controller register WRITE on this path (the kernel still admits the reset-only CC.EN=0 claim from the controller-reset sibling, but this smoke never calls DeviceMmio.write32), no AQA/ASQ/ACQ publication, no CC.EN=1, no I/O queue allocation, no IDENTIFY, no PRP/SGL publication, no doorbell write, no Interrupt.wait/ Interrupt.acknowledge, no host-physical or IOVA export, no live cloud.

  • cloud-prod-nvme-brokered-controller-enable-local-proof enables the controller through manager-authored AQA/ASQ/ACQ plus a provider-supplied CC.EN=1 write under the cloud_nvme_controller_enable_proof Cargo feature (which implies the three earlier features). The production device_manager::stub parked-pool slot holds two simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 2, PARKED_DMABUFFER_SLOTS = [1, 2]) so the admin SQ and admin CQ can stay parked together; the bounce-buffer grant proof and the virtio-net live-publish proof keep their existing single-buffer behavior (slot 0). The provider-supplied CC.EN=1 write of this path is superseded by cloud-prod-nvme-controller-enable-manager-op-remediation below, which makes raw DeviceMmio.write32(CC, value with CC.EN=1) fail closed before any MMIO side effect and exposes controller enable only through the no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6). The parked-pool slot capacity, the [1, 2] slot ids, the AQA depth policy, and the four MMIO writes the manager authors all carry over unchanged.

  • cloud-prod-nvme-controller-enable-manager-op-remediation corrects the brokered enable contract. Raw DeviceMmio.write32(CC, value with CC.EN=1) now fails closed with authority_result=devicemmio-nvme-cc-enable-raw-blocked / authority_reason=cc-enable-requires-broker-nvme-controller-enable-op before any volatile MMIO side effect. Controller enable is reachable only through the new no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6), which carries no offset, value, queue address, queue id, PRP/SGL, or provider-selected controller-bit parameter. The verb routes to the renamed manager-authored nvme_brokered_controller_enable_op_for_cap in kernel/src/device_manager/stub.rs. The manager: (1) validates the cap’s BAR matches the parked region and covers the CC/AQA/ASQ/ACQ register span; (2) resolves the two parked admin queue DMABuffers (slot order: SQ then CQ) and requires both to be live, unmapped, and frame-aligned; (3) selects every controller bit internally – CC.EN | IOSQES=6 | IOCQES=4 (NVMe Base Spec §3.1.5); (4) authors AQA = ((depth-1)<<16) | (depth-1) with depth 8, ASQ low/high from the admin SQ buffer’s phys address, ACQ low/high from the admin CQ buffer’s phys address through the boot-preseeded BAR0 kernel mapping; and (5) performs the manager-selected CC.EN=1 write. The provider supplies no parameters and never observes a host-physical / device-visible queue-base address. The cap dispatch admission carries authority_result=ok, register_write=performed, side_effect=mmio-write-performed, cc_en_write_performed=true, aqa_authored=true, asq_authored=true, acq_authored=true, and queue_base_source=manager-ledger. The kernel diagnostic line is now nvme: brokered-enable owner=cloud-nvme model=cloud-bounce validator=none trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 aqa=0x00070007 cc=0x... asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false proof_result=ok; trigger=manager-op proves the admission entered through @6, not through a raw CC write32. The cloud-nvme-controller-enable-smoke provider proves both the new fail-closed raw CC.EN=1 probe and the manager-op enable, and the headline cloudboot-evidence: provider-nvme-controller-enable <token> marker pins brokered_enable_trigger=manager-op and cc_raw_enable_write=refused. Proof: make run-cloud-provider-nvme-controller-enable. No IDENTIFY, admin or I/O queue command, PRP/SGL publication, doorbell write, Interrupt.wait/Interrupt.acknowledge, host-physical or IOVA export, or live cloud is claimed.

  • cloud-prod-nvme-admin-identify-manager-op-local-proof extends the corrected controller-enable surface with one explicit manager-owned admin-command operation: DeviceMmio.brokeredNvmeAdminIdentify (schema @7). The verb carries no parameters; the cap holder may not supply queue addresses, opcode, command id, NSID, PRP/SGL entries, data-buffer address, doorbell offset, or doorbell value. The production device_manager::stub parked-pool slot capacity was extended from two to three simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 3, PARKED_DMABUFFER_SLOTS = [1, 2, 3]) so the admin SQ (slot 1), admin CQ (slot 2), and IDENTIFY data page (slot 3) can stay parked together; the controller-enable sibling, the bounce-buffer grant proof, and the virtio-net live-publish proof keep their existing single- or dual-buffer behavior unchanged. The production grant source’s kernel-mapped BAR window was correspondingly widened from one to two pages (MAPPED_WINDOW_BYTES = 0x2000 under cloud_nvme_admin_identify_proof) so the admin SQ tail (0x1000) and admin CQ head (0x1004) doorbells fall inside the boot-preseeded mapping the manager already uses for CC/AQA/ASQ/ACQ – raw write32 to either doorbell offset still fails closed at the device-manager boundary as devicemmio-write32-register-unclaimed (the offset is outside the reset-only CC selected-write claim), and the brokered admin IDENTIFY verb is the only path that may ring them. The handler nvme_brokered_admin_identify_op_for_cap in kernel/src/device_manager/stub.rs: (1) validates the cap’s BAR matches the parked region and covers both doorbell offsets and the CSTS register; (2) resolves the three parked admin DMABuffers and requires all three to be live, unmapped, and frame-aligned; (3) re-reads CSTS through the boot-preseeded BAR mapping and refuses if CSTS.RDY=0; (4) authors the full submission queue entry at admin SQ index 0 through the HHDM kernel mapping of the SQ page – opcode IDENTIFY (0x06, NVMe Base Spec §5.17), command id 1, NSID 0, MPTR 0, PRP1 = data-page physical address (sourced from the manager’s parked-pool ledger), PRP2 0, CDW10 CNS 0x01 (Controller); (5) issues a SeqCst fence and rings the admin SQ tail doorbell at BAR0 offset 0x1000; (6) polls the admin CQ entry at index 0 through the HHDM kernel mapping of the CQ page for the phase-bit flip (NVMe Base Spec §4.6 CQE DW3 bit 16); (7) inspects the CQE status field (bits 30:17 of DW3) and command-id echo, refusing on either mismatch; (8) parses IDENTIFY Controller VID (offset 0, 2 bytes) and SSVID (offset 2, 2 bytes) through the HHDM kernel mapping of the data page; (9) advances the admin CQ head doorbell at BAR0 offset 0x1004. The provider sees only bounded-status labels, the manager-selected CNS/opcode/command-id echoes, the three parked-slot identities, the parsed VID/SSVID, and the doorbell side-effect labels. The cap-side dispatch admission carries authority_result=ok, result=ok, register_write=performed, side_effect=mmio-write-performed, sq_doorbell_written=true, cq_doorbell_written=true, completion_consumed=true, cq_status=0x0000, prp_source=manager-ledger, and host_physical_user_visible=false. The kernel diagnostic is nvme: brokered-admin-identify owner=cloud-nvme model=cloud-bounce trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 admin_data_slot=3 cns=0x01 opcode=0x06 command_id=0x0001 ... cqe_status=0x0000 cqe_command_id=0x0001 sq_tail=1 cq_head=1 cq_phase=1 identify_vid=0x1b36 identify_ssvid=0x1af4 sq_doorbell_written=performed cq_doorbell_written=performed completion_consumed=true prp_source=manager-ledger host_physical_user_visible=false proof_result=ok (QEMU’s nvme device reports PCI VID 0x1b36 and SSVID 0x1af4, which the harness pins). The cloud-nvme-admin-identify-smoke provider exercises the inherited fail-closed raw-write claims (six in total: AQA/ASQ/ACQ + raw CC.EN=1 + raw admin SQ tail/CQ head doorbells), invokes the controller-enable verb at @6, invokes the admin IDENTIFY verb at @7, and emits one headline cloudboot-evidence: provider-nvme-admin-identify <token> marker plus three supplementary [cloud-nvme-admin-identify-smoke] discipline-* lines that re-anchor the contract within the per-call Console.writeLine bound. Proof: make run-cloud-provider-nvme-admin-identify. Future work (not yet implemented): I/O queue creation, READ/WRITE, Interrupt.wait/Interrupt.acknowledge admin-completion handoff, device-autonomous MSI-X delivery, host-physical/IOVA export, provider-authored SQE/PRP/SGL bytes, provider-authored doorbell offsets/values, and live cloud traffic.

  • cloud-prod-nvme-admin-completion-wait-ack-local-proof moves the admin IDENTIFY completion handoff off manager-internal CQ polling onto the production Interrupt.wait / Interrupt.acknowledge path. The admission-check-only production Interrupt grant (interrupt_grant_source_prod, wait/acknowledge=admission-check-only) is replaced by a fully-programmed cap-waiter MSI-X route on the same NVMe BDF (kernel/src/cap/nvme_admin_completion_wait_ack_proof.rs, table entry 0): its init registers + claims the route under ManagerGrantSource, programs the MSI-X table entry mask-first with the kernel-authored (message_address, message_data), attaches it to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks. The admin IDENTIFY is split into two manager-owned verbs: DeviceMmio.brokeredNvmeAdminSubmit (schema @8) authors the SQE and rings the admin SQ tail doorbell (no CQ consumed, completion_consumed=false), and DeviceMmio.brokeredNvmeAdminComplete (schema @9) polls/consumes the admin CQ (the manager-owned CQ status/CID check is preserved), parses VID/SSVID, and advances the admin CQ head doorbell. Both reuse the shared NvmeBrokeredAdminOpResult schema struct. The handoff state machine is ordered and one-shot: brokeredNvmeAdminSubmit (@8) records the exact live admin SQ/CQ/data slots and generations; Interrupt.wait is admitted once for that submitted state, revalidates those live DMA records, and consumes the wait phase; brokeredNvmeAdminComplete (@9) is admitted only after the wait phase; and Interrupt.acknowledge is admitted once only after the completion phase has been recorded. Hostile complete-before-wait, ack-before-complete, repeat wait, repeat complete, repeat ack, and submit-then-DMABuffer.freeBuffer attempts fail closed before injecting extra dispatch, retiring an extra EOI, or freeing/scrubbing the manager-owned admin pages. Between the two verbs the provider calls Interrupt.wait – which injects exactly one bounded, non-autonomous device_interrupt::handle_lapic_delivery dispatch on the bound route (result=nvme-admin-completion-wait-ack-dispatch-consumed, real_interrupt_delivery=kernel-injected-dispatch, delivery count +1, one deferred LAPIC EOI armed) – then after the completion verb calls Interrupt.acknowledge to retire exactly that deferred EOI (hardware_dispatch_ack_delta=1). The chain is: read-only bind -> reset-only CC.EN=0 -> manager-owned admin buffer materialization -> brokeredNvmeControllerEnable -> brokeredNvmeAdminSubmit (@8) -> admin SQ tail doorbell -> Interrupt.wait wake -> brokeredNvmeAdminComplete (@9) -> admin CQ completion consumed -> admin CQ head doorbell advanced -> Interrupt.acknowledge deferred EOI retired. On Interrupt cap release the kernel requires exactly one observed dispatch, exactly one observed ack, and the terminal acked handoff state, then runs the masked-no-wake + reassign + stale-handle/stale-token assertion chain and emits exactly one headline cloudboot-evidence: provider-nvme-admin-completion-wait-ack <token> marker labeled admin_completion_wake=provider-cap-side-injected device_autonomous_raise=not-claimed. The wake is the same bounded kernel-injected cap-waiter model as make run-cloud-provider-cap-waiter; this proof does not claim a device-autonomously-raised NVMe MSI-X completion interrupt. Proof: make run-cloud-provider-nvme-admin-completion-wait-ack. Future work (not yet implemented): I/O queue creation, READ/WRITE, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-queue-create-local-proof adds the single I/O queue pair (queue id 1) on top of the admin chain. After the combined poll-based admin IDENTIFY (DeviceMmio.brokeredNvmeAdminIdentify @7, VID 0x1b36), the manager authors the two queue-establishing admin commands behind parameterless per-command verbs: DeviceMmio.brokeredNvmeCreateIoCqSubmit (schema @10, opcode 0x05, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 PC=1 IEN=0, PRP1 = manager-owned I/O CQ base page) and DeviceMmio.brokeredNvmeCreateIoSqSubmit (schema @11, opcode 0x01, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 = CQ id 1 | PC<<16, PRP1 = manager-owned I/O SQ base page). The opcode/CDWs are manager-selected; the provider supplies nothing (widening @8 with a command-selector parameter was rejected because it would let a provider author arbitrary admin opcodes). Each SUBMIT verb authors the SQE at the next admin SQ index, rings the admin SQ tail doorbell, and records the in-flight create; the completion of each is consumed through the shared DeviceMmio.brokeredNvmeAdminComplete (@9, now command-aware: it reads the admin CQ entry at the recorded index, checks status/CID, and advances the CQ head doorbell) after one provider-cap-side Interrupt.wait, and the deferred LAPIC EOI is retired by one Interrupt.acknowledge. The cap-waiter route (kernel/src/cap/nvme_io_queue_create_proof.rs) drives two bounded kernel-injected dispatch + deferred-EOI cycles – one per create – and its ordered handoff enforces CREATE I/O CQ before CREATE I/O SQ, one create at a time, with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. The I/O CQ/SQ base pages are manager-owned brokered bounce buffers (parked-pool slots 3/4, userspace slots 4/5); their PRP1 is never exported. On Interrupt cap release the kernel requires both creates completed (CQE status 0), exactly two observed dispatches, two observed acks, the idle terminal handoff, and the masked-no-wake + reassign + stale-handle chain, then emits one cloudboot-evidence: provider-nvme-io-queue-create <token> marker labeled io_queue_create_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=create-only io_read=not-attempted io_sq_doorbell=not-attempted. Proof: make run-cloud-provider-nvme-io-queue-create. Future work (not yet implemented): the I/O SQ tail doorbell (0x1008), READ/WRITE, the I/O data page, the I/O-completion route, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-read-local-proof adds one bounded I/O READ (LBA 0, 1 block, NSID 1) on top of the live I/O queue pair. After the two CREATE I/O queue commands, the manager authors the entire READ SQE behind two parameterless per-command verbs: DeviceMmio.brokeredNvmeIoReadSubmit (schema @12) writes CDW0 (opcode 0x02 | command-id<<16), NSID 1, MPTR 0, PRP1 = manager-owned I/O read-data page (parked-pool slot 5), PRP2 0, SLBA 0 (CDW10/CDW11), NLB 0 = “1 block” (CDW12 bits 15:0) at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008); DeviceMmio.brokeredNvmeIoReadComplete (schema @13) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the read-data page through the kernel mapping, and surfaces a bounded read-data digest (readDataDigestLo/readDataDigestHi = first 8 bytes, readDataLen = transferred length). The provider supplies no opcode/LBA/PRP/ doorbell (a provider write32(0x1008, ...) path was rejected because it would break the no-provider-authored-command discipline; reusing the create/admin verbs was rejected because they are hardwired to the admin SQ/CQ doorbells and ledger). The cap-waiter route (kernel/src/cap/nvme_io_read_proof.rs) drives three bounded kernel-injected dispatch + deferred-EOI cycles – two creates plus one read – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the read-data page is a manager-owned brokered bounce buffer (parked-pool slot 5, userspace slot 6) whose PRP1 is never exported, and the manager reads the completed block bytes only through the kernel mapping. On Interrupt cap release the kernel requires both creates and the read completed (CQE status 0), the verified block bytes (readDataLen > 0 and a non-zero digest), exactly three observed dispatches, three observed acks, the idle terminal handoffs, and the masked-no-wake + reassign + stale-handle chain, then emits one cloudboot-evidence: provider-nvme-io-read <token> marker labeled io_read_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=read io_read=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq plus io_read_block_bytes=<digest> read_data_len=512. The local QEMU smoke seeds the NVMe backing file’s first sector with a known 16-byte pattern so the digest proves an actual byte transfer, not merely that a CQE arrived. Proof: make run-cloud-provider-nvme-io-read. The same marker shape also passed live on GCE run 1780806087-bf69 (make cloudboot-gcp-storage-nvme-io-read-test) against a Persistent Disk NVMe controller with vendor.1ae0, device.001f, live_cloud=gce-persistent-disk, and a 512-byte READ digest prefix eb3c904c494d494e4520200002000000. Future work (not yet implemented): a dedicated I/O-completion Interrupt route on live cloud, WRITE, multi-block/second-LBA reads, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud coverage beyond this one GCE PD read.

  • cloud-prod-nvme-io-write-local-proof adds one bounded I/O WRITE (LBA 0, 1 block, NSID 1) of a fixed manager pattern on top of the live I/O queue pair, proven durable by reading it back. After the two CREATE I/O queue commands, the manager authors the entire WRITE SQE behind two parameterless per-command verbs: DeviceMmio.brokeredNvmeIoWriteSubmit (schema @14) pre-fills the manager-owned I/O write-data page (parked-pool slot 6, userspace slot 7) with the fixed 16-byte signature facefeedcafebabe1122334455667788 repeated across the block, then writes CDW0 (opcode 0x01 | command-id<<16), NSID 1, MPTR 0, PRP1 = that page, PRP2 0, SLBA 0, NLB 0 = “1 block” at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008); DeviceMmio.brokeredNvmeIoWriteComplete (schema @15) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the write-data page through the kernel mapping, and surfaces a bounded written-pattern digest (carried in the shared readDataDigestLo/readDataDigestHi/readDataLen fields). The landed I/O READ (@12/@13) is then reused unchanged to read LBA 0 back into the read-data page (slot 5) at the next I/O SQ/CQ index (1, since the WRITE consumed index 0); the provider supplies no opcode/LBA/PRP/pattern/doorbell, and a new schema field was deliberately avoided – the durability match is computed kernel-side by comparing the written-pattern digest with the read-back digest. The cap-waiter route (kernel/src/cap/nvme_io_write_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles – two creates, one write, one read-back – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the write-data and read-data pages are manager-owned brokered bounce buffers whose PRP1 is never exported. On Interrupt cap release the kernel requires both creates, the write, and the read-back completed (CQE status 0), non-zero digests, exactly four observed dispatches and acks, the idle terminal handoffs, the masked-no-wake + reassign + stale-handle chain, and the read-back digest matching the written pattern, then emits one cloudboot-evidence: provider-nvme-io-write <token> marker labeled io_command=write io_write=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq write_pattern=<digest> write_readback_match=true. The local QEMU smoke seeds the backing file’s first sector with a distinct sentinel so the read-back of the manager pattern proves the WRITE transferred the bytes. Proof: make run-cloud-provider-nvme-io-write. Future work (not yet implemented): a dedicated I/O-completion Interrupt route distinct from the admin/create/read route, multi-block/second-LBA/second-NSID I/O, flush/FUA/DSM, BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-second-lba-local-proof generalizes the manager-authored data path beyond the hardwired SLBA 0: it proves LBA addressing actually selects the block by driving three sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoSecondLbaSubmit @16 / brokeredNvmeIoSecondLbaComplete @17), selected by a kernel-owned phase counter: phase 0 reads LBA 0 (the distinctness baseline, read-data slot 5), phase 1 pre-fills the write-data page (slot 6) with a fixed LBA-1-distinct 16-byte pattern 0123456789abcdeffedcba9876543210 and writes it to LBA 1 (opcode 0x01, CDW10 = SLBA low = 1, CDW11 = SLBA high = 0), and phase 2 reads LBA 1 back. Because the landed @12-@15 verbs hardwire SLBA 0 (and their I/O index depends on the io-write feature), this proof implies only cloud_nvme_io_queue_create_proof and authors its own SQEs at I/O SQ indices 0/1/2; the provider supplies no opcode/LBA/PRP/pattern/doorbell. No schema field was added – the LBA-1 read-back match (LBA-1 read digest == LBA-1 write pattern) and the LBA distinctness (LBA-1 read digest != LBA-0 read digest) are computed kernel-side across the three recorded phase digests. The cap-waiter route (kernel/src/cap/nvme_io_second_lba_proof.rs) drives five bounded kernel-injected dispatch + deferred-EOI cycles (two creates + three I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. On Interrupt cap release the kernel requires both creates and all three phases completed (CQE status 0), non-zero digests, five observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, second_lba_readback_match=true, and lba_distinct_from_zero=true, then emits one cloudboot-evidence: provider-nvme-io-second-lba <token> marker labeled io_command=second-lba io_second_lba=1 second_lba_readback_match=true lba_distinct_from_zero=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. The local QEMU smoke seeds the backing file’s first sector with the distinct sentinel deadbeefcafebabe0102030405060708 so the LBA-0 read returns content distinct from the LBA-1 pattern. Proof: make run-cloud-provider-nvme-io-second-lba. Future work (not yet implemented): a dedicated I/O-completion Interrupt route, multi-block (NLB > 0) I/O, a third LBA or second namespace, flush/FUA/DSM, BlockDevice/filesystem integration (now unblocked by this LBA parameterization), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-multiblock-local-proof generalizes the manager-authored data path beyond one logical block: it proves the authored SQE drives a transfer larger than a single block by driving two sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoMultiblockSubmit @18 / brokeredNvmeIoMultiblockComplete @19), selected by a kernel-owned phase counter: phase 0 pre-fills the write-data page (slot 6) with two distinct 16-byte patterns – block 0 = 112233445566778899aabbccddeeff00, block 1 = f0e1d2c3b4a5968778695a4b3c2d1e0f – over 1024 B and writes both blocks to LBA 2 (opcode 0x01, NLB = block_count - 1 = 1, CDW10 = 2; PRP1 = slot 6, PRP2 = 0, since the 1024 B transfer fits one 4 KiB page), and phase 1 reads LBA 2 back into the read-data page (slot 5). Because the landed @12-@17 verbs hardwire a single block (block_count = 1), this proof implies only cloud_nvme_io_queue_create_proof and authors its own SQEs at I/O SQ indices 0/1; the provider supplies no opcode/LBA/count/PRP/pattern/doorbell. No schema field was added for the second block’s digest – the existing readDataDigestLo/Hi carry block 0’s first 8 bytes for userspace, while the per-block match (read block 0 == written pattern-0, read block 1 == written pattern-1) and the block-distinctness (block 0 digest != block 1 digest, proving the second 512 B block actually transferred) are computed kernel-side across the two recorded phase digest pairs and attested in the headline marker. The cap-waiter route (kernel/src/cap/nvme_io_multiblock_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles (two creates + two I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. On Interrupt cap release the kernel requires both creates and both phases completed (CQE status 0), non-zero digests, four observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, multiblock_block0_match=true, multiblock_block1_match=true, and multiblock_blocks_distinct=true, then emits one cloudboot-evidence: provider-nvme-io-multiblock <token> marker labeled io_command=multiblock io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true multiblock_block0_match=true multiblock_block1_match=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. Proof: make run-cloud-provider-nvme-io-multiblock. Future work (not yet implemented): a dedicated I/O-completion Interrupt route, NLB > 1 requiring a PRP list / second mapped page, a third LBA or second namespace, flush/FUA/DSM, wrapping the brokered READ/WRITE behind a userspace-served BlockDevice cap (now has both LBA selection and >1-block transfer as prerequisites), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-synchronous-poll-read-local-proof collapses the four-call submit/wait/complete/ack NVMe I/O lifecycle into ONE synchronous CapObject::call – the shape BlockDevice.readBlocks @0 requires. It adds two parameterless single-call verbs (DeviceMmio.brokeredNvmeIoSyncWrite @20 / brokeredNvmeIoSyncRead @21), each mirroring the combined brokeredNvmeAdminIdentify @7: the manager pre-fills the write-data page (slot 6) with 112233445566778899aabbccddeeff00 (block 0) and f0e1d2c3b4a5968778695a4b3c2d1e0f (block 1), authors the SQE (WRITE opcode 0x01 at I/O SQ index 0 / READ opcode 0x02 at index 1; NSID 1, SLBA 2, NLB = 1 / two 512 B blocks, PRP1 = data page, PRP2 = 0), rings the I/O SQ tail doorbell (0x1008), polls the I/O CQ entry phase bit to completion within a bounded budget, advances the I/O CQ head doorbell (0x100c), and reads block 0/block 1 back – all inside one cap call, with no Interrupt.wait on the I/O data path. The two CREATE I/O queue commands still complete through the cap-waiter Interrupt.wait/acknowledge path, so the route (kernel/src/cap/nvme_io_sync_read_proof.rs) drives only two bounded kernel-injected dispatch + deferred-EOI cycles. The single-call verbs report sqDoorbellWritten, cqDoorbellWritten, and completionConsumed all true in one result; the block-1 match and block-distinctness are computed kernel-side across the recorded WRITE/READ digest pairs. On Interrupt cap release the kernel requires both creates and both single-call I/O commands completed (CQE status 0), non-zero digests, two observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain, sync_block0_match=true, sync_block1_match=true, and sync_blocks_distinct=true, then emits one cloudboot-evidence: provider-nvme-io-sync-read <token> marker labeled io_command=sync-read io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true sync_block0_match=true sync_block1_match=true sync_blocks_distinct=true io_sq_doorbell=performed io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof: make run-cloud-provider-nvme-io-sync-read. This closes concern (c) of the BlockDevice-shaped read gap (lifecycle collapse) without touching BlockDeviceCap/crate::virtio (concern a), the manager-op routing into a generic cap (concern b), a dedicated I/O-completion Interrupt route, or a PRP list (NLB > 1). Future work (not yet implemented): introducing an NVMe-backed BlockDeviceCap whose readBlocks @0 arm calls this single-call op, a readonly_fs-style consumer over it, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-io-sync-read-block-bytes-local-proof surfaces the full read-back bytes. It adds one verb, DeviceMmio.brokeredNvmeIoSyncReadBytes @22 () -> NvmeBrokeredAdminOpReadBytesResult, that reuses the landed single-call poll-read body (nvme_brokered_io_sync_command) unchanged but returns the entire 1024 B read-back (block 0 ‖ block 1), read through the kernel mapping, as the inline data :Data field of a new narrow result struct – the full-bytes shape BlockDevice.readBlocks @0 -> (data :Data) requires – instead of folding it to an 8-byte digest. The provider issues @20 (WRITE) then @22 (READ-bytes) as two synchronous cap calls and compares the returned data byte-for-byte to the reconstructed manager-authored page, asserting the two 512 B halves differ; the kernel still attests per-block match and distinctness in the release marker. No host-physical/IOVA address crosses the boundary – only the content bytes the caller already authored. The cap-waiter route (kernel/src/cap/nvme_io_sync_read_bytes_proof.rs) is a clone of the sync-read proof that emits one cloudboot-evidence: provider-nvme-io-sync-read-bytes <token> marker labeled io_command=sync-read-bytes io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes data_block0_match=true data_block1_match=true data_blocks_distinct=true io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof: make run-cloud-provider-nvme-io-sync-read-bytes. This closes concern (b) of the BlockDevice-shaped read gap (full-bytes return) without touching BlockDeviceCap/crate::virtio (concern a), arbitrary (startLba, count) parameterization, a dedicated I/O-completion Interrupt route, or a PRP list (NLB > 1). Future work (not yet implemented): an NVMe-backed BlockDeviceCap backend enum whose readBlocks @0 arm calls this op, arbitrary-LBA routing, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic.

  • cloud-prod-nvme-blockdevice-fixed-lba-read-arm-local-proof makes the NVMe namespace consumable through the SAME BlockDevice.readBlocks @0 interface a filesystem consumer calls (proof-fixed LBA arm). It replaces BlockDeviceCap’s bare device_index: usize with a BlockDeviceBackend enum (kernel/src/cap/block_device.rs): the always-built Virtio { device_index } arm (behavior-identical to today, verified by make run-storage-fs) and, under cloud_nvme_blockdevice_read_proof, an NvmeBrokered { handle, owner } arm. The NvmeBrokered arm’s readBlocks @0 accepts ONLY startLba == 2 && count == 2 (fails closed with BlockDevice.readBlocks NVMe arm fixed to SLBA 2 NLB 1 on any other window) and drives the landed nvme_brokered_io_sync_read_bytes_op_for_cap (@22) body into a local 1024 B buffer surfaced as inline Data through the same read_blocks_results builder the virtio arm uses; writeBlocks/flush fail closed (read-only namespace) and info returns the fixed geometry. A new bootstrap grant arm mints the NvmeBrokered cap bound to the SAME live device_mmio handle/owner the production grant source staged (devicemmio_grant_source_prod::live_handle_for_nvme_blockdevice), so the device_mmio grant must precede the block_device grant in the manifest cap list. The provider drives the full bring-up (reset → enable → IDENTIFY @7 → CREATE I/O CQ @10 → CREATE I/O SQ @11 → @20 WRITE) and issues the TERMINAL read through BlockDevice.readBlocks(2, 2) instead of a raw DeviceMmio @22. The cap-waiter route (kernel/src/cap/nvme_blockdevice_read_proof.rs, a clone of the sync-read-bytes proof) emits one cloudboot-evidence: provider-nvme-blockdevice-read <token> marker labeled read_path=blockdevice-readblocks read_iface=BlockDevice read_method=0 io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes nvme_arm_fixed_lba=true arbitrary_lba=not-supported. Proof: make run-cloud-provider-nvme-blockdevice-read. This closes concern (a) of the BlockDevice-shaped read gap (the same schema method, not the bespoke @22 verb), restricted to the proof-fixed window. Future work (not yet implemented): arbitrary (startLba, count) parameterization, NVMe write/flush durability through BlockDevice, a readonly_fs-style filesystem mounted over the NVMe BlockDevice cap, a dedicated I/O-completion Interrupt route, NLB > 1 with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-arbitrary-lba-read-local-proof widens the NvmeBrokered arm off the hardwired SLBA 2 / NLB 1: readBlocks @0 now honors an ARBITRARY (startLba, count) window (read-only, bounded to one PRP1 page). The shared single-call body nvme_brokered_io_sync_command (kernel/src/device_manager/stub.rs) gains explicit slba/block_count fields on SyncIoParams and authors CDW10/CDW11 (SLBA) and CDW12 (NLB) from them instead of the module constants; the transfer length is block_count * 512, bounded fail-closed to one 4 KiB PRP1 page (block_count <= 8) so PRP2 stays 0. The existing @20/@21/@22 callers keep passing the proof-fixed SLBA 2 / count 2, so their behavior is byte-identical (regression: make run-cloud-provider-nvme-io-sync-read-bytes). A new parameterized op nvme_brokered_io_sync_read_window_op_for_cap(handle, owner, slba, count, out_data) rotates the I/O SQ/CQ index off the kernel-side read sequence (window 0 at index 1 after the WRITE at index 0, window 1 at index 2) so each completion is polled at the CQ slot the controller actually writes. Under cloud_nvme_blockdevice_arbitrary_lba_proof (implies and supersedes cloud_nvme_blockdevice_read_proof), BlockDeviceCap::nvme_read_blocks admits any 1 <= count <= 8 window with startLba + count <= namespace blocks (the IDENTIFY-derived NSZE reported through info @2 – 16 MiB / 512 = 32768 on the QEMU fixture image; see the READ-arm graduation entry), and fails closed with distinct errors for count == 0 / count > 8 (... count out of range (1..=8)) and a window past the namespace end (... window past namespace end). The proof (kernel/src/cap/nvme_blockdevice_arbitrary_lba_proof.rs, a clone of the fixed-LBA module) drives the full bring-up plus the @20 WRITE (seeding LBA 2 = pattern-0, LBA 3 = pattern-1), then issues TWO distinct readBlocks windows – readBlocks(0, 1) (zero-filled LBA 0) and readBlocks(3, 2) (LBA 3 = pattern-1, LBA 4 = zero-filled) – comparing each returned data byte-for-byte to the manager-authored content and asserting the two windows return distinct content. It emits one cloudboot-evidence: provider-nvme-blockdevice-arbitrary-lba-read <token> marker labeled arbitrary_lba=supported window0_slba=0 window0_count=1 window1_slba=3 window1_count=2 windows_distinct=true prp_pages=single nvme_arm_fixed_lba=false. Proof: make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read. With this, the NVMe namespace is readable through BlockDevice.readBlocks @0 at the LBA the consumer names. Future work (not yet implemented): NLB spanning more than one PRP1 page (count > 8) with a PRP list, NVMe write/flush durability through BlockDevice, a readonly_fs-style filesystem mounted over the NVMe BlockDevice cap, a dedicated I/O-completion Interrupt route, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-writeblocks-durability-arm-local-proof arms the NvmeBrokered arm’s writeBlocks @1 (read-only until now): it drives the brokered NVMe sync WRITE with the caller-supplied (startLba, count, data) and proves write-then-read-back durability. A new parameterized op nvme_brokered_io_sync_write_window_op_for_cap(handle, owner, slba, count, in_data) (kernel/src/device_manager/stub.rs) mirrors the arbitrary-window READ entry but rotates the I/O index off a kernel-side write sequence (next_write_io_index() => index 0, before the read-back at index 1). The shared single-call body nvme_brokered_io_sync_command gains a third fill mode – a write_payload: Option<&[u8]> that copies the caller’s count * 512 bytes into block 0..count of the manager-owned write-data page (slot 6) through the HHDM mapping before the WRITE SQE is authored – beside the fixed prefill_pattern and the readonly_fs seed_image modes (both unchanged: regressions make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read and make run-storage-fs stay green). Under cloud_nvme_blockdevice_writeblocks_proof (implies and supersedes cloud_nvme_blockdevice_arbitrary_lba_proof), BlockDeviceCap::nvme_write_blocks admits any 1 <= count <= 8 window with startLba + count <= namespace blocks and data.len() == count * 512, failing closed with distinct errors for zero count, over-capacity, past-namespace-end, and length mismatch; info @2 reports readOnly = false; flush @3 stays fail-closed (a real NVMe FLUSH, opcode 0x00, is a distinct verb – see the flush @3 capability below). The proof (kernel/src/cap/nvme_blockdevice_writeblocks_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up then writeBlocks(5, 2, data) with a caller-authored, non-zero, two-distinct-block 1024 B payload, followed by readBlocks(5, 2), comparing the read-back byte-for-byte to the bytes written. It emits one cloudboot-evidence: provider-nvme-blockdevice-writeblocks-durability <token> marker labeled write_path=blockdevice-writeblocks write_method=1 write_slba=5 write_count=2 write_data_len=1024 readback_data_len=1024 write_readback_match=true nvme_arm_read_only=false flush=fail-closed prp_pages=single. Proof: make run-cloud-provider-nvme-blockdevice-writeblocks-durability. No schema/binding change (writeBlocks @1 and readBlocks @0 round-trip through existing bindings). Future work (not yet implemented): a writable_fs / persistent_store consumer mounted over the NVMe BlockDevice write arm, a real NVMe FLUSH on flush @3, a dedicated I/O-completion Interrupt route on the data path, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • ddf-nvme-multiprp-blockdevice-window-local-proof extends the same BlockDevice.writeBlocks @1 / readBlocks @0 round-trip to a three-page NVMe PRP window. Under cloud_nvme_blockdevice_multiprp_window_proof, BlockDeviceCap::nvme_write_blocks and BlockDeviceCap::nvme_read_blocks accept count <= 24 for the local proof geometry while the default and older proof builds keep the one-page count <= 8 bound. The shared nvme_brokered_io_sync_command body resolves primary read/write data pages from parked-pool slots 5/6, a manager-owned PRP-list page from slot 7, read extension pages from slots 8/9, and write extension pages from slots 10/11. For the writeBlocks(5, 24, data) and readBlocks(5, 24) proof window it authors PRP1 as the primary data page and PRP2 as a PRP-list page containing two little-endian page pointers, matching the NVMe PRP-list subset in NVMe Base Specification 1.4 §4.3. The provider still supplies only inline Data through the BlockDevice schema and never sees a host physical address, IOVA, PRP1, PRP2, PRP-list page address, SQE byte, doorbell offset, or doorbell value. Requests with zero count, count 25, namespace overflow, or length mismatch fail closed before any I/O SQ doorbell write. The release marker includes full-transfer FNV-1a hashes for the WRITE and read-back records so the kernel-side proof is not limited to the first two 16-byte block digests; the userspace smoke also compares all 12 KiB byte-for-byte. Proof: make run-cloud-provider-nvme-blockdevice-multiprp-window.

  • cloud-prod-nvme-blockdevice-flush-local-proof arms the NvmeBrokered arm’s flush @3 (fail-closed until now): it authors a real NVMe FLUSH (NVM command-set opcode 0x00, NSID-scoped, no data transfer) through the brokered sync command machinery and proves a writeBlocks then flush returns CQE status 0 and the written block survives the flush. A new parameter-free op nvme_brokered_io_sync_flush_op_for_cap(handle, owner) (kernel/src/device_manager/stub.rs) drives the shared single-call body nvme_brokered_io_sync_command with a FLUSH SyncIoParams (opcode = 0x00, command_id = 8, slba = 0, block_count = 0), rotating the I/O index off a kernel-side flush sequence (next_flush_io_index() => index 1, after the WRITE at 0 and before the read-back at 2). The shared body learns the FLUSH shape (gated on the opcode): it skips the one-PRP1-page data bound, authors the SQE with NSID only and PRP1 = 0/PRP2 = 0/CDW10..15 = 0 (no data page touched), and the WRITE/READ data-bearing path stays byte-identical for non-FLUSH opcodes. Under cloud_nvme_blockdevice_flush_proof (implies and supersedes cloud_nvme_blockdevice_writeblocks_proof, the flush proof’s true sibling, so the write/read arms and the whole brokered I/O chain are reused unchanged), BlockDeviceCap::nvme_flush returns () when the FLUSH was authored + the SQ doorbell rung + the completion consumed + CQE status 0, failing closed otherwise. The proof (kernel/src/cap/nvme_blockdevice_flush_proof.rs, a clone of the writeblocks module) drives the full bring-up then writeBlocks(5, 2, data), flush(), and readBlocks(5, 2), comparing the post-flush read-back byte-for-byte to the bytes written. It emits one cloudboot-evidence: provider-nvme-blockdevice-flush <token> marker labeled flush_path=blockdevice-flush flush_method=3 nvme_flush_opcode=0x00 flush_cqe_status=0 write_then_flush_ok=true flush_data_transfer=none prp1=0 prp2=0 write_readback_after_flush_match=true reboot_persistence=deferred durability_proof=flush-completion-only virtio_flush_regression=green. Proof: make run-cloud-provider-nvme-blockdevice-flush. No schema/binding change (flush @3 () -> () round-trips through existing bindings with its empty params/result). Future work (not yet implemented): an NVMe reboot-persistence pass and crash-consistency where the FLUSH barrier specifically changes the survival outcome (a flushed write surviving a forced poweroff an unflushed one would not), routing File.sync / the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-nvme-blockdevice-reboot-persistence-local-proof closes the reboot-persistence gap the flush proof named first: it proves a normally committed

    • FLUSHED write survives a CLEAN reboot through the same BlockDevice interface, the two-boot analogue of run-storage-persist on the NVMe arm. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-reboot-persistence) creates ONE nvme.raw image and boots the non-qemu cloudboot kernel over it TWICE WITHOUT regenerating it between boots. The provider self-selects its boot phase by probing the LBA 5..6 window through readBlocks(5, 2) @0 – the data window itself is the guard sentinel: boot 1 reads back all-zero (fresh namespace), takes the writer branch, and issues writeBlocks(5, 2, data) @1 + a real flush() @3 (CQE status 0); QEMU restarts against the SAME backing file and boot 2 reads back the known payload, takes NO writer branch, and the single read-back verifies persistence. The proof reuses the landed writeBlocks @1 / flush @3 / readBlocks @0 arms and the brokered sync command machinery unchanged; the only kernel-internal additions are a flat per-boot single-call I/O op log (reworked from the flush proof’s rigid WRITE -> FLUSH -> read-back state machine so the verifier boot can record a READ with no prior WRITE in the same boot), the data-window phase select, and the cross-boot phase=1|2 marker labels. The cap-waiter route + headline marker come from kernel/src/cap/nvme_blockdevice_reboot_persistence_proof.rs (a clone of the flush module under cloud_nvme_blockdevice_reboot_persistence_proof, which implies and supersedes cloud_nvme_blockdevice_flush_proof). Each boot emits one cloudboot-evidence: provider-nvme-blockdevice-reboot-persistence <token> marker carrying its phase (phase=1 ... write_then_flush_ok=true flush_cqe_status=0 boot_role=writer-flush on boot 1; phase=2 ... reboot_persistence_match=true boot_role=verifier durability_proof=clean-reboot-persistence on boot 2). The reboot-persistence gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload). No schema/binding change. Future work (not yet implemented): crash-consistency where the FLUSH barrier specifically changes the survival outcome under an induced mid-flush crash (the analogue of run-storage-writable-recovery), routing File.sync / the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.
  • cloud-prod-nvme-blockdevice-flush-crash-consistency-local-proof covers the flushed-write-survives half of crash-consistency: it proves a normally committed + FLUSHED write survives a FORCED poweroff (an abrupt kill -9 of the QEMU process AFTER the flush barrier completed), the NVMe BlockDevice analogue of run-storage-writable-recovery. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-flush-crash-consistency) creates ONE nvme.raw image, boots the non-qemu cloudboot kernel over it in the BACKGROUND (boot 1: empty namespace -> writer branch -> writeBlocks(5, 2, data) @1

    • real flush() @3, CQE status 0), watches the kernel log for the bounded arming marker [nvme-blockdevice-flush-crash-consistency] kernel: flushed write armed; awaiting forced poweroff, kill -9s the QEMU PID (the forced poweroff AFTER the flush barrier), then boots a SECOND time over the SAME file WITHOUT regenerating it (boot 2: verifier -> single readBlocks(5, 2) @0 read-back). The proof reuses the reboot-persistence predecessor’s two-boot phase select, the landed writeBlocks @1 / flush @3 / readBlocks @0 arms, and the brokered sync command machinery unchanged; the only kernel-internal additions over the predecessor are the phase-1 arm-and-spin window after the flush (on_release emits the arming marker and spins forever so the recipe can kill -9 at that point) and the forced-poweroff marker labels. The cap-waiter route + headline marker come from kernel/src/cap/nvme_blockdevice_flush_crash_consistency_proof.rs (a clone of the reboot-persistence module under cloud_nvme_blockdevice_flush_crash_consistency_proof, which implies and supersedes cloud_nvme_blockdevice_reboot_persistence_proof). Boot 1 emits one cloudboot-evidence: provider-nvme-blockdevice-flush-crash-consistency <token> marker carrying phase=1 ... write_then_flush_ok=true flush_cqe_status=0 armed_forced_poweroff=true boot_role=writer-flush-arm before the spin; boot 2 emits one carrying phase=2 ... flush_survives_forced_poweroff=true boot_role=verifier durability_proof=flush-survives-forced-poweroff. The crash-consistency gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload), AND boot 1 reached the arm-and-spin window (was forcibly killed, did not take the verifier branch). No schema/binding change. Scoped honestly: “an unflushed write rolls back” is NOT provable under QEMU’s -device nvme cache=writeback model (the host page cache survives kill -9), so the differential-rollback half is NOT claimed (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): a dedicated I/O-completion Interrupt route on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. (Both higher-level consumer FLUSH routings are now closed: the writable-fs File.sync half by cloud-prod-nvme-consumer-sync-to-flush-local-proof and the persistent-Store put-commit half by cloud-prod-nvme-persistent-store-sync-to-flush-local-proof, both below.)
  • cloud-prod-nvme-dedicated-io-completion-interrupt-local-proof moves the NVMe BlockDevice.writeBlocks @1 / readBlocks @0 data-completion handoff off the synchronous I/O-CQ poll return path and onto a dedicated data Interrupt route. The spec basis remains NVMe Base Specification 1.4 submission/completion queue doorbells and completion entries (§3 controller registers, §4 queue management and CQ phase/status handling, §6 NVM WRITE / READ commands). The proof keeps table entry 0 for the CREATE I/O CQ/SQ admin completions and adds table entry 1 for the data I/O CQ completions; both routes are kernel-injected cap-waiter MSI-X routes, not a device-autonomous interrupt claim. The implementation entry points are kernel/src/cap/nvme_io_completion_interrupt_proof.rs (init, invoke_wait, invoke_acknowledge, poll_blockdevice_completions, emit_marker), kernel/src/device_manager/stub.rs (nvme_brokered_io_completion_interrupt_submit_op_for_cap, nvme_brokered_io_completion_interrupt_complete_op_for_cap, nvme_io_completion_interrupt_submit_record_buffers_live), and kernel/src/cap/block_device.rs (nvme_interrupt_write_blocks, nvme_interrupt_read_blocks, call_with_context). The manager still authors queue bases and PRP1 from the live DMAPool ledger, copies the caller’s write payload into the parked write-data page, consumes the I/O CQ entry at Interrupt.acknowledge, advances the I/O CQ head doorbell (0x100c), and posts the deferred BlockDevice completion only after the bounded caller CQ has room. The proof (make run-cloud-provider-nvme-io-completion-interrupt) drives writeBlocks(5, 2, data), waits/acks the data route, observes the deferred write completion, then drives readBlocks(5, 2), waits/acks the data route, and receives the read bytes through the standard block_device::read_blocks_results data field. The headline marker cloudboot-evidence: provider-nvme-io-completion-interrupt <token> pins create.entry.0, io.entry.1, data_route_distinct_from_create_route=true, four dispatches/four deferred EOIs, both deferred BlockDevice completions posted, and a byte-for-byte write/read-back match. Scoped honestly: queue-base and PRP addresses remain hidden (host_physical_user_visible=0, iova_export=disabled-future-only, prp_source=manager-ledger); multi-PRP windows (count > 8), provider-written PRP/SGL/address lanes, live cloud, a second namespace, FUA/DSM, and device-autonomous MSI-X delivery remain future work.

  • cloud-prod-readonly-fs-over-nvme-blockdevice-local-proof provides a readonly_fs-style consumer over the NVMe BlockDevice arm: the read-only filesystem mount reads its sectors through the NVMe BlockDevice cap instead of the kernel-owned virtio-blk free functions. kernel/src/cap/readonly_fs.rs gains a BlockSource seam abstracting the two reads a mount needs (device geometry + range read). The always-built Virtio variant routes to the same crate::virtio free functions, so make run-storage-fs stays byte-identical; the Nvme variant (built only under cloud_readonly_fs_over_nvme_proof) reads through a granted NVMe-backed BlockDevice – geometry from the IDENTIFY Namespace claim (see the READ-arm graduation entry), each chunked range read through nvme_brokered_io_sync_read_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe root Directory (granted via read_only_fs_root) defers its mount-parse to the first Directory.open. The proof (kernel/src/cap/readonly_fs_over_nvme_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up, then seeds a tiny CAPOSRO1 image through the repurposed @20 op (one manager-baked sector per call: superblock @ LBA 0, entry table @ LBA 1, file data @ LBA 2), mounts the filesystem over the NVMe BlockSource, opens the one seeded file, reads it, and compares the bytes. It emits one cloudboot-evidence: provider-readonly-fs-over-nvme <token> marker labeled read_path=readonly-fs-over-blockdevice fs_format=CAPOSRO1 block_source=nvme-blockdevice file_match=true superblock_via_nvme=true entry_table_via_nvme=true extent_via_nvme=true after the kernel verifies each read-back block-0 digest against the baked image. Proof: make run-cloud-provider-readonly-fs-over-nvme. The malformed-image fail-closed paths (bad superblock magic, out-of-range entry-table or file extent) are the unchanged shared mount_root_inner/parse_entries validation in kernel/src/cap/readonly_fs.rs – the BlockSource seam swaps only the block-read backend, so the existing MountError checks covered by make run-storage-fs apply identically over the NVMe BlockSource; the NVMe arm additionally rejects an over-range range read with the arbitrary-LBA arm’s fail-closed error. Future work (not yet implemented): a multi-file directory walk / Directory.list traversal over NVMe, files whose extents span many one-PRP1-page chunks, NVMe write/flush durability through BlockDevice (the image is seeded via the manager-owned @20 op, not writeBlocks), a dedicated I/O-completion Interrupt route, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production.

  • cloud-prod-readonly-fs-over-nvme-multifile-dirwalk-local-proof extends the read-only filesystem to multi-file directories: it lists a directory with more than one entry and reads two distinct files over the NVMe BlockDevice cap, one of which spans multiple 4 KiB chunks. The baked image (kernel/src/cap/readonly_fs_over_nvme_multifile_proof.rs, a clone of the single-file module) grows to 12 sectors – superblock @ LBA 0, a two-record entry table @ LBA 1, a one-sector small file @ LBA 2, and a 9-sector large file @ LBA 3..11 carrying a deterministic position-dependent byte pattern. The large File.read covers nine sectors, so the read_range chunk loop issues TWO BlockDevice.readBlocks @0 calls (an 8-sector chunk @ LBA 3 + a 1-sector chunk @ LBA 11) – the multi-chunk path the single-file arm never exercised. The proof identifies each recorded read by (slba, count) and verifies it byte-for-byte with a per-read FNV-1a-64 over the full transfer (computed in device_manager alongside the block digests), so a dropped trailing chunk fails closed. Because per-sector seeding (12) plus the filesystem reads (5) issue 17 single I/O commands and the monotonic I/O SQ/CQ index must stay inside one first CQ pass, the build raises device_manager::stub::NVME_IO_QUEUE_DEPTH from 8 to 32 (create cdw10=0x001f0001); the change is inert for every other NVMe proof build. It emits one cloudboot-evidence: provider-readonly-fs-over-nvme-multifile <token> marker labeled dir_entry_count=2 file_count=2 files_distinct=true large_file_full_match=true large_file_read_blocks_calls=2 superblock_via_nvme=true entry_table_via_nvme=true extents_via_nvme=true (plus the single-file arm’s discipline labels). Proof: make run-cloud-provider-readonly-fs-over-nvme-multifile; the virtio mount path stays byte-identical (make run-storage-fs). Future work (not yet implemented): NVMe write/flush durability through BlockDevice, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages in a single call, sub-directory trees, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production.

  • cloud-prod-persistent-store-over-nvme-blockdevice-local-proof provides a writable consumer over the NVMe BlockDevice write arm: the disk-backed persistent Store mounts over the NVMe BlockDevice write arm and proves a put-then-get durability round-trip. kernel/src/cap/persistent_store.rs gains a read+write BlockSource seam (mirroring readonly_fs::BlockSource but with a write_blocks method). The always-built Virtio variant routes to the same crate::virtio free functions byte-identically (including the data_region_base_lba() installable-disk offset, folded into the variant), so make run-storage-persist stays green; the Nvme variant (built only under cloud_persistent_store_over_nvme_proof) reads through nvme_brokered_io_sync_read_window_op_for_cap and writes through nvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe root Store (granted via persistent_store) defers its mount-parse to the first Store call. The proof (kernel/src/cap/persistent_store_over_nvme_proof.rs, a clone of the writeblocks module) drives the full bring-up, then seeds a CAPOSST1 superblock + empty entry table through the repurposed @20 op (superblock @ LBA 0, entry table @ LBA 1), and exercises the granted Store: Store.put writes the data extent (LBA 2), entry-table sector, and superblock through BlockDevice.writeBlocks @1, and Store.get reads the extent back through BlockDevice.readBlocks @0. The kernel attests the put WRITE and get READ block-0 digests both equal the payload digest and differ from the pre-put (zero) extent, and userspace compares the returned bytes byte-for-byte. It emits one cloudboot-evidence: provider-persistent-store-over-nvme <token> marker labeled write_path=store-put-over-blockdevice-writeblocks read_path=store-get-over-blockdevice-readblocks consumer=persistent-store store_iface=Store block_iface=BlockDevice store_format=CAPOSST1 write_method=1 read_method=0 put_get_roundtrip_match=true durability_attested=true virtio_regression=green. Because the round-trip issues 8 single I/O commands (2 seed WRITEs + 2 deferred-mount READs + 3 Store.put WRITEs + 1 Store.get READ) whose last monotonic CQ head reaches 8 – past the default depth-8 first pass – the build raises device_manager::stub::NVME_IO_QUEUE_DEPTH from 8 to 16; the change is inert for every other NVMe proof build. Proof: make run-cloud-provider-persistent-store-over-nvme. No schema/binding change (Store.put/get/has/delete and BlockDevice.writeBlocks @1/readBlocks @0 round-trip through existing bindings). Future work (not yet implemented): routing the writable filesystem (CAPOSWF1) over the NVMe write arm, a real NVMe FLUSH on flush @3 (stays fail-closed), an NVMe reboot-persistence pass, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.

  • cloud-prod-writable-fs-over-nvme-blockdevice-local-proof mounts the full disk-backed writable filesystem over the NVMe arm (kernel/src/cap/writable_fs.rs, the CAPOSWF1 node-table tree with mkdir / rename / remove and a fail-closed single-writer policy). writable_fs carries a read+write BlockSource seam mirroring persistent_store’s: the Virtio variant (built only in the qemu/ installable storage builds) routes to the crate::virtio free functions byte-identically (folding the data_region_base_lba() offset), so make run-storage-writable / make run-storage-writable-recovery stay green; the Nvme variant (built only under cloud_writable_fs_over_nvme_proof) reads through nvme_brokered_io_sync_read_window_op_for_cap and writes through nvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because writable_fs uses a process-wide singleton volume, the NVMe writable_fs_root grant stages the live device_mmio handle and defers the singleton mount-parse to the first Directory/File call. The proof (kernel/src/cap/writable_fs_over_nvme_proof.rs, a clone of the persistent-store module that supersedes and drops it) seeds a CAPOSWF1 superblock + root + one seeded file through the @20 op, contiguously from LBA 256 (superblock @256, node table @257, seeded file extent @258), then exercises the granted filesystem. READ arm: opening the seeded file triggers the deferred mount, which reads the seeded extent (@258) back through BlockDevice.readBlocks @0; File.read returns the RAM copy the mount loaded. WRITE arm: File.write to a fresh file lands a bump-allocated data extent (@259) + node-record + superblock through BlockDevice.writeBlocks @1. Because File.read serves the RAM content cache the mount loaded (not a fresh disk read), a same-extent disk re-read of a just-written file – which needs a remount/ reboot – is out of scope; the matching block-0 digests prove the same payload traversed both device arms, each device-acked. The single-writer policy is proven intact: a File.write through a second granted writable_fs_root cap fails closed. It emits one cloudboot-evidence: provider-writable-fs-over-nvme <token> marker labeled write_path=file-write-over-blockdevice-writeblocks read_path=file-read-over-blockdevice-readblocks consumer=writable-fs fs_iface=Directory file_iface=File block_iface=BlockDevice fs_format=CAPOSWF1 write_method=1 read_method=0 write_read_roundtrip_match=true durability_attested=true single_writer_policy=enforced second_writer_denied=true recovery_over_nvme=deferred virtio_regression=green. The round-trip issues 11 single I/O commands (3 seed WRITEs + 3 deferred-mount READs + 5 File.write WRITEs); NVME_IO_QUEUE_DEPTH stays 16. Proof: make run-cloud-provider-writable-fs-over-nvme. No schema/binding change. Future work (not yet implemented): the unclean-shutdown / forced- poweroff recovery window (recovery_crash_after_record) over the NVMe arm (the analogue of run-storage-writable-recovery, proved on virtio here), a real NVMe FLUSH on flush @3, an NVMe reboot-persistence pass, a dedicated I/O-completion Interrupt route on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into production.

  • cloud-prod-writable-fs-over-nvme-recovery-local-proof proves the unclean-shutdown / forced- poweroff RECOVERY window (recovery_crash_after_record, the record-sector-written- but-superblock-not-yet-committed window in kernel/src/cap/writable_fs.rs) over the NVMe BlockDevice arm. A new cloud_writable_fs_over_nvme_recovery_proof feature implies (and supersedes the happy-path proof module/route/init of) cloud_writable_fs_over_nvme_proof and widens the storage_writable_recovery crash-window cfg gate so the same recovery-orphan.txt sentinel arms an induced forced poweroff when the writable filesystem is NVMe-backed. The recovery cap-waiter module (kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs) reuses the NVMe BlockSource arm, deferred mount, third WritableFsRoot grant arm, window ops, and I/O queue create unchanged. Unlike the happy-path proof, the CAPOSWF1 image is HOST-BUILT (tools/mkstore-image --writable-nvme lays an empty superblock + root-only node table) rather than seeded through @20: a two-boot SAME-image recovery flow cannot re-seed on pass 2 without clobbering the pass-1 committed state, and the read ordering does not depend on a per-boot seed here (mirroring the virtio run-storage-writable-recovery proof, which also boots a host-built image twice). make run-cloud-provider-writable-fs-over-nvme-recovery boots QEMU twice with -device nvme against one shared raw drive file: pass 1 commits a File.write + sub-directory through writeBlocks @1, allocates the sentinel (its record sector lands on the namespace), and spins; the harness kill -9s QEMU before the superblock commit. Pass 2 boots the SAME file, mounts by reading the old superblock + node table back through readBlocks @0, and asserts the recovered tree omits the orphan slot (exactly the committed entries remain), preserves the committed mutation (file size + content), accepts a usable post-recovery write, and denies a second-grant write (single-writer policy). The userspace smoke emits one cloudboot-evidence: provider-writable-fs-over-nvme-recovery <token> marker labeled crash_window=record-written-superblock-uncommitted orphan_slot_ignored=true committed_mutation_survived=true post_recovery_write_ok=true recovery_over_nvme=true single_writer_policy=enforced durability_basis=host-page-cache real_flush=deferred reboot_persistence=deferred io_completion=polled interrupt_wait=not-used-on-data-path virtio_recovery_regression=green live_cloud=not-attempted; the kernel proof module’s on_release independently attests the cap-waiter route lifecycle + the two CREATE I/O queue dispatch/ack cycles. The two CREATE I/O queue commands keep their production Interrupt.wait/acknowledge cap-waiter cycles; the data path stays polled. Bounded-proof caveat: one record-vs-commit window, host-page-cache durability (the two passes share one backing file; a kill -9 preserves the host page cache), NOT media crash-consistency and NOT a real NVMe FLUSH barrier. No schema/binding change. Future work (not yet implemented): a real NVMe FLUSH on flush @3, an NVMe clean- reboot-persistence pass, NLB > 1 spanning multiple PRP pages, a dedicated I/O-completion Interrupt route on the data path, and graduating the NVMe data plane into production. Proof: make run-cloud-provider-writable-fs-over-nvme-recovery.

  • cloud-prod-nvme-consumer-sync-to-flush-local-proof routes a consumer-level File.sync @4 to a real BlockDevice.flush @3 NVMe FLUSH media barrier instead of a write-side no-op. writable_fs::BlockSource carries a flush() arm (the Virtio variant returns Ok(()) – the driver negotiates no VIRTIO_BLK_F_FLUSH, so virtio File.sync stays a byte-identical no-op and make run-storage-writable stays green; the Nvme variant drives nvme_brokered_io_sync_flush_op_for_cap with the same success predicate the read/write arms apply), and File.sync @4 (writable_fs.rs) routes through it AFTER the claim_writer gate. The feature cloud_nvme_consumer_sync_to_flush_proof composes cloud_nvme_blockdevice_flush_crash_consistency_proof (arming the real flush @3 op-for-cap) and cloud_writable_fs_over_nvme_proof (the consumer arm), dropping both predecessors’ proof modules. The proof (kernel/src/cap/nvme_consumer_sync_to_flush_proof.rs) seeds the CAPOSWF1 image, then Directory.open(CREATE) + File.write (through writeBlocks @1), then File.sync() – which issues the real NVMe FLUSH (opcode 0x00, CQE status 0, no data transfer: PRP1 = 0, PRP2 = 0) – then File.read confirming the bytes survive. The single-writer policy is shown intact: a File.sync through a second granted cap fails closed BEFORE any FLUSH is issued (denied_sync_issues_no_flush=true), and the kernel asserts exactly one consumer-sync FLUSH (status 0) was recorded. It emits one cloudboot-evidence: provider-nvme-consumer-sync-to-flush <token> marker labeled consumer_sync_path=File.sync-to-nvme-flush sync_method=4 flush_method=3 nvme_flush_issued_by_consumer_sync=true nvme_flush_opcode=0x00 flush_cqe_status=0 write_sync_read_roundtrip_match=true single_writer_policy=enforced durability_proof=consumer-sync-issues-real-flush virtio_sync_noop=byte-identical. The round-trip issues 12 single I/O commands (3 seed WRITEs + 3 mount READs + 5 File.write WRITEs + 1 File.sync FLUSH); NVME_IO_QUEUE_DEPTH stays 16. Proof: make run-cloud-provider-nvme-consumer-sync-to-flush. No schema/binding change. The bounded claim is consumer-sync-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model; the cross-boot forced-poweroff differential stays as the crash-consistency proof established it). Future work (not yet implemented): routing the persistent Store’s commit path through the FLUSH, graduating the NVMe data plane into production, NLB > 1 spanning multiple PRP pages, and a dedicated I/O-completion Interrupt route on the data path.

  • cloud-prod-nvme-persistent-store-sync-to-flush-local-proof routes the persistent Store’s put-commit path to a real BlockDevice.flush @3 NVMe FLUSH media barrier. The Store has NO sync schema method (put @0 / get @1 / has @2 / delete @3 only), so the routing point is the existing put-commit path, not a new method: persistent_store::BlockSource gains a flush() arm (the Virtio variant returns Ok(()) – no VIRTIO_BLK_F_FLUSH, so the virtio Store.put commit stays a byte-identical no-op and make run-storage-persist stays green; the Nvme variant drives nvme_brokered_io_sync_flush_op_for_cap with the same success predicate the read/write arms apply, but only when the flush lineage is composed so the plain make run-cloud-provider-persistent-store-over-nvme commit stays a no-op), and put_blob (persistent_store.rs) issues it AFTER the flush_superblock write (the ordering commit point) succeeds. A FLUSH that fails closed rolls back the in-RAM entry_count/next_free_sector so no live index insert occurs. The feature cloud_nvme_persistent_store_sync_to_flush_proof composes (and drops/supersedes) the cloud_nvme_consumer_sync_to_flush_proof lineage (transitively the real flush @3 op and the persistent-store-over-NVMe read+write seam). The proof (kernel/src/cap/nvme_persistent_store_sync_to_flush_proof.rs) seeds the CAPOSST1 image, then Store.put(data) – which writes the data extent / entry sector / superblock through writeBlocks @1, then issues the real NVMe FLUSH (opcode 0x00, CQE status 0, no data transfer: PRP1 = 0, PRP2 = 0) after the superblock commit – then Store.get(hash) confirming the bytes survive. The kernel asserts exactly one Store-commit FLUSH (status 0) recorded AFTER the superblock write (superblock_commit_before_flush=true) and emits one cloudboot-evidence: provider-nvme-persistent-store-sync-to-flush <token> marker labeled consumer_commit_path=store-put-to-nvme-flush put_method=0 flush_method=3 nvme_flush_issued_by_store_commit=true nvme_flush_opcode=0x00 flush_cqe_status=0 superblock_commit_before_flush=true put_get_roundtrip_match=true failed_flush_issues_no_live_entry=true virtio_commit_noop=byte-identical durability_proof=store-commit-issues-real-flush. The round-trip issues 9 single I/O commands (2 seed WRITEs + 2 mount READs + 3 Store.put WRITEs + 1 put-commit FLUSH + 1 Store.get READ). Proof: make run-cloud-provider-nvme-persistent-store-sync-to-flush. No schema/binding change. The bounded claim is store-commit-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): graduating the NVMe data plane out of the per-proof features into always-built production, with the dedicated I/O-completion Interrupt route on the data path and NLB > 1 spanning multiple PRP pages.

  • cloud-prod-nvme-sync-io-state-seam-always-built-local-proof extracts the brokered-NVMe synchronous-I/O state the shared op body depends on into ONE always-built module device_manager::nvme_sync_io_state (kernel/src/device_manager/nvme_sync_io_state.rs), compiled in the default no-proof cargo build (not behind any cloud_nvme_*_proof feature). The seam owns: the functional I/O SQ/CQ reservation cursor (reserve_io_slot() = a queue-depth-bounded first-pass slot reservation, with one live in-flight reservation, so a single-call command cannot reuse a stale CQE before the created queue wraps); the admissions predicate (sync_{read,write,flush}_admitted = bounded-ledger-not-full plus no active reservation); and the ordered SyncIoRecord op-log ledger (record_sync_{read,write,flush}, DIGEST_BYTES, op kind implied by record.opcode = 0x00 FLUSH / 0x01 WRITE / 0x02 READ). The shared body nvme_brokered_io_sync_command and the nvme_brokered_io_sync_{read_window,write_window,flush}_op_for_cap / nvme_brokered_io_sync_read_bytes_op_for_cap entries (kernel/src/device_manager/stub.rs) now record/admit/index through this seam instead of the per-proof nvme_io_proof alias; the alias stays only for the genuinely per-proof create/io-phase/seed ledgers (record_create_*, record_io_*, next_io_phase, next_seed_slba, seed_image_sector). All 15 NVMe BlockDevice proof modules (kernel/src/cap/*_proof.rs) delegate to the seam: each deletes its private SyncIoRecord/SyncHandoff/record/admit/index copy and reconstructs its release-marker view from the seam’s ordered op-log snapshot, byte-identical. The create-ordering / write-before-read orderings the per-proof harnesses formerly folded into their admit predicates are proof-harness assertions each proof still re-derives in its release marker; the always-built admit keeps only the production-honest bounded-ledger invariant (a real read needs no prior write). Code-location refactor only – no schema/binding change, no new device behavior; default cargo build and cargo build --features qemu stay warning-free (the seam carries a module-level #[allow(dead_code)] for its dormant not-yet-activated entry points). This UNBLOCKS the read-arm graduation: the read body’s sync-I/O symbols now resolve in the default build, so the graduation can make the read body always-built and gate activation behind a fail-closed runtime probe while the proof exercises the same always-built seam. Proof: every existing NVMe BlockDevice proof stays green (make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read and the rest of the chain) and make run-net is byte-identical.

These brokered capabilities target the no-IOMMU QEMU/GCP lane, where queue-base and PRP addresses are materialized by the kernel/device manager from the live ledger. On a direct-remapping/vIOMMU gate the provider-written validator model (nvme-userspace-bind-and-controller-bringup) applies instead. The PCI metadata-only discovery summary (pci: nvme metadata ...) that also runs on make run-pci-nvme is the separate enumeration-evidence surface in kernel/src/pci.rs.

1. Spec basis

  • Device: NVM Express PCI controller. PCI class 0x01 (mass storage), subclass 0x08 (NVM), programming interface 0x02 (NVM Express). Detected by PciDevice::is_nvme_controller (kernel/src/pci.rs, NVME_CLASS_MASS_STORAGE / NVME_SUBCLASS_NVM / NVME_PROG_IF_NVM_EXPRESS). QEMU instance: -device nvme,drive=...,serial=... on the q35 machine.
  • Authoritative spec: NVM Express Base Specification (NVMe 1.4 / 2.0). The fields the validator relies on:
    • Controller registers CAP, CC (with CC.EN controller enable), AQA, ASQ, ACQ (NVMe Base Spec §3.1 controller register map). ASQ/ACQ base addresses have bits 11:0 reserved → 4 KiB page-aligned.
    • Submission/completion queue base addresses and the per-queue doorbell registers in the doorbell stride region (§3.1.x, §7.6 queue setup).
    • Physical Region Page (PRP) entries PRP1/PRP2 and the PRP List (§4.3): PRP list pages and list-pointer PRP2 entries are page-aligned; a transfer that needs more than one PRP list page chains a further list, which this bounded subset does not follow.
  • Reference driver (optional cross-check): the Linux drivers/nvme/host/ queue-setup and PRP-build paths (nvme_setup_prps, nvme_pci_configure_admin_queue).

2. Wire format (validator-relevant subset)

The validator reads only the device-visible addresses a single doorbell newly publishes, plus the byte extent of the region each names. It does not decode command opcodes, data payloads, or completion entries. Scanned items are modeled by ScanItem (kernel/src/cap/nvme_doorbell_validator.rs).

  • Queue-base registers (ScanItem::QueueBase): the ASQ/ACQ admin queue bases (scanned on the CC.EN / queue-arm write, ScanKind::QueueArm) and the I/O SQ/CQ bases. The named region is entries × entry_size bytes (e.g. an admin SQ is depth × 64, an admin CQ is depth × 16). Required alignment: page (4 KiB).
  • SQ entry PRP pointers (ScanItem::Prp): for each NVMe command newly made visible by an SQ tail doorbell (ScanKind::SqTailDoorbell), the PRP1 data pointer, the PRP2 data-or-list pointer, and one level of PRP-list indirection. list_depth counts indirection already followed (0 = a PRP carried in the SQE, 1 = a pointer inside the single PRP list page); list_depth > 1 is the out-of-subset deeper-chain case and fails closed (MAX_PRP_LIST_DEPTH). The named region is the transfer length (PRP data) or one page (a PRP list). Required alignment: page (4 KiB).

The scan is on-notify only: the provider may freely write its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, which is the single choke point the validator guards. Cost is O(descriptors published by this doorbell).

3. capOS mapping

The validator is the kernel half of the Model B genuine-userspace-driver model (docs/proposals/nvme-model-b-doorbell-dma-validator.md): the provider writes the device-visible queue-base and PRP addresses itself, and the kernel validates them on the doorbell path rather than minting them (Model A, the unchanged virtio-net TX path).

  • Authority gate: the live doorbell-path hook derives the owning provider’s identity (OwnerToken) and live grant generation from the DeviceMmio grant record in the device-manager ownership ledger, never from provider-supplied bytes. The cfg(qemu) self-test resolves owner/generation from synthetic windows only.
  • DeviceMmio: the validator is invoked from the pre-write step of the NVMe doorbell/queue-arm selected-write DeviceMmio claim (kernel/src/cap/device_mmio.rs; the existing notifyDoorbell path is the Model A virtio-net claim and does not trigger the validator). The scan completes — accept or reject — before the doorbell write is allowed to take effect, so the device never sees an unvalidated descriptor batch. BAR0 / doorbell pages stay device-uncacheable, NX, capability-scoped.
  • DMAPool / window descriptor: for a direct-remapping/vIOMMU lane, a DmaWindow can name the owner’s domain-scoped IOVA range with a live generation, and provider-written values can be checked against that range. On the current no-IOMMU lane, there is no provider-visible non-host-physical device-address namespace; the manager owns the physical bounce pages and must materialize queue-base and PRP/SGL fields itself. See docs/dma-isolation-design.md (Provider-Written Addresses And No-IOMMU Brokered Bounce).
  • Interrupt: completion_wakes_waiter enforces the stale-completion gate — a completion wakes a waiter only if its submission scan was accepted and the generation it was validated under is still live; an unvalidated or retired-generation completion does not wake a waiter.
  • Fail-closed / validation rules (ScanReject, all reject with no doorbell write and no waiter wake): out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deep-prp-chain, stale-generation, and invalid-region. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even when the byte value would have been in-window for the prior grant.
  • QEMU-emulable vs hardware-only: the validator mechanism and its hostile-scan invariants are end-to-end provable in QEMU (make run-pci-nvme, the nvme: validator ... proof lines). Live controller bring-up over a real NVMe controller — admin/I/O queue creation, IDENTIFY, and a bounded read with the validator gating the real doorbell — is QEMU-emulable too and is covered by the brokered bring-up capabilities (§4-§9), not the validator mechanism itself.

4. Userspace bind (read-only controller bring-up)

nvme-bind-claimed-mmio-read stands up the userspace storage-provider bind foundation over the existing DDF driver foundation. It binds the controller read-only – no register write, no DMA submission, and no doorbell – and so leaves the controller’s existing (firmware-initialized) state untouched.

  • Enumerate → claim → BAR0 preseed: bind_qemu_nvme_controller (kernel/src/pci.rs) runs for the first enumerated NVMe controller after the metadata summary. It preseeds the first decoded memory BAR (BAR0 controller registers) for brokered reads (devicemmio_grant_source::preseed_read32_for_device), then claims the function and parks it under DeviceOwner::ManagerGrantSource. On any staging failure it falls back to the pci: nvme no-authority/no-driver line (fail-closed): a partially-staged authority surface is never advertised.
  • Grant-source staging: the same device-agnostic {devicemmio,dmapool,interrupt}_grant_source::init_for_device the virtio path uses stage the bootstrap grants against the claimed NVMe handle. The virtio-net-specific provider-notify/doorbell selected-write claim is not staged here — that is the controller-enable path (§6). run-pci-nvme boots with no virtio devices, so the singletons are free for NVMe; run-net / run-ddf-provider-consumer (no -device nvme) keep the virtio bind untouched.
  • Brokered register reads: the nvme-bringup-smoke provider (demos/nvme-bringup-smoke/) holds the manifest-granted console/dmapool/device_mmio/interrupt caps and reads CAP (0x0, 0x4), VS (0x8), CC (0x14), and CSTS (0x1c) through DeviceMmio.read32 (the brokered boot-preseeded mapping in device_manager::read_devicemmio_u32). It proves the bound claim reaches a coherent NVMe BAR0 by requiring a live CAP (non-zero, non-floating) and a valid VS version (NVMe Base Spec §3.1.1/§3.1.2; QEMU reports 1.4.0), and reports the observed CC.EN/CSTS.RDY (§3.1.5/§3.1.6, bit 0 of each).
  • Firmware-initialized controller: under QEMU’s SeaBIOS BIOS boot, the NVMe boot-probe enables the controller (CC.EN=1, CSTS.RDY=1) before init runs, so the read-only bind observes a live controller. Bringing it to a known reset state (CC.EN=0, wait CSTS.RDY=0) before re-enabling with provider-owned admin queues is the controller-enable path’s responsibility (§6), not this read-only bind.
  • Proof line: the userspace [nvme-bringup-smoke] controller-bind ok ... mmio_read=brokered controller_state=firmware-enabled read proof, asserted by tools/qemu-pci-nvme-smoke.sh. The kernel bind line advanced from controller_init=read-only-bind to controller_init=reset-capable-bind when §5 added the CC selected-write claim to the same grant staging; the read proof itself is unchanged.

5. Userspace controller reset (selected-write CC claim)

nvme-controller-reset-selected-write is the first genuine userspace NVMe controller-register write: it brings the firmware-enabled controller to a known reset state. It does not enable the controller, program admin/IO queue bases, submit DMA, or ring a doorbell – controller enable publishes admin queue-base addresses and is the validator-gated path in §6.

  • NVMe CC selected-write claim: the NVMe bind stages the DeviceMmio grant region with a reset-only selected-write claim (DeviceMmioWrite32ClaimProvider::NvmeControllerRegister, device_manager::nvme_controller_register_grant_regionproofs::nvme_controller_register_region), scoped to the CC register (0x14, NVMe Base Spec §3.1.5). bind_qemu_nvme_controller (kernel/src/pci.rs) now calls devicemmio_grant_source::init_nvme_controller_for_device instead of the plain init_for_device, so the single granted cap carries both the brokered read surface and the CC write claim.
  • Value-flexible, scoped: unlike the virtio variants (which pin an exact (offset, value) pair), the NVMe claim is offset-scoped and value-flexible only for reset in validate_devicemmio_write32_claim (kernel/src/device_manager/qemu_full.rs): it admits any CC write whose CC.EN (bit 0) is clear – the read-modify-write reset – directly, fails closed on raw CC.EN=1 writes with devicemmio-nvme-cc-enable-raw-blocked, and fails closed on any write to a non-CC offset (unclaimed-register-write). Refused writes perform no MMIO.
  • Reset sequence: the nvme-bringup-smoke provider reads CC, writes it back with CC.EN cleared through DeviceMmio.write32 (the volatile MMIO write in device_manager::write_devicemmio_u32, resolved through the boot-preseeded BAR0 mapping), and polls CSTS (§3.1.6) until CSTS.RDY clears. QEMU clears CSTS.RDY synchronously on the CC.EN=1→0 write (nvme_ctrl_reset).
  • No DMA validator involvement: a reset write (CC.EN clear) publishes no queue-base or PRP addresses, so the Model B on-notify validator (§2/§3) is not invoked. The validator is invoked on the explicit brokered controller-enable queue-arm path (§6).
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): pci: nvme userspace-bind ... controller_init=reset-capable-bind ... cc_selected_write=staged (kernel), [nvme-bringup-smoke] cc-raw-enable-refused ..., [nvme-bringup-smoke] non-cc-write-refused ..., and [nvme-bringup-smoke] controller-reset ok ... csts_rdy_before=1 csts_rdy_after=0 cc_en_after=0 reset_write=performed ... (userspace).

6. Brokered controller enable (no-IOMMU, manager-authored admin queues)

nvme-no-iommu-brokered-controller-enable enables the controller on the no-IOMMU make run-pci-nvme gate without exporting a host physical (== the device-visible address on the bounce shape) or raw IOVA to the provider. The provider invokes the explicit no-parameter DeviceMmio.brokeredNvmeControllerEnable verb (schema @6); the manager authors every address-bearing register and the selected CC value from its live DMA ledger. Raw DeviceMmio.write32(CC, value with CC.EN=1) fails closed before any MMIO side effect. The earlier provider-written Model B enable (nvme-userspace-bind-and-controller-bringup) stays blocked: it would require the provider to author a device-visible queue-base, which the reviewed iova_export=disabled-future-only discipline forbids on this gate.

  • Brokered admin queue memory: the provider allocates the admin submission and completion queue pages through the device’s DMAPool authority (DmaPool.allocateBuffer), maps each read-write to fill it, and unmaps. By convention the manager reads the admin SQ from pool slot 0 (NVME_ADMIN_SQ_POOL_SLOT) and the admin CQ from slot 1 (NVME_ADMIN_CQ_POOL_SLOT). The pages stay live in the manager ledger; DmaBuffer.info continues to report device_iova=0, iova_export=disabled-future-only, host_physical_user_visible=false.
  • Manager-authored queue-base registers: the @6 method dispatches through nvme_brokered_controller_enable_op_for_cap (kernel/src/device_manager/qemu_full.rs), which validates the cap against the live NVMe controller-register claim and then calls nvme_brokered_admin_queue_enable. It resolves the admin SQ/CQ pages (record.attached_dmapools[..].proof_buffers[slot].page), then authors AQA (0x24, zero-based admin queue sizes), ASQ (0x28/0x2c), and ACQ (0x30/0x34) from the ledger page physical addresses via nvme_authored_register_write (volatile writes resolved only through the boot-preseeded BAR0 mapping), and finally performs the manager-selected CC.EN | IOSQES=6 | IOCQES=4 write. No provider-supplied controller bits or address-bearing value reaches the controller.
  • Validator on the queue-arm path: before any register write the authored ASQ/ACQ bases are passed through the Model B on-notify DMA validator (crate::cap::nvme_doorbell_validator::validate_doorbell_scan, ScanKind::QueueArm). On this path the windows and the scanned items both derive from the same kernel ledger pages, so it is a self-consistency check on the kernel-authored bases: it proves page alignment and in-window containment of the named queue region (entries * entry_size – the admin SQ 128 B / CQ 32 B fit the 4 KiB page). The owner-identity, cross-owner-alias, host-physical, and stale-generation rejections are structurally unreachable here because both sides of each comparison come from the manager (the real authority gate against a stale/foreign page is the live-ledger membership check below); those hostile rejections are exercised by the bounded cfg(qemu) self-test (§3). A reject still fails closed before the CC.EN write.
  • Fail-closed before enable: raw write32(CC, CC.EN=1) returns devicemmio-nvme-cc-enable-raw-blocked before any MMIO. The explicit manager-op’s real authority gate is live-ledger membership – an enable request with the admin queue pages unallocated, freed, or in-flight returns nvme-admin-queues-not-armed (devicemmio-nvme-cc-enable-not-armed) with no MMIO side effect, covering the out-of-order manager operation and the post-free stale re-enable. A validator reject returns devicemmio-nvme-cc-enable-validator-reject.
  • Teardown under live admin queues: reset (CC.EN=0) quiesces the controller (CSTS.RDY clears) before the admin queue pages are reused; DmaBuffer.freeBuffer then scrubs each page before the frame is freed (page_scrubbed_before_frame_free=true), and a subsequent enable with the queue memory gone fails closed. The enable path submits no admin commands, so there are no live completions or waiters during teardown; the “stale/unvalidated completion does not wake a waiter” property (completion_wakes_waiter) is proven by the bounded cfg(qemu) self-test (§3), not by the live admin-queue teardown.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): [nvme-bringup-smoke] admin-queue-allocated ... (userspace), nvme: brokered-enable owner=nvme-storage trigger=manager-op admin_sq_slot=0 admin_cq_slot=1 validator=queue-arm scanned_items=2 aqa=0x00070007 cc=0x00460001 asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false ... (kernel), [nvme-bringup-smoke] controller-enable ok ... cc_en_after=1 csts_rdy_after=1 ... brokered_enable_trigger=manager-op ..., [nvme-bringup-smoke] teardown-reset ok ... quiesced=true, [nvme-bringup-smoke] admin-queue-freed ... page_scrubbed_before_frame_free=true, and [nvme-bringup-smoke] stale-enable-refused ... brokered_enable_trigger=manager-op reason=nvme-admin-queues-not-armed (userspace).
  • Not in scope (this path): I/O queues, read/write commands, cloud evidence, and host-physical/IOVA export are out of scope for the enable path. hostile_hardware_isolation=not-claimed; the brokered no-IOMMU enable is not hostile-hardware isolation. One brokered IDENTIFY admin command is in §7.

7. Brokered admin command + IDENTIFY (no-IOMMU)

nvme-admin-queue-identify extends the brokered no-IOMMU lane to one admin command. After the §6 enable, the provider submits a single IDENTIFY (controller) admin command and consumes its completion from its own mapped admin CQ. As on the enable path, the manager authors every address-bearing field; the provider supplies only the non-addressing command dwords and the doorbell index. nvme-admin-interrupt-delivery then makes that completion interrupt-driven: the provider unmasks the admin completion interrupt route and blocks on Interrupt.wait, the kernel wakes the live waiter through the device-interrupt dispatch path, and only then is the completion consumed from the mapped CQ.

  • Admin command (wire subset): a 64-byte submission queue entry (NVMe Base Spec §4.2). The provider writes opcode 0x06 (IDENTIFY, §5.17) at byte 0, a command id at bytes 2:3, NSID=0, and CNS=0x01 (Identify Controller) in CDW10 at bytes 40:43 into the mapped admin SQ page. It leaves the address-bearing MPTR (bytes 16:23), PRP1 (bytes 24:31), and PRP2 (bytes 32:39) zero; the manager overwrites them. The IDENTIFY data structure is 4096 bytes, so a single page-aligned PRP1 covers it and PRP2 stays zero.
  • Doorbells: the same NvmeControllerRegister claim covers the admin SQ tail doorbell (0x1000) and CQ head doorbell (0x1004) – admin queue 0 with doorbell stride CAP.DSTRD=0 (NVMe Base Spec §3.1.24/§3.1.25; nvme_brokered_admin_sq_doorbell re-reads CAP.DSTRD and fails closed on a non-zero stride). The doorbell value (the tail/head index) is not address-bearing; the manager bounds it to <= NVME_ADMIN_QUEUE_DEPTH and performs the write.
  • Manager-authored PRP on the submit path: a write to the SQ tail doorbell is routed by validate_devicemmio_write32_claim (NvmeBrokeredWriteOp::AdminSqTailDoorbell) to nvme_brokered_admin_sq_doorbell (kernel/src/device_manager/mod.rs). It resolves the live admin SQ (slot 0), CQ (slot 1), and IDENTIFY data (slot 2, NVME_ADMIN_DATA_POOL_SLOT) ledger pages, authors the SQE’s MPTR/PRP1/PRP2 from the data page physical address through the SQ page’s HHDM mapping (nvme_author_admin_sqe_prp), fences, then rings the doorbell via nvme_authored_register_write. No provider-supplied address reaches the controller.
  • Validator on the SQ-tail path: before authoring the SQE the data-buffer PRP1 is passed through the Model B on-notify DMA validator (validate_doorbell_scan, ScanKind::SqTailDoorbell, ScanItem::Prp { list_depth: 0 }): page alignment and in-window containment of the 4 KiB data region against the data page’s own device-visible window. A reject returns devicemmio-nvme-admin-submit-validator-reject with no doorbell write. As on the queue-arm path the window and scanned item are both manager-derived, so the hostile owner/host-physical/stale rejections are exercised by the cfg(qemu) self-test (§3); the live authority gate is live-ledger membership.
  • Interrupt-driven completion wake (nvme-admin-interrupt-delivery): after ringing the SQ tail doorbell, the provider unmasks the admin completion interrupt route and blocks on Interrupt.wait. The route is the NVMe controller’s bootstrap Interrupt grant (DeviceOwner::ManagerGrantSource, MSI-X table entry 0, role GrantSource); the kernel wakes the live waiter with a real LAPIC dispatch routed through the device-interrupt dispatch slot plus the deferred-EOI and waiter path – the same grant-source delivery model proven by make run-interrupt-grant and used by the virtio-net provider. The wait returns result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false with the route’s dispatch delivery_count incremented. This is a kernel-injected dispatch at the route’s programmed LAPIC vector, not a device-autonomous MSI-X raise: the NVMe MSI-X table is not yet hardware- programmed for an external write (msix_table_programming=not-written, as on the DDF interrupt-grant path), so device-raised MSI-X delivery and MSI-X table programming remain a documented next increment.
  • Completion consume: only after the interrupt-driven wake does the provider read completion queue entry 0 in its mapped admin CQ page (NVMe Base Spec §4.6): a 16-byte entry whose DW3 (bytes 12:15) carries the command id (bits 15:0), the phase tag (bit 16), and the status field (bits 31:17). It checks the status field is success and the command id matches, confirms the controller DMA-wrote the data structure (non-zero PCI Vendor ID at IDENTIFY byte 0; QEMU’s nvme reports 0x1b36), then advances the CQ head doorbell. The completion is thus consumed after an interrupt-driven wake, with the mapped-CQ read as the consume step.
  • Stale/post-reset no-wake: after the IDENTIFY completes, a second live Interrupt.wait waiter is installed on the (driver-unmasked) route, observed to stay pending, and the route is then masked; the live waiter completes result=interrupt-waiter-cancelled reason=route-masked rather than woken (masked_live_waiter_woke=false). At the kernel layer the stale/unvalidated/retired-generation completion no-wake invariant (completion_wakes_waiter) remains proven by the cfg(qemu) self-test (§3, waiter_wake=none).
  • Fail-closed before submit: a SQ tail doorbell with the admin SQ/CQ or data page unallocated, freed, or in-flight returns nvme-admin-command-not-armed (devicemmio-nvme-admin-submit-not-armed) with no MMIO side effect; an out-of-range doorbell index returns devicemmio-nvme-admin-doorbell-out-of-range. Teardown frees the data page first and proves the post-free re-submit fails closed.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): nvme: admin-submit owner=nvme-storage admin_sq_slot=0 admin_data_slot=2 validator=sq-tail-doorbell scanned_items=1 command=identify-controller prp1_authored=true ... doorbell_written=performed host_physical_user_visible=false (kernel), [nvme-bringup-smoke] admin-interrupt-route-unmasked ... route_state_after=driver-unmasked, [nvme-bringup-smoke] admin-interrupt-wake result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false ... interrupt_driven_wake=delivered (userspace, the interrupt-driven wake), [nvme-bringup-smoke] identify-complete ok command=identify-controller cid=0x0042 status=0x0000 phase=1 ... identify_vid=0x... completion_consumed= mapped-admin-cq-after-interrupt-wake ... (userspace), nvme: admin-complete-ack ... cq_head=1 ... address_bearing=false, [nvme-bringup-smoke] identify-cq-head-advanced ..., [nvme-bringup-smoke] admin-interrupt-stale-no-wake ... result=interrupt-waiter-cancelled ... masked_live_waiter_woke=false, and [nvme-bringup-smoke] stale-submit-refused ... reason=nvme-admin-command-not-armed (userspace).
  • Not in scope (this path): I/O queue pairs, read/write commands, and the remaining out-of-scope items below are covered in §8.

8. Brokered I/O queue pair + bounded READ (no-IOMMU)

nvme-io-queue-and-read extends the brokered no-IOMMU lane to one I/O queue pair and one bounded read – the last piece of the userspace NVMe storage-provider foundation. After the §7 IDENTIFY, the provider creates one I/O queue pair (queue id 1) through admin commands, then issues one READ on it. As on every brokered path the manager authors each command’s address-bearing PRP1 from a live ledger page; the provider supplies only the non-addressing dwords and the doorbell index.

  • I/O queue entry sizes: the re-enable (§6) must program CC.IOSQES (bits 19:16, log2 of the 64 B SQ entry = 6) and CC.IOCQES (bits 23:20, log2 of the 16 B CQ entry = 4) before any I/O queue is created (NVMe Base Spec §3.1.5); a CC.EN 1->0 reset clears all of CC, so the provider sets them explicitly (resulting CC = 0x00460001). Creating an I/O queue with IOCQES/IOSQES unset is refused by the controller (QEMU returns command-specific Invalid Queue Size).
  • Create I/O queue commands (wire subset): CREATE I/O COMPLETION QUEUE (opcode 0x05, NVMe Base Spec §5.3) and CREATE I/O SUBMISSION QUEUE (opcode 0x01, §5.4) are admin commands submitted on the admin SQ. CDW10 carries the zero-based queue size (bits 31:16) and queue id (bits 15:0); the create-CQ CDW11 sets PC=1 with IEN=0 (no I/O interrupt; the completion is polled), the create-SQ CDW11 sets the completion queue id (bits 31:16) and PC=1. PRP1 (the queue base) is left zero by the provider and authored by the manager from the I/O CQ (slot 3) / I/O SQ (slot 4) ledger page.
  • Opcode-directed PRP authoring: a write to the admin SQ tail doorbell is routed to nvme_brokered_admin_sq_doorbell (kernel/src/device_manager/mod.rs), which reads the opcode the provider wrote into the just-published SQE (at SQ index tail - 1) and maps it to the live ledger page whose device-visible address is that command’s PRP1: IDENTIFY -> IDENTIFY data (slot 2), CREATE I/O CQ -> I/O CQ (slot 3), CREATE I/O SQ -> I/O SQ (slot 4). An unrecognized opcode fails closed (devicemmio-nvme-admin-submit-unknown-opcode). The opcode is the only provider-supplied input consulted and is non-addressing.
  • READ command (wire subset): opcode 0x02 (NVM Command Set §3.x), NSID 1, starting LBA 0 in CDW10/11, NLB 0 (zero-based, one block) in CDW12. PRP1 (the data buffer) is authored by the manager from the read data page (slot 5).
  • I/O doorbells: the same NvmeControllerRegister claim covers the I/O SQ tail doorbell (0x1008) and I/O CQ head doorbell (0x100c) – queue id 1 with doorbell stride CAP.DSTRD=0 (SQ y tail at 0x1000 + (2y)*4, CQ y head at 0x1000 + (2y+1)*4). The I/O SQ tail doorbell is routed to nvme_brokered_io_sq_doorbell, which requires the opcode to be READ, materializes the READ PRP1 from the live read data page, validates it through the Model B on-notify validator (ScanKind::SqTailDoorbell), authors the SQE PRP, and rings the doorbell. The I/O CQ head doorbell (nvme_brokered_io_cq_head_doorbell) carries only the consumed-entry head index (no address-bearing field).
  • Completion consume: the create-queue and READ completions are consumed by polling the mapped CQ phase tags. The single kernel grant-source injected interrupt delivery is spent on the §7 admin IDENTIFY wake (delivery_count_before == 0 gates injection to one delivery per route), so an interrupt-driven I/O completion wake awaits the device-autonomous MSI-X table-programming increment that §7 already defers. The provider confirms the controller DMA-transferred real data by checking the harness-seeded LBA 0 signature (0x4f504143 = “CAPO”) in its mapped read data page – proving the read moved bytes through the brokered PRP, not a zero page.
  • Fail-closed before submit: an I/O SQ tail doorbell with the I/O CQ/SQ or read data page unallocated, freed, or in-flight returns nvme-io-command-not-armed (devicemmio-nvme-io-submit-not-armed) with no MMIO side effect; an out-of-range index returns devicemmio-nvme-io-doorbell-out-of-range; an opcode other than READ returns devicemmio-nvme-io-submit-unknown-opcode. Teardown frees the read data page first and proves the post-free I/O re-submit fails closed.
  • Proof lines (asserted by tools/qemu-pci-nvme-smoke.sh): nvme: admin-submit ... command=create-io-cq ... admin_data_slot=3 sq_tail=2, [nvme-bringup-smoke] create-io-cq-complete ok ..., nvme: admin-submit ... command=create-io-sq ... admin_data_slot=4 sq_tail=3, [nvme-bringup-smoke] create-io-sq-complete ok ..., nvme: io-submit owner=nvme-storage io_queue_id=1 io_sq_slot=4 io_read_data_slot=5 ... command=read ... io_sq_tail=1 doorbell_offset=0x1008 doorbell_written=performed host_physical_user_visible=false (kernel), [nvme-bringup-smoke] io-read-complete ok command=read cid=0x0053 status=0x0000 ... io_read_dword0=0x4f504143 ... completion_consumed=mapped-io-cq-polled (userspace, the read data proof), nvme: io-complete-ack ... io_cq_slot=3 cq_head=1 ... address_bearing=false, and [nvme-bringup-smoke] io-stale-submit-refused ... reason=nvme-io-command-not-armed (teardown).
  • Not in scope: device-autonomous MSI-X delivery (hardware MSI-X table programming, a device-raised I/O completion interrupt, and an interrupt-driven I/O completion wake), multi-block / write / scatter-gather (PRP-list) I/O, cloud (GCP/AWS/Azure) enumeration or evidence, and host-physical/IOVA export remain out of scope. hostile_hardware_isolation=not-claimed.

9. Production-path cloudboot proofs (non-qemu cloud kernel)

This section covers the non-qemu cloudboot kernel proofs. The older cloud-prod-storage-bound-local-proof storage-bind path predates the later production-stub NVMe manager operations: it binds DeviceMmio/DMAPool/Interrupt surfaces to one NVMe function and exercises an interrupt-dispatch proxy, but does not attempt controller enable, admin commands, I/O queues, IDENTIFY, READ, or a userspace storage provider.

  • Older storage-bind proxy: cap::storage_bind_proof::report (kernel/src/cap/storage_bind_proof.rs) runs under #[cfg(not(feature = "qemu"))] during kernel::run_init (kernel/src/main.rs). It selects an NVMe function with PciDevice::is_nvme_controller (kernel/src/pci.rs), stages a readback DeviceMmio record through device_manager::stage_bar_readback_region, a parked bounce DMAPool/DMABuffer through device_manager::stage_bounce_buffer_dmapool_record and device_manager::issue_manager_attached_dmabuffer_handle_with_request, and one MSI-X interrupt route through device_interrupt plus mask-first PCI MSI-X table programming. The I/O-completion evidence is a kernel-side proxy: device_interrupt::handle_lapic_delivery advances the live dispatch slot, deferred EOI is acknowledged, masked no-wake is checked, and teardown proves stale route/pool/buffer/MMIO handles fail closed. Its marker is cloudboot-evidence: storage-bound <token>, with summary fields such as nvme_admin_identify=not-attempted, nvme_read_command=not-attempted, and waiter_wake=kernel-side-proxy.
  • Later production-stub manager ops: the non-qemu cloudboot kernel also implements real production-stub NVMe operations for the same local QEMU/cloudboot lane. The read-only bind, reset-only CC.EN=0 selected-write claim, parked admin SQ/CQ/data DMABuffer materialization, brokered controller-enable manager operation (DeviceMmio.brokeredNvmeControllerEnable @6), and brokered admin IDENTIFY Controller manager operation (DeviceMmio.brokeredNvmeAdminIdentify @7) live in kernel/src/device_manager/stub.rs and the production grant-source modules. These operations are not the storage-bind proxy: the manager authors AQA/ASQ/ACQ, CC.EN=1, the fixed IDENTIFY SQE, PRP1, SQ tail doorbell, CQ polling, and CQ head doorbell from its parked ledger. The provider still supplies no host-physical address, IOVA, queue base, PRP/SGL, opcode, command id, doorbell offset, or doorbell value.
  • Local production provider chain: the moved cloud-prod-nvme-brokered-userspace-provider-local-proof parent is closed by production-stub child records over the non-qemu cloudboot kernel. The local QEMU/cloudboot chain reaches split admin completion (@8 / @9 plus Interrupt.wait / Interrupt.acknowledge), I/O queue creation (@10 / @11), bounded READ/WRITE (@12-@15), second-LBA/multiblock I/O (@16-@19), synchronous read/write and read-bytes (@20-@22), BlockDevice.readBlocks / writeBlocks / FLUSH, higher-level filesystem and Store consumers, a dedicated data-path completion Interrupt route, and multi-PRP BlockDevice windows. These are manager-authored brokered operations in kernel/src/device_manager/stub.rs and the production proof modules, not provider-authored Model B doorbell writes.
  • READ-arm graduation to always-built production (cloud-prod-nvme-storage-graduate-readarm-local-proof): the NVMe BlockDevice READ arm is the first capstone piece graduated OUT of the per-proof cloud_nvme_*_proof features into always-built production code. The BlockDeviceBackend::NvmeBrokered arm and its arbitrary-window readBlocks @0 body (kernel/src/cap/block_device.rs, cfg(not(qemu))), the shared read body nvme_brokered_io_sync_command / nvme_brokered_io_sync_read_window_op_for_cap and the brokered controller bring-up registers/helpers it reaches (kernel/src/device_manager/stub.rs), and live_handle_for_nvme_blockdevice now compile in the default no-proof cargo build / make capos-cloudboot-image kernel – the GCE-validated production composition. ACTIVATION is fronted by a fail-closed runtime capability probe kernel/src/nvme_storage_backend.rs (dma_backend.rs-style atomic verdict + select_nvme_blockdevice_handle() resolver): the cap is minted only when a staged device_mmio grant resolves a live brokered-controller handle (recording the verdict), else a typed error is returned – never a panic. The no-NVMe default boot leaves the probe unverified, so the block_device grant fails closed. writeBlocks / flush stay fail-closed on the graduated arm (named follow-up graduations). The graduated data plane is bounded, not a general-purpose driver: every command runs through the synchronous single-call seam (kernel/src/device_manager/nvme_sync_io_state.rs), which admits at most 64 single-call I/O commands per boot (MAX_SYNC_OPS) and permanently rejects further commands at the first I/O CQ wrap (no CQ phase-toggle handling) – both limits fail closed. Namespace geometry is IDENTIFY-derived, not assumed: after the fixed three-command bring-up sequence completes, the first geometry consultation issues one manager-authored IDENTIFY Namespace (CNS 0x00, NSID 1, admin SQ index 3 / tail 4, NVMe Base Spec §5.17) through nvme_namespace_geometry_for_cap (kernel/src/device_manager/stub.rs), parses NSZE plus the active LBA format (FLBAS + LBAF LBADS/MS), caches the verdict for the boot, and emits nvme: brokered-identify-namespace ... nsze=... flbas=... lbads=... supported=.... BlockDevice.info @2 and the readonly_fs/persistent_store/writable_fs NVMe BlockSource::info report this IDENTIFY-derived geometry, and every read/write window bound is enforced against it; while the claim is unavailable (bring-up incomplete, a failed claim reset the controller, or an unsupported format – anything other than 512 B data blocks with no interleaved metadata) those paths fail closed instead of falling back to a fixture constant. Proof: make run-cloud-provider-nvme-blockdevice-read-graduated emits cloudboot-evidence: provider-nvme-blockdevice-read-graduated <token> (read_arm=always-built data_plane_feature_gated=false probe_verdict=verified nvme_read_roundtrip_match=true). This is a local QEMU/cloudboot proof; it does NOT claim a live cloud NVMe run, direct DMA, IOVA export, or a write/durability graduation.
  • Production boundary: one production-stub NVMe path now has live GCE Persistent Disk evidence: provider-nvme-io-read completed one brokered 512-byte READ on run 1780806087-bf69. Other production-stub NVMe proofs remain local QEMU/cloudboot evidence unless their task record explicitly says otherwise. The current evidence still does not claim direct DMA, cloud/guest IOMMU support, provider-visible device addresses, device-autonomous MSI-X delivery, AWS/Azure storage, a reusable storage provider, or full filesystem integration. The NVMe BlockDevice READ data plane is graduated to always-built production (above); other write/FLUSH/filesystem consumers and broader windows have local proof coverage but remain bounded by their recorded production proof gates unless their task record explicitly says the surface was graduated.

AWS Nitro EBS (NVMe storage)

This is a provenance map for the AWS Nitro EBS storage shape: how an AWS Nitro instance presents its EBS volumes to the guest, why that surface is the same standard NVMe device the shared NVMe storage-provider foundation already drives, and the small AWS delta capOS adds on top of it. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.

Maturity caveat. This page documents a local QEMU cloud-shape classification, not a bound driver running on real AWS hardware. The NVMe bind/identify/read lifecycle is proven locally on make run-pci-nvme against QEMU’s -device nvme; the AWS delta is the AWS-context classification proof line and the Nitro DMA-backend policy note on top of that shared NVMe foundation. End-to-end AWS EBS enumeration, live namespace I/O, and cloud evidence capture are future work (tracked as cloud-aws-storage-live-proof), blocked until AWS access is provisioned. The ENA NIC is a distinct driver-binding claim (cloud-aws-ena-nic-live-proof) and is out of scope here.

1. Spec basis

  • Device: AWS Nitro EBS controller. All AWS Nitro-based instance families (effectively all current generations) expose attached EBS volumes as NVMe namespaces behind a standard NVMe PCI controller – there is no AWS-specific storage transport and no virtio-scsi alternative (unlike GCP, whose first-/second-generation families use virtio-scsi). PCI class 0x01 (mass storage), subclass 0x08 (NVM), programming interface 0x02 (NVM Express) – the same class triple QEMU emulates with -device nvme and the kernel detects with PciDevice::is_nvme_controller (kernel/src/pci.rs).
  • Production PCI identity: the Nitro EBS controller carries Amazon’s PCI vendor id 0x1d0f (device id 0x8061 for the EBS NVMe controller), distinct from QEMU’s 0x1b36. capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a vendor-id match (see §3); the live vendor-id confirmation belongs to the deferred cloud-aws-storage-live-proof.
  • Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is the wire contract; AWS publishes no separate EBS register spec because the device is a standard NVMe controller. AWS documents the namespace exposure in the “Amazon EBS and NVMe on Linux instances” guide (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html); the in-guest reference driver is the upstream Linux drivers/nvme/host/.
  • Wire-format subset capOS implements: identical to the standard NVMe subset documented in NVMe §1-§2 (controller registers CAP/CC/ AQA/ASQ/ACQ/CSTS, the admin and one I/O submission/completion queue pair, per-queue doorbells, and PRP1/PRP2 data pointers). Nitro EBS adds no fields beyond that subset, so this page does not re-list them.

2. Wire format (relevant subset)

See NVMe §2 and §6-§8. There is no AWS-specific wire format to document: the brokered controller enable (manager-authored AQA/ASQ/ACQ), the admin IDENTIFY, the one I/O queue pair, and the bounded READ all use the standard NVMe encoding the shared foundation already implements and proves.

3. capOS mapping

The AWS delta is a cloud-shape classification plus a DMA-backend policy consumption detail layered onto the shared NVMe storage-provider foundation; it adds no new driver code.

  • Cloud-shape classification proof: after the first enumerated NVMe controller is bound (bind_qemu_nvme_controller), the enumeration path emits a nvme: cloud shape classification cloud_shape=aws-nitro-ebs ... proof line (kernel/src/pci.rs report_cloud_nvme_shape) classifying the bound controller against the documented AWS Nitro EBS device surface. It prints the enumerated pci_vendor/pci_device_id and class/subclass/prog_if, records the production aws_nitro_ebs_vendor=0x1d0f identity as documentation (not as a claimed match), and carries explicit scope flags (local_qemu_precursor=true, real_aws_enumeration=not-claimed, ena=separate-nic-driver-out-of-scope). make run-pci-nvme asserts this line conjunctively with the bounce-buffer dma: backend selection line (tools/qemu-pci-nvme-smoke.sh assert_nvme_cloud_shape), tying the bound device surface to the DMA backend resolved that boot.
  • Nitro IOMMU-availability DMA-backend policy: AWS Nitro does not guarantee guest VT-d remapping the way QEMU’s emulated IOMMU does, so the DMA backend the live AWS path consumes is selected by cloud-dma-backend-selection (kernel/src/dma_backend.rs select_and_report): direct-remapping where a usable+safe IOMMU is positively probe-verified, else the labeled bounce-buffer fallback. The classification line labels the expected backend (aws_labeled_dma_backend=bounce-buffer, dma_backend_policy=direct-remapping-if-verified-else-bounce-buffer); the resolved backend is proven separately by the dma: backend selection line, which on the no-IOMMU make run-pci-nvme gate is bounce-buffer.
  • Brokered DMA / no host-physical exposure: the binding lifecycle reuses the brokered no-IOMMU lane documented in NVMe §6-§8 – the manager authors every address-bearing register and PRP from the live DMA ledger, and host_physical_user_visible=false holds throughout. On a verified remapping lane the provider-written Model B path would apply instead; on the no-IOMMU gate the brokered bounce shape is the only consistent path (see docs/dma-isolation-design.md, “Provider-Written Addresses And No-IOMMU Brokered Bounce”).
  • DeviceMmio / Interrupt / DMAPool: unchanged from the shared foundation – the reset-only CC selected-write claim, the brokered admin and I/O doorbells, the interrupt-driven admin completion wake, and the DMAPool-allocated queue/data pages described in NVMe §4-§8.
  • QEMU-emulable vs hardware-only: the classification and the full bind/identify/read lifecycle are end-to-end QEMU-emulable (make run-pci-nvme). Live EBS enumeration over a real Nitro controller – vendor-id 0x1d0f confirmation, real namespace geometry, and live block I/O – is hardware-only and is the deferred cloud-aws-storage-live-proof.
  • NVMe – the shared NVMe controller wire subset and brokered no-IOMMU storage-provider foundation this shape binds onto.
  • virtio-net – the worked cloud-shape classification example (GCP virtio-net) this page mirrors for AWS storage.
  • docs/dma-isolation-design.md – the DMA-backend selection model and the no-IOMMU brokered bounce policy.
  • docs/backlog/hardware-boot-storage.md – the cloud device tracks, including the deferred live-AWS storage proof.

Azure managed disk (NVMe storage)

This is a provenance map for the Azure managed-disk storage shape: how an Azure VM presents its managed (and local) disks to the guest, why the modern surface is the same standard NVMe device the shared NVMe storage-provider foundation already drives, why the older-family SCSI path is not a usable alternative here, and the small Azure delta capOS adds on top of the shared foundation. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.

Maturity caveat. This page documents a local QEMU cloud-shape classification, not a bound driver running on real Azure hardware. The NVMe bind/identify/read lifecycle is proven locally on make run-pci-nvme against QEMU’s -device nvme; the Azure delta is the Azure-context classification proof line and the Azure DMA-backend policy note on top of that shared NVMe foundation. End-to-end Azure managed-disk enumeration, live namespace I/O, and cloud evidence capture are future work (tracked as cloud-azure-storage-live-proof), to be done when Azure access is provisioned. The Azure MANA NIC is a distinct driver-binding claim (see Azure MANA) and is out of scope here.

1. Spec basis

  • Device: Azure managed-disk storage controller. Azure presents storage in two shapes depending on VM generation:
    • Azure Boost and newer NVMe-capable families expose managed disks (and local SSD) as NVMe namespaces behind a standard NVMe PCI controller – PCI class 0x01 (mass storage), subclass 0x08 (NVM), programming interface 0x02 (NVM Express). This is the same class triple QEMU emulates with -device nvme and the kernel detects with PciDevice::is_nvme_controller (kernel/src/pci.rs). This is the path this page documents.
    • Older VM families present managed disks over a Hyper-V SCSI controller (a virtio-scsi-shaped interface). capOS has no userspace virtio-scsi provider driver, and make run-virtio-blk proves the kernel-owned virtio-blk driver – a kernel-owned driver leaves the hidden kernel DMA ownership the userspace-provider acceptance forbids. The SCSI path is therefore out of scope for this driver (recorded on the classification line as azure_scsi_path=no-userspace-provider-driver-out-of-scope); supporting it would be a separate userspace virtio-scsi provider-driver foundation, not a re-use of the run-virtio-blk gate.
  • Production PCI identity: the Azure Boost NVMe controller carries Microsoft’s PCI vendor id 0x1414, distinct from QEMU’s 0x1b36. capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a vendor-id match (see §3); live vendor-id confirmation and real namespace geometry belong to the deferred cloud-azure-storage-live-proof.
  • Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is the wire contract; Azure publishes no separate managed-disk register spec because the modern device is a standard NVMe controller. Azure documents the Boost NVMe interface and namespace exposure in the “Azure Boost” and “Enable NVMe” VM documentation (https://learn.microsoft.com/azure/virtual-machines/enable-nvme-interface); the in-guest reference driver is the upstream Linux drivers/nvme/host/.
  • Wire-format subset capOS implements: identical to the standard NVMe subset documented in NVMe §1-§2 (controller registers CAP/CC/ AQA/ASQ/ACQ/CSTS, the admin and one I/O submission/completion queue pair, per-queue doorbells, and PRP1/PRP2 data pointers). Azure Boost adds no fields beyond that subset, so this page does not re-list them.

2. Wire format (relevant subset)

See NVMe §2 and §6-§8. There is no Azure-specific wire format to document: the brokered controller enable (manager-authored AQA/ASQ/ACQ), the admin IDENTIFY, the one I/O queue pair, and the bounded READ all use the standard NVMe encoding the shared foundation already implements and proves.

3. capOS mapping

The Azure delta is a cloud-shape classification plus a DMA-backend policy consumption detail layered onto the shared NVMe storage-provider foundation; it adds no new driver code.

  • Cloud-shape classification proof: after the first enumerated NVMe controller is bound (bind_qemu_nvme_controller), the enumeration path emits a nvme: cloud shape classification cloud_shape=azure-managed-disk ... proof line (kernel/src/pci.rs report_cloud_nvme_shape_azure, alongside the AWS report_cloud_nvme_shape) classifying the same bound controller against the documented Azure managed-disk device surface. It prints the enumerated pci_vendor/pci_device_id and class/subclass/prog_if, records the production azure_nvme_vendor=0x1414 identity as documentation (not as a claimed match), records the out-of-scope SCSI path (azure_scsi_path=no-userspace-provider-driver-out-of-scope), and carries explicit scope flags (local_qemu_precursor=true, real_azure_enumeration=not-claimed, mana=separate-nic-driver-out-of-scope). make run-pci-nvme asserts this line (tools/qemu-pci-nvme-smoke.sh assert_nvme_cloud_shape_azure) in the same boot as the bounce-buffer dma: backend selection line asserted by assert_nvme_cloud_shape, tying the bound device surface to the DMA backend resolved that boot.
  • Azure IOMMU-availability DMA-backend policy: Azure does not guarantee a guest-visible VT-d/IOMMU the way QEMU’s emulated IOMMU does, so the DMA backend the live Azure path consumes is selected by cloud-dma-backend-selection (kernel/src/dma_backend.rs select_and_report): direct-remapping where a usable+safe IOMMU is positively probe-verified, else the labeled bounce-buffer fallback. The classification line labels the expected backend (azure_labeled_dma_backend=bounce-buffer, dma_backend_policy=direct-remapping-if-verified-else-bounce-buffer); the resolved backend is proven separately by the dma: backend selection line, which on the no-IOMMU make run-pci-nvme gate is bounce-buffer.
  • Brokered DMA / no host-physical exposure: the binding lifecycle reuses the brokered no-IOMMU lane documented in NVMe §6-§8 – the manager authors every address-bearing register and PRP from the live DMA ledger, and host_physical_user_visible=false holds throughout. On a verified remapping lane the provider-written Model B path would apply instead; on the no-IOMMU gate the brokered bounce shape is the only consistent path (see docs/dma-isolation-design.md, “Provider-Written Addresses And No-IOMMU Brokered Bounce”).
  • DeviceMmio / Interrupt / DMAPool: unchanged from the shared foundation – the reset-only CC selected-write claim, the brokered admin and I/O doorbells, the interrupt-driven admin completion wake, and the DMAPool-allocated queue/data pages described in NVMe §4-§8.
  • QEMU-emulable vs hardware-only: the classification and the full bind/identify/read lifecycle are end-to-end QEMU-emulable (make run-pci-nvme). Live managed-disk enumeration over a real Azure Boost controller – vendor-id 0x1414 confirmation, real namespace geometry, and live block I/O – is hardware-only and is the deferred cloud-azure-storage-live-proof.
  • NVMe – the shared NVMe controller wire subset and brokered no-IOMMU storage-provider foundation this shape binds onto.
  • AWS Nitro EBS (NVMe storage) – the sibling cloud NVMe storage shape; same shared foundation, different cloud provenance. AWS is NVMe-only with no SCSI alternative, whereas Azure’s older families use SCSI.
  • virtio-net – the worked cloud-shape classification example (GCP virtio-net) the storage classifications mirror.
  • Azure MANA – the distinct Azure NIC driver-binding claim, out of scope for this storage surface.
  • docs/dma-isolation-design.md – the DMA-backend selection model and the no-IOMMU brokered bounce policy.
  • docs/backlog/hardware-boot-storage.md – the cloud device tracks, including the deferred live-Azure storage proof.

GCP Persistent Disk (storage)

This is a provenance map for the GCP Persistent Disk (PD) storage shape: how a GCE instance presents its persistent disks to the guest, why most current families expose them as standard NVMe namespaces the shared NVMe foundation already drives, and the small GCP delta capOS adds on top. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.

Maturity caveat. This page documents one bounded live-GCE NVMe Persistent Disk proof on a c3-standard-4 VM, plus the local QEMU/cloudboot proofs that preceded it. The live proof is a single brokered NVMe READ through provider authority; it is not a general reusable storage provider, filesystem integration, virtio-scsi path, Local SSD path, direct-DMA claim, or device-autonomous MSI-X claim. The older cloud-prod-storage-bound-local-proof composes production grant surfaces over a discovered NVMe function and emits cloudboot-evidence: storage-bound on a local boot of the make capos-cloudboot-image disk under QEMU. The later cloud-prod-nvme-brokered-userspace-provider-local-proof child chain drives the same local QEMU -device nvme surface through brokered controller bring-up, admin IDENTIFY, I/O queue creation, BlockDevice read/write/flush, a dedicated data-completion Interrupt route, and multi-PRP windows while preserving manager-authored queue-base/PRP materialization. The live GCE closeout is the cloud-gcp-storage-driver run described in §6.

1. Spec basis

  • Device: GCE Persistent Disk. GCE exposes attached PD volumes as a block device on the guest PCI surface. The legacy first-/second-generation families use virtio-scsi; current generations (Tau T2A, third-generation-or-later N2/N2D/C3, Confidential VM paths) expose them as NVMe namespaces behind a standard NVMe PCI controller – PCI class 0x01 (mass storage), subclass 0x08 (NVM), programming interface 0x02 (NVM Express) – the same class triple QEMU emulates with -device nvme and the kernel detects with PciDevice::is_nvme_controller (kernel/src/pci.rs).
  • Production PCI identity: the GCE NVMe PD controller carries Google’s PCI vendor id (current generation 0x1ae0, distinct from QEMU’s 0x1b36). capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a QEMU vendor-id match (see §3). The live cloud-gcp-storage-driver run confirmed the GCE NVMe PD identity as vendor.1ae0 / dev.001f on BDF 0000:00:05.0.
  • Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is the wire contract; Google publishes no separate PD register spec because the device is a standard NVMe controller on the NVMe-family GCE shapes. Google documents PD device exposure under the “Persistent Disk overview” and “Local SSD” pages (https://cloud.google.com/compute/docs/disks).
  • virtio-scsi alternative: older GCE families use virtio-scsi for PD rather than NVMe. capOS has no userspace virtio-scsi provider driver and the in-tree make run-virtio-blk proves the kernel-owned virtio-blk driver, which would leave the hidden kernel DMA ownership the userspace-provider acceptance forbids. So the older-family virtio-scsi path is recorded out of scope here (gcp_scsi_path=no-userspace-provider-driver-out-of-scope), the same shape as docs/devices/azure-disk.md records for the Hyper-V/virtio-scsi older-family path.

2. Wire format (shared with docs/devices/nvme.md)

GCE NVMe PD is standard NVMe: the controller registers, admin SQ/CQ descriptors, IDENTIFY data, I/O SQ/CQ descriptors, PRP entries, and the on-notify validator scan targets are exactly the ones documented in NVMe §2. No GCP-specific subset is reproduced here. The shared NVMe storage-provider foundation (nvme-bind-claimed-mmio-read, nvme-controller-reset-selected-write, nvme-no-iommu-brokered-controller-enable, nvme-admin-queue-identify, nvme-admin-interrupt-delivery, nvme-io-queue-and-read) is the same wire model the local production cloudboot chain ports into kernel/src/device_manager/stub.rs and the production grant-source modules. The cloud-gcp-storage-driver closeout validated that provider/storage binding against the live GCE PD controller identity and evidence surface for one bounded NVMe READ.

3. capOS mapping

  • Cloud-shape classification: kernel/src/pci.rs report_cloud_nvme_shape (the GCP path) classifies the bound controller against the GCE NVMe surface and emits the nvme: cloud shape classification cloud_shape=gcp-persistent-disk ... proof line on make run-pci-nvme, conjunctively with the bounce-buffer dma: backend selection line.
  • DMA backend: GCE IOMMU-availability is the direct-remapping-if-verified-else-bounce-buffer policy from cloud-dma-backend-selection and the “Cloud DMA Backend” section of docs/dma-isolation-design.md. The 2026-05-24 GCE live probes recorded n1-standard-1, e2-small, c3-standard-4, and n2d-standard-2 Confidential shapes as IOMMU disabled → SWIOTLB → labeled bounce-buffer in Cloud DMA Provider Evidence Inventory, so the cloud-shape proof line and the production storage-bind proof both run conjunctively with the bounce-buffer DMA backend.
  • No host-physical / IOVA export: iova_export=disabled-future-only, host_physical_user_visible=0, direct_dma=blocked, real_dma=not-attempted — the same brokered-bounce shape NVMe records in §6–§8 of nvme.md and the production storage-bind proof records in §9.

4. Production storage-bind proof (local QEMU; non-qemu kernel)

cloud-prod-storage-bound-local-proof (the prerequisite of the billable cloud-gcp-storage-driver slice) lands the production-path NVMe storage-bind proof on the non-qemu cloud kernel. The implementation, composition, MSI-X table program, I/O-completion handoff (kernel-side proxy), masked-no-wake, teardown / stale-handle assertions, headline cloudboot evidence shape, why the proof is settled with a kernel-side proxy, and asserted proof lines are documented once in nvme.md §9 and not reproduced here. The marker is parsed by tools/cloudboot/run-test.sh as STORAGE_BOUND_MARKER into provider.json.storage_bind_proof.

The local QEMU boot of target/disk.raw (make capos-cloudboot-image, -device nvme) demonstrates the bound on QEMU’s NVMe class triple; it does not exercise a live GCE PD NVMe vendor id.

5. Local production brokered NVMe provider chain

The moved parent cloud-prod-nvme-brokered-userspace-provider-local-proof closes the local production provider prerequisite through its child records. The implemented path is the same brokered no-IOMMU shape as nvme.md: the manager authors AQA/ASQ/ACQ, queue-base pages, PRP1 entries, PRP lists, doorbells, and completion consumption from live DMAPool ledger records. The provider sees capability results and returned data bytes, not host-physical addresses, IOVAs, queue-base values, or provider-authored PRP/SGL fields.

The local evidence covers:

  • brokered controller enable and admin IDENTIFY;
  • I/O queue creation, bounded READ/WRITE, second-LBA and multiblock I/O;
  • BlockDevice.readBlocks, writeBlocks, and FLUSH-backed higher-level consumers over the NvmeBrokered backend;
  • dedicated data-path Interrupt.wait / Interrupt.acknowledge completion proof;
  • multi-PRP windows larger than one PRP1 page, with PRP list entries written by the manager.

This remains the local QEMU/cloudboot foundation under the same brokered authority model. The billable real-GCE Persistent Disk bind run is the bounded NVMe evidence in §6.

6. Live GCE NVMe Persistent Disk proof

cloud-gcp-storage-driver closed with live GCE run 1780806087-bf69, launched by make cloudboot-gcp-storage-nvme-io-read-test at source commit 28518165518c29a48633682f4a6d9b5844c43335. The run used a c3-standard-4 instance in europe-west3-a with storage_interface=nvme. The harness launched with GVNIC guest feature / NIC type because C3 requires that launch posture; this storage page does not claim a gVNIC driver or NIC datapath proof.

The evidence identified the GCE PD NVMe controller as class 01.08.02, vendor.1ae0, device.001f, BDF 0000:00:05.0, with selected_dma_backend=bounce_buffer and enumeration_source=legacy-io. The manager drove the shared brokered NVMe chain: admin IDENTIFY, I/O CQ/SQ creation, and one I/O READ against NSID 1, SLBA 0, NLB 1 / 512 bytes. The serial marker recorded live_cloud=gce-persistent-disk, io_read=completed, io_sq_doorbell=performed, io_cq_completion=polled-io-cq, prp_source=manager-ledger, host_physical_user_visible=0, and iova_export=disabled-future-only. The read digest prefix was eb3c904c494d494e4520200002000000.

The capOS authority mapping is the same one recorded in nvme.md: DeviceMmio gates BAR register and doorbell effects, DMAPool owns queue/data pages and manager-authored PRP materialization, and Interrupt is present as the bounded provider authority surface. The live read proof polls the I/O CQ; it does not claim device-autonomous MSI-X delivery. The cloud harness evidence also recorded no public IP, no service account, and teardown_status=complete.

7. Not in scope

  • The older-family virtio-scsi PD path (gcp_scsi_path=no-userspace-provider-driver-out-of-scope).
  • The Local SSD storage path (separate device surface, deferred).
  • Multi-namespace, FUA, DSM, reusable BlockDevice/filesystem integration on live GCE, or live-provider device-autonomous completion delivery (deferred per nvme.md).
  • Direct DMA, IOVA export, IOMMU/remapping programming (the direct-remapping-if-verified branch of the DMA-backend policy applies once a GCE shape with a verified vIOMMU is added; no current probed GCE shape satisfies that branch).
  • AWS EBS, Azure managed disk, and GCP NIC readiness.

ATAPI CD-ROM + ISO 9660 (boot-time reader)

This is a provenance map for the boot-time CD-ROM read path: it cites the specs, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec. Unlike the PCI/virtio device pages, this is a legacy port-I/O hardware transport used only during boot or install-source proofs to read ELF/package bytes from an ISO; the capOS-mapping section reflects its boot-only, kernel-owned status. The boot source itself is planned CD-ROM/ISO support, not a deprecated path. The driver is concise and feature-gated (boot_iso_read / boot_iso), so the treatment is a short map rather than exhaustive register tables.

The whole reader lives in kernel/src/iso/mod.rs.

1. Spec basis

  • Device: ATAPI CD-ROM on a legacy IDE (Parallel ATA) channel, accessed by polled PIO over the legacy I/O ports. Not a PCI/virtio device and not enumerated through PCI; the two legacy channels are probed at fixed port bases (PRIMARY_CMD/PRIMARY_CTRL 0x1F0/0x3F6, SECONDARY_CMD/SECONDARY_CTRL 0x170/0x376). QEMU’s -cdrom shorthand attaches the disc on the secondary channel (master), which AtapiDevice::probe scans first.
  • Authoritative specs:
    • ATA Packet Interface — the PACKET command transport and the ATA/ATAPI register protocol, as standardized in the SFF-8020i / ATA/ATAPI-4+ family (INCITS T13). The PACKET data-in handshake, the ATAPI signature in the cylinder-low/high (LBA mid/high) registers, and the READ(12) / READ CAPACITY(10) command-descriptor blocks come from this basis. The signature the driver matches is ATAPI_SIG_MID (0x14) in the LBA-mid register and ATAPI_SIG_HIGH (0xEB) in the LBA-high register.
    • ECMA-119 (equivalently ISO 9660), Volume and File Structure of CDROM for Information Interchange — the volume-descriptor and directory-record on-disk layout the IsoFs parser indexes. The relevant structures are the primary volume descriptor (PVD) and the directory record.
  • Reference: the legacy IDE/ATAPI PIO sequence and the ISO 9660 fixed-offset field layout are the well-documented OSDev-wiki “ATAPI”/“ISO 9660” baseline; cross-checked against the QEMU IDE/ATAPI device behavior the proofs run against.

2. Wire format (implemented subset)

Only the polled-PIO read subset the driver uses is summarized; ATA features the driver never issues (DMA transfers, write commands, the full SCSI command set) are not implemented and are not transcribed here.

ATAPI PACKET read path

  • Channel register map: command-block register offsets relative to the channel command base — REG_DATA (0), REG_FEATURES (1), REG_SECCOUNT (2), REG_LBA_LOW (3), REG_LBA_MID (4), REG_LBA_HIGH (5), REG_DRIVE (6), REG_STATUS/REG_COMMAND (7). In the ATAPI PACKET protocol REG_LBA_MID / REG_LBA_HIGH carry the byte-count low/high for the data phase, not an LBA.
  • Status / control bits: STATUS_BSY (0x80), STATUS_DRQ (0x08), STATUS_ERR (0x01); control-block CTRL_NIEN (0x02, interrupts disabled — this path is polled only) and CTRL_SRST (0x04, soft reset in soft_reset). Drive select is DRIVE_MASTER (0xA0) / DRIVE_SLAVE (0xB0).
  • Probe / detect: AtapiDevice::probe soft-resets each channel and calls detect, which selects a drive and matches the ATAPI_SIG_MID / ATAPI_SIG_HIGH signature; a 0xFF status is treated as a floating (empty) bus. Every status spin is bounded by SPIN_LIMIT (wait_not_busy / wait_drq), so an absent or wedged device fails closed rather than hanging boot.
  • PACKET command issue (AtapiDevice::packet_data_in): writes CMD_PACKET (0xA0) to the command register, programs the per-block byte-count limit BYTE_LIMIT (2048, one CD logical sector) into the LBA mid/high registers, waits for DRQ, then writes the 12-byte command-descriptor block (CDB) as six 16-bit words. The data-in phase reads each DRQ block, taking its byte count from the LBA mid/high registers; it rejects a byte count over BYTE_LIMIT, an odd byte count, or one that would overflow the destination buffer (IsoError::Protocol / BufferTooSmall).
  • CDBs implemented: CDB_READ12 (0xA8) with the big-endian LBA at bytes 2..6 and transfer length at bytes 6..10 (built in AtapiDevice::read_sectors), and CDB_READ_CAPACITY10 (0x25, in AtapiDevice::read_capacity) returning the last addressable LBA and the logical block size. A reported block size is range-checked against MIN_BLOCK_SIZE (2048) / MAX_BLOCK_SIZE (4096).
  • Bounded sector read (AtapiDevice::read_sectors): rejects a zero count, arithmetic overflow (IsoError::InvalidRequest), an LBA range past the reported capacity (OutOfRange), and a destination buffer shorter than count * block_size (BufferTooSmall), all before any device access. A device that returns fewer bytes than requested is rejected (ShortRead).

ISO 9660 volume structure

  • Primary volume descriptor: IsoFs::mount reads PVD_LBA (sector 16, after the reserved system area) through read_sectors and validates the descriptor type (pvd[0] == 1), the CD001 standard identifier (pvd[1..6]), and the version (pvd[6] == 1). ISO 9660 stores integers both-endian (a little-endian half followed by a big-endian half); the driver reads the little-endian half with le_u16 / le_u32. It indexes the logical block size (both-endian u16 at offset 128, which must equal the device block size), the volume space size (both-endian u32 at offset 80), and the embedded root directory record (34-byte MIN_DIR_RECORD at offset 156, whose extent LBA is bytes 2..6 and size bytes 10..14).
  • Directory records: IsoFs::lookup / list_boot_bins walk a directory extent record by record. Each record’s length is byte 0 (a zero length skips to the next logical-sector boundary); the file-flags byte 25 carries FILE_FLAG_DIR (0x02); the file-identifier length is byte 32 and the identifier starts at byte 33; the extent LBA/size are bytes 2..6 / 10..14. The ./.. self/parent records (identifier 0x00/0x01) are skipped, and identifiers are matched case-insensitively after stripping the ;version suffix and trailing dots (name_matches / normalize_ident).
  • Path resolution: IsoFs::lookup_path descends from the root through each component; boot_bins_dir resolves /boot/bins/ and open_file resolves a named file beneath it, each returning a validated (lba, size) extent.

3. capOS mapping

  • Binding (boot-only, kernel-owned): this is not a DDF device. It is not enumerated through PCI and does not bind through the DeviceMmio/Interrupt/DMAPool provider grants the cloud-NIC/storage drivers use; it owns fixed legacy I/O ports directly in kernel mode and runs only during boot. There is no userspace driver and no *_grant_source for the reader itself.
    • Under boot_iso_read the kernel runs iso::boot_read_proof / iso::boot_fs_proof (called from kernel/src/main.rs) to exercise the device-read primitive and the ISO 9660 walk.
    • Under boot_iso the reader is the live boot-binary source: iso::boot_source::init (in kernel/src/main.rs run_init) builds a registry resolving each declared manifest binary name to its (lba, size) extent via open_file (name mapping through iso_dname, which applies the ISO 9660 d-character substitution xorriso records), and iso::boot_source::read_binary reads each ELF on demand. The device is owned by the registry and serialized behind a Mutex so concurrent spawns on multiple CPUs do not interleave PIO transfers on the shared IDE channel.
  • Directory/File cap fixture: the read path has no caps of its own, but the installable_image cap (kernel/src/cap/installable_image.rs) layers a read-only Directory/File CapObject over this reader for the focused QEMU install-source proof. It exposes the packaged /boot/bins/ tree to the installer smoke only; it is not a general post-bootstrap ISO filesystem service. It is granted via the qemu-gated installable_image_source (KernelCapSource::InstallableImageSource); Directory.list/ Directory.open + File.read/File.stat are served and every mutating method fails closed (read-only is structural, not a rights flag). It reuses the driver’s in-bounds checks (IsoFs::validate_extent at mount/open, AtapiDevice::read_sectors range validation per read) and is physically scoped to the ATAPI medium, so it cannot reach the writable virtio-blk target disk.
  • MMIO / Interrupt / DMA: none. Access is legacy port I/O (in/out via the module’s inb/outb/inw/outw helpers), not memory-mapped BARs. Interrupts are disabled (CTRL_NIEN) and the path is polled PIO, so there is no MSI/MSI-X vector binding. Transfers move through the data register word by word, so there is no DMA buffer and no IOMMU/bounce-buffer involvement.
  • Fail-closed / validation rules: every derived extent is validated against the volume size (IsoFs::validate_extent) before it is read, so a malformed or hostile volume cannot drive an out-of-bounds device read; directory extents are capped at MAX_DIR_BYTES and records are length/identifier-bounded before trusting them; capacity and buffer-length checks gate read_sectors; block size is floored/ceiled to MIN_BLOCK_SIZE/MAX_BLOCK_SIZE; and every status wait is SPIN_LIMIT-bounded. All failure modes funnel through IsoError (NoDevice/Timeout/DeviceError/Protocol/InvalidRequest/OutOfRange/ BufferTooSmall/ShortRead/BadVolume/NotFound/NotDirectory).
  • QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU’s -cdrom attaches an ATAPI CD-ROM on the secondary legacy IDE channel. make run-boot-iso-read proves the bounded ATAPI PIO read primitive and the ISO 9660 walk; make run-boot-iso and the default make run-smoke prove the live on-demand boot-binary load path; and make run-installable-image-source proves the read-only Directory/File install-source fixture layered over the reader. No hardware-only path.
  • kernel/src/iso/mod.rs — the ATAPI PIO reader, the ISO 9660 IsoFs driver, the boot_iso boot_source registry, and the boot-time proofs.
  • kernel/src/cap/installable_image.rs — the read-only Directory/File cap surface layered over this reader.
  • kernel/src/main.rsrun_init ISO boot-binary registry build and on-demand ELF load under boot_iso.

Azure MANA (Microsoft Azure Network Adapter)

This is a provenance map for the MANA / GDMA wire logic in capos-lib/src/mana.rs: it cites the spec basis, summarizes only the wire-format subset the code actually implements, and points into the implementation by symbol name. It is not a re-spec.

Maturity caveat. This page documents protocol encode/decode logic with a host-side conformance suite, not a bound driver. There is no MANA device in QEMU, so this logic is a deliberate QEMU-exception gated by cargo test-lib plus a warning-free cargo build --features qemu, not a make run-* smoke. End-to-end MANA bind / send / receive / teardown on real Azure hardware – including SR-IOV VF revocation with fallback-to-synthetic and DMA/MMIO/IRQ teardown – is future work (tracked as cloud-azure-mana-nic-live-proof), blocked until Azure access is provisioned. The ## 3. capOS mapping section below therefore describes the planned binding, not landed authority.

1. Spec basis

  • Device: Microsoft Azure Network Adapter (MANA), the modern Azure NIC for Dv5/Ev5 and later VM families. Exposed to the guest as a PCI SR-IOV Virtual Function. PCI vendor 0x1414 (Microsoft); device 0x00ba (VF, the guest-bound function) / 0x00b9 (PF). IDs at capos-lib/src/mana.rs (MANA_PCI_VENDOR_ID, MANA_VF_DEVICE_ID, MANA_PF_DEVICE_ID). The device is fronted by GDMA (Generic DMA), Microsoft’s queue/DMA abstraction; MANA is the network client riding on GDMA queues.
  • Authoritative spec: MANA has no freely published register specification. The basis of record is the upstream open-source MANA Linux driver, whose “HW DATA” structures are the documented wire contract:
    • include/net/mana/gdma.h – GDMA registers, doorbells, message headers, WQE/CQE/EQE, request-type space, device/queue enums.
    • include/net/mana/mana.h – MANA TX/RX OOB descriptors, completion OOBs, mana_cqe_type, mana_command_code.
    • include/net/mana/hw_channel.h, include/net/mana/shm_channel.h – the HWC management channel and the shared-memory bootstrap aperture.
    • Reference snapshot: torvalds/linux master at commit d60ec36cab338dfe2ae40d73e9c8d6c4af70d2b8 (the gdma.h structures are stable across recent kernels).
  • Reference driver: the same MANA Linux driver (drivers/net/ethernet/microsoft/mana/) is the behavior cross-check; mana_gd_init_req_hdr defines the standard request-header construction mirrored by GdmaReqHdr::standard.

2. Wire format (implemented subset)

All multi-byte words are little-endian; GDMA “HW DATA” structures are naturally aligned (not packed). Every decoder validates buffer length, rejects unknown enum members, and enforces must-be-zero (MBZ) reserved fields; every encoder range-checks its bitfields. Symbols below are in capos-lib/src/mana.rs.

  • Registers / BAR: single register BAR (BAR0). VF doorbell-page and shared-memory aperture offsets (GDMA_REG_*) and PF offsets (GDMA_PF_REG_*), the SR-IOV config base, and the fixed CQE/EQE/WQE-BU and max SQE/RQE sizes are in the regs module (REG_DB_PAGE_OFFSET, REG_SHM_OFFSET, PF_REG_*, SRIOV_REG_CFG_BASE_OFF, CQE_SIZE, EQE_SIZE, MAX_SQE_SIZE, MAX_RQE_SIZE, WQE_BU_SIZE).
  • Doorbells: the four-variant union gdma_doorbell_entry is modeled by the DoorbellEntry enum (Cq/Rq/Sq/Eq), encoding the 24- or 16-bit queue id, the 31- or 32-bit tail pointer, the RQ wqe_cnt, and the CQ/EQ arm bit, with kind-specific reserved MBZ enforcement on decode.
  • Admin (HWC) messages: GdmaMsgHdr (gdma_msg_hdr), GdmaDevId (gdma_dev_id), GdmaReqHdr (gdma_req_hdr, with standard mirroring mana_gd_init_req_hdr), and GdmaRespHdr (gdma_resp_hdr, reserved-word MBZ). The request-type space is GdmaRequestType (gdma_request_type, fail-closed); the GDMA admin status is the open GdmaStatus space (success / MoreEntries / CmdUnsupported / preserved Other, since GDMA status is a firmware error space, not a closed enum).
  • Work queue: GdmaSge (gdma_sge, 16-byte SGE with 64-bit address) and GdmaWqeHeader (gdma_wqe, the 8-byte WQE header: num_sge, inline_oob_size_div4, client_oob_in_sgl, client_data_unit, with reserved MBZ). MANA TX OOB descriptors that prepend the SGL: ManaTxShortOob (mana_tx_short_oob, checksum-offload + completion-CQ + vSQ-frame selection) and ManaTxLongOob (mana_tx_long_oob, encapsulation / VLAN / inner-offset fields).
  • Completion / event: GdmaCqeInfo (gdma_cqe.cqe_info: wq_num, is_sq, 3-bit owner_bits) and GdmaEqeInfo (union gdma_eqe_info: event type via GdmaEqeType, client_id, owner_bits). MANA completion OOBs: ManaCqeHeader (mana_cqe_header, cqe_type via the fail-closed ManaCqeType enum), ManaRxcompOob (mana_rxcomp_oob, RX flags + MANA_RXCOMP_OOB_NUM_PPI per-packet ManaRxcompPerpktInfo + RX WQE offset), and ManaTxCompOob (mana_tx_comp_oob, TX data/SGL/WQE offsets + reserved-padding MBZ).
  • Capability / feature negotiation: the verify-version surface (GdmaRequestType::VerifyVfDriverVersion, GDMA_PROTOCOL_V1, GdmaOsType) and the MANA control command space ManaCommandCode (mana_command_code, fail-closed) including QueryDevConfig / QueryVportConfig / ConfigVportTx/Rx / CreateWqObj.

3. capOS mapping (planned – not yet implemented)

MANA is a vendor-custom cloud NIC behind SR-IOV. The intended binding, when the live-proof work is unblocked, follows the same userspace-driver authority gate the other DDF device classes use; none of the grants below are exercised by the host conformance logic.

  • Authority gate: the MANA VF would be enumerated over PCI, claimed through the reviewed userspace-driver hardware-authority gate, and tracked in the device-manager ownership ledger, exactly as the cloud NIC/storage drivers are planned to bind. The current implementation grants nothing.
  • DeviceMmio: BAR0 (the GDMA register block, doorbell page, and SHM aperture) would be mapped device-uncacheable / NX, with doorbell writes scoped to the owning driver’s BAR window. The 64-bit DoorbellEntry values are the writes that path would emit.
  • Interrupt: GDMA EQs deliver completions via MSI-X; the live driver would bind one Interrupt per EQ vector and arm it through the EQ doorbell arm bit. The owner_bits phase mechanism (GdmaCqeInfo/GdmaEqeInfo) is how the driver detects new entries without a tail register.
  • DMAPool: GDMA queues and TX/RX buffers would be allocated from a labeled DMA pool through the selected DMA backend (cloud-dma-backend-selection: direct IOMMU vs labeled bounce buffer), with quiesce/scrub-before-reuse and host-physical-address / IOVA non-exposure. The GdmaSge address fields are IOVAs from that pool; the current implementation does not allocate or program any DMA.
  • Fail-closed / validation rules: the encode/decode logic is the fail-closed boundary capOS implements today – unknown request/queue/event/completion types and command codes are rejected, reserved fields are MBZ-enforced, and bitfields are range-checked. Stale-generation rejection, BAR bounds, doorbell scoping, and release/reset/VF-revocation teardown are the live driver’s responsibility and are future work.
  • QEMU-emulable vs hardware-only: none of MANA is QEMU-emulable – QEMU has no MANA device model. The wire logic here is provable only by the host conformance suite (cargo test-lib); SR-IOV VF revocation/hot-remove semantics in particular cannot be reproduced even by a hypothetical QEMU MANA device model and remain a live-hardware concern.

GCE gVNIC (Google Virtual Ethernet)

This is a provenance map for gVNIC, the Google Virtual NIC presented to Compute Engine guests. It cites the public specification basis, summarizes only the wire-format subset a capOS driver would implement, and maps the device onto capOS’s userspace-driver hardware-authority gate. It is not a re-spec: where the behavior is defined in the upstream driver or the public docs, it links rather than transcribing register tables.

Maturity caveat. This page remains primarily a grounding map. capOS has landed live-GCE proofs that request the GVNIC image/instance posture, record the gVNIC PCI function (1ae0:0042) with BAR and MSI-X metadata, map BAR0 through DeviceMmio, use manager-owned DMA pages for the admin queue and descriptor buffer, and bring up one GQI/QPL TX/RX queue pair far enough to send one DHCP DISCOVER raw Ethernet frame and receive one inbound IPv4 frame before teardown. capOS also has a bounded hardware-only typed Nic adaptation proof over that same queue path: the proof marker records Nic.transmit, Nic.receive, Nic.macAddress, and Nic.linkStatus semantics with inline frame transfer and no host-physical/IOVA export. capOS still has no reusable gVNIC provider service and no host conformance suite. There is no gVNIC device model in QEMU, so unlike the virtio-net path there is no local make run-* smoke that can execute the device. The ## 3. capOS mapping section distinguishes the landed inventory/admin-queue/raw-frame proof and typed Nic-adaptation proof from future productionization work. The bounded implementation lane that consumes this map is decomposed in Hardware, Boot, and Storage.

gVNIC is a separate GCE portability lane, not a blocker for the first public Web UI proof. GCE exposes a selectable VIRTIO_NET NIC type on supported first/second-generation machine families, and capOS already drives modern virtio-net (see virtio-net). A first public Web UI proof scoped to a virtio-compatible GCE machine type needs no gVNIC support. gVNIC matters because Google documents it as the Compute Engine NIC alternative to virtio, with third-generation-and-later machine series supporting only gVNIC for virtual network interfaces; it is the portability lane for those shapes, not a precondition for the virtio-net Web UI proof.

1. Spec basis

  • Device: Google Virtual NIC (gVNIC), the modern Compute Engine virtual network interface. Exposed to the guest as a PCI function with vendor 0x1ae0 (Google) and device 0x0042. The same vendor/device pair is recorded for the GCP NIC path in Cloud Deployment (“PCI Device IDs for Cloud Hardware”). The upstream Linux driver names the device family GVE (Google Virtual Ethernet).
  • Authoritative spec: gVNIC has no freely published register specification. The basis of record is the combination of:
  • Reference driver: the upstream GVE Linux driver (drivers/net/ethernet/google/gve/) is the behavior cross-check for the admin-queue handshake, queue creation, and the two descriptor formats.

2. Wire format (subset a capOS driver would implement)

The subset below is the slow-path bring-up plus one traffic-queue format a minimal capOS gVNIC driver would need. Exact register offsets, opcode numbers, and descriptor bit layouts are defined in the GVE headers cited above and are not transcribed here — this is a map, not a re-spec. Endianness is not uniform on this device: admin-queue messages and GQI descriptors are big-endian, while DQO descriptors are little-endian (per the GVE driver docs), so a capOS decoder/encoder must select endianness per structure.

  • Registers / BARs: three 32-bit memory BARs.
    • BAR0 — device configuration and status registers (the gve_register.h block): GVE_DEVICE_STATUS / driver-status handshake, max TX/RX queue counts, the admin-queue PFN and doorbell, the admin-queue event counter, and the reset trigger.
    • BAR1 — the MSI-X vector table.
    • BAR2 — the IRQ doorbells plus the per-queue RX and TX doorbells.
  • Admin queue (AQ): a single page-sized command array. The driver writes a command into a free slot, advances its submission counter, rings the admin-queue doorbell in BAR0, and polls the admin-queue event counter until the device marks the command executed and writes back its status. The gve_adminq.h opcode space covers device description and resource lifecycle (describe device, configure/deconfigure device resources, register/unregister page list, create/destroy TX queue, create/destroy RX queue, and feature/option negotiation). The landed capOS proofs register the AQ page, issue DESCRIBE_DEVICE, parse the returned descriptor and GQI/QPL option, configure device resources with two notification blocks, register TX/RX queue page lists, create one TX and one RX queue, then destroy/unregister/deconfigure and release the admin queue before emitting evidence.
  • Interrupt classes: MSI-X only, in two roles.
    • A management interrupt that tells the driver to re-examine GVE_DEVICE_STATUS (link / device-state changes).
    • Notification-block interrupts, one block servicing a set of traffic queues; a block firing tells the driver to poll the associated queues. The notification blocks are the per-queue completion-signal path.
  • Queue formats (GQI vs DQO): gVNIC defines two mutually incompatible descriptor formats; a device instance negotiates one.
    • GQI (“Google Queue Interface”): fixed-size, power-of-two descriptor rings; the classic format. Big-endian descriptors.
    • DQO (“Descriptor Queue, Out-of-order”): split descriptor and completion queues with per-completion generation bits for ownership tracking and 16-bit tags identifying which posted buffer a completion refers to, allowing out-of-order completion. Little-endian descriptors. DQO is the format the newer machine families use.
  • Addressing modes (QPL vs RDA): independent of the descriptor format, each queue uses one of two buffer-addressing modes.
    • QPL (“queue page list”): the driver pre-registers a fixed set of guest pages with the device through the admin queue, and descriptors reference offsets into that registered page list rather than arbitrary guest physical addresses. The device only ever DMAs into pages the driver explicitly registered.
    • RDA (“raw DMA addressing”): descriptors carry guest DMA addresses directly, so the device can DMA to dynamically allocated guest memory.
  • Descriptor / ring ownership: the driver owns descriptor production and doorbell rings; the device owns completions. In GQI the device advances a completion/used position the driver reads; in DQO the device writes completion entries whose generation bit flips when the entry is the device’s to consume, so the driver detects new completions without a separate tail register.
  • Reset / link-up sequence: bring-up drives the BAR0 device-status / driver-status handshake, sets up the admin queue (legacy revision: program the AQ PFN; newer revisions: program AQ length/base and set driver-status RUN), issues the admin commands above to describe the device and create queues, and arms the notification-block interrupts. Teardown follows the upstream driver: legacy revision writes 0x0 to the AQ PFN and waits for it to read back zero; newer revisions write driver-status RESET and wait for DEVICE_IS_RESET.
  • Known unsupported / out-of-scope features: offloads (checksum, TSO/LRO, RSS hashing), jumbo frames, multi-queue scaling beyond a single TX/RX pair, and the RDA addressing mode are out of scope for an initial bring-up. The first capOS lane targets QPL addressing with one TX and one RX queue (see §3).

3. capOS mapping

gVNIC is a vendor-custom cloud NIC. capOS now exercises inventory, admin-queue/register, bounded raw-frame GQI/QPL TX/RX, and a bounded typed Nic-adaptation proof in private GCE runs. Productionization remains future work: there is no reusable gVNIC provider service, local device model, DQO/RDA support, or host conformance suite yet.

  • Authority gate: the gVNIC PCI function is inventoried over the production PCI enumeration source. The admin-queue proof binds BAR0 and a manager-owned DMA pool for one DESCRIBE_DEVICE command (kernel/src/cap/gvnic_adminq_register_proof.rs). The raw-frame proof (kernel/src/cap/gvnic_raw_frame_proof.rs) then uses the same device-manager authority model to configure one GQI/QPL TX/RX queue pair, transmit one DHCP DISCOVER, poll a bounded RX descriptor completion, and tear the queues down. The cloud_gce_gvnic_nic_cap_adaptation_proof build reuses that module’s report_nic_cap_adaptation path to prove the existing Nic ABI semantics over the same GQI/QPL data path: the marker records inline-frame Nic.transmit / Nic.receive, Nic.macAddress, and Nic.linkStatus evidence without exposing queue addresses or emitting the broader provider bind claim. Both proofs use kernel/src/pci.rs find_driver_bind_device for resolved-source driver enumeration and kernel/src/device_manager/stub.rs devicemmio_kernel_window_for_proof for the live BAR0 DeviceMmio window. They do not issue a reusable userspace gVNIC provider service and do not claim provider-nic-bound.
  • DeviceMmio: the landed proof stages BAR0 as a device-manager DeviceMmio record, bounds all big-endian register accesses to the staged window, rings the admin-queue doorbell, and detaches the record with a stale-handle assertion. The raw-frame proof also maps a bounded 64 KiB BAR2 kernel-only doorbell window and validates returned TX/RX doorbell indexes before ringing them. BAR1 MSI-X remains unprogrammed in this polling proof.
  • Interrupt: the management interrupt and each notification-block vector would each bind one Interrupt cap over an MSI-X table entry, with the same mask-first / deferred-LAPIC-EOI lifecycle the landed production interrupt path uses (kernel/src/device_interrupt.rs, exercised by the virtio-net userspace IRQ-ownership slice). gVNIC uses MSI-X exclusively — there is no legacy-IRQ fallback. The admin-queue proof does not program MSI-X.
  • DMAPool / DMABuffer: the admin-queue pages come from the manager-owned bounce-buffer pool through stage_bounce_buffer_dmapool_record and issue_manager_attached_dmabuffer_handle_with_request. The raw-frame proof keeps larger queue resources and QPL pages manager/proof-owned, publishes device-visible addresses only internally to the hardware, and never grants userspace a DMABuffer cap or raw host-physical/IOVA value. It asserts DmaBufferCap::info_for_handle reports host_physical_user_visible=0, device_iova=0, and iova_export=disabled-future-only. Teardown destroys queues, unregisters both QPLs, deconfigures device resources, releases/resets the admin queue, scrubs/frees traffic frames, requires scrub/ledger removal/frame-free labels for manager buffers, and checks stale pool/buffer/MMIO handles. Future reusable gVNIC provider integration must use the same selected DMA backend model documented in DMA Isolation.
  • Fail-closed / validation rules: the landed proof emits cloudboot-evidence: gvnic-adminq-register <token> or cloudboot-evidence: gvnic-raw-frame-tx-rx <token> only after the bounded command/traffic sequence passes, the release/reset handshake completes, the PCI command register is restored, and stale DeviceMmio/DMAPool/DMABuffer handles all fail closed. The typed adaptation proof emits cloudboot-evidence: gvnic-nic-cap-adaptation <token> only after the same teardown and stale-handle checks plus Nic-semantic TX/RX evidence. If queue or admin-queue release times out, the proof intentionally leaves still-owned DMA pages live and emits no success marker rather than freeing memory the device may still own.
  • QEMU-emulable vs hardware-only: none of gVNIC is QEMU-emulable — QEMU has no gVNIC/GVE device model. Every bind step is therefore hardware-only and requires a private, explicitly billable GCE instance launched with the GVNIC guest-OS feature and nic-type=GVNIC. The lane is gated accordingly: the landed inventory proof (cloud-gce-gvnic-image-launch-inventory-proof), the landed admin-queue/register proof (cloud-gce-gvnic-adminq-register-proof), the landed bounded raw-frame TX/RX proof (cloud-gce-gvnic-raw-frame-tx-rx-proof), and the landed typed Nic adaptation proof (cloud-gce-gvnic-nic-cap-adaptation-proof). Each is decomposed in Hardware, Boot, and Storage and requires a private, explicitly billable GCE run for hardware evidence.

Documentation Workflow

The published documentation is organized as a system manual first. The top of docs/SUMMARY.md should lead with pages that explain how to understand, build, boot, configure, operate, and review the current capOS implementation.

The mdBook site may keep the wider project corpus reachable for maintainers: roadmap, changelog, backlog, proposal, paper, and research files can remain under the lower archive section. Those pages should not shape the primary reader path, and they should not be treated as part of the generated PDF manual unless they become current system documentation.

PDF Manual Pipeline

The PDF is a Typst-authored manual shell plus generated body content:

  1. docs/manual.typ explicitly lists the Markdown pages that belong in the generated manual with {{CAPOS_MANUAL_PAGE:...}} placeholders. The mdBook site navigation in docs/SUMMARY.md can point at a different landing page or archive structure without changing the PDF contents.
  2. tools/docs-bundle.js reads that explicit page list, rewrites bundled-doc links to PDF-local heading anchors, emits the aggregate generated Markdown at target/docs-bundle/manual.md, and emits one Markdown file per manual page under target/docs-bundle/.
  3. mdbook-mermaid checks Mermaid syntax, and mermaid-cli converts Mermaid blocks in the generated Markdown to 2x PNG artifacts under target/docs-bundle/.
  4. uv tool run --constraints tools/md2typst-constraints.txt --from md2typst==0.3.3 md2typst converts each generated Markdown page to Typst with the converter dependency set pinned.
  5. tools/build-typst-manual.js normalizes the converted pages, fills docs/manual.typ with generated version/date/source metadata and the selected page include paths, and writes target/docs-bundle/manual.pdf.typ. The normalizer also collapses Markdown source-wrap line breaks outside code blocks so PDF prose and list items use normal paragraph layout, demotes generated page headings so manual parts remain the only top-level outline entries, and scales selected tall Mermaid diagrams so they fit with their surrounding manual context instead of becoming orphaned figure pages.
  6. The pinned Typst binary compiles the final PDF.

docs/manual.typ owns the PDF document structure: title page, version block, table of contents, page setup, base typography, and the explicit manual page order. Manual part dividers are top-level headings; generated page titles are demoted during PDF normalization so chapters sit below those parts instead of appearing as peers.

Most manual pages are generated from Markdown through md2typst. A page can be overridden for the PDF only by adding a checked-in Typst file at docs/manual-overrides/<page-id>.typ, where <page-id> is the source path with non-alphanumeric characters collapsed to hyphens, for example docs/manual-overrides/architecture-memory.typ. Overrides replace the generated Typst page in the PDF but do not change the mdBook page or target/docs-bundle/manual.md. Override files are copied into target/docs-bundle/ before Typst compilation and should be self-contained Typst fragments.

Benchmark result tables stay in their source Markdown pages. If a wide benchmark table needs PDF-specific layout, mark that Markdown table with <!-- capos-benchmark-results:<id> start --> and matching end comments. The mdBook site renders the source table, while tools/build-typst-manual.js parses the marked table and replaces only that table region with a compact Typst rendering. Keep interpretation, caveats, and conclusions in normal prose around the table rather than encoding them in the table parser.

Generated files under docs/topics.md, target/docs-bundle/, and target/docs-site/ must remain untracked.

The mdBook metadata preprocessor and PDF bundler normalize default cross-document link labels. When a link label is only the target Markdown path or filename, rendered site and manual output use the target document title instead. Keep explicit prose labels in source when the surrounding sentence needs a more specific phrase than the document title.

PDF Typography Rules

The manual and the schema paper should share a conservative typographic base: letter paper, readable serif body text, a restrained heading scale, consistent link color, consistent code styling, and predictable figure/table captions. They do not need identical layouts. The paper can remain citation-oriented and formal; the manual should favor scanning, command lookup, and dense technical reference pages.

For the manual PDF:

  • Keep body text readable before optimizing page count. Avoid global spacing changes that create worse page breaks or orphaned callouts.
  • Use headings as navigation markers: leave more room before a heading than after it, and keep headings with the first paragraph or code block whenever practical. In the manual PDF, the below-heading gap must be visibly larger than ordinary line leading, while the above-heading gap remains larger than the below gap so the heading belongs to the content that follows.
  • Treat long bullets as structure problems. Prefer short bullets, definition lists, or command/proof tables over paragraph-length list items.
  • Use framed code blocks for commands and transcripts. Give them visible internal padding, a very light background, and enough surrounding whitespace to read as intentional panels.
  • Keep inline code sparse in prose. When a sentence accumulates several commands, paths, or target names, prefer a code block or table.
  • Use one callout style consistently. A left rule or light box is acceptable, but the callout needs enough padding that it does not look like accidental indentation.
  • Avoid visual changes without checking rendered pages. Review at least one command-heavy page and one dense prose/list page after each PDF style change.

Scope Rules

The PDF manual includes current system documentation: introduction, status, build and boot workflow, configuration, repository map, runnable demos, architecture, and security/verification pages.

Project archives stay on the mdBook site but are excluded from the PDF manual: proposals, backlogs, research notes, whitepaper planning, and other planning records are useful context for maintainers, not the operator-facing manual.

Topics Index

This page is generated from document front matter fields during mdbook builds:

  • status
  • description
  • topics

Quick Orientation

Capabilities, IPC, and Authority

  • ABI Evolution PolicyCompatibility policy for capOS schema and ring ABIs.
  • Authority AccountingAuthority accounting rules for capability transfer and resource charges.
  • Cap’n Proto Error HandlingPrior-art on capnp-rpc error semantics.
  • Capability ModelCore capability object model, cap tables, schema interface IDs, grants, receiver metadata, and transfer.
  • Capability RingShared-memory capability ring ABI, dispatch paths, and completion semantics.
  • Capability-Infrastructure ClusterDecomposition of the near-term capability-infrastructure cluster: matured proposals and Stage 6 remainder that share the schema serial surface.
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n WebCloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
  • Crash Recovery and SupervisionUnplanned-failure detection, stale-cap propagation, structured crash records, watchdog liveness, and bounded restart policy for capOS services.
  • Debug and Trace AuthorityCapability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
  • Delegated Subject ContextFuture delegated-subject and act-on-behalf-of capability model.
  • Error HandlingCurrent error model for capability ring CQE status, CapException payloads, endpoint RETURN exceptions, and ordinary schema result unions.
  • Error HandlingTransport and application error model for capability calls and CQE results.
  • GenodeGenode OS Framework: capability-based component model, session routing, VFS plugin architecture, POSIX compatibility, and Sculpt OS – with lessons for capOS.
  • IPC and EndpointsEndpoint IPC, capability transfer, direct handoff, and shared-memory data paths.
  • Memory Authority ModelMemory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
  • OS Error HandlingCross-OS error-model comparison.
  • Rejected: Cap’n Proto SQE EnvelopeRationale for keeping ring SQEs fixed-layout instead of Cap’n Proto envelopes.
  • Rejected: Endpoint Badges as Service IdentityPost-mortem of the rejected seL4-style endpoint badge service identity model.
  • Remote Session CapSet ClientsRemote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
  • Resource Accounting and QuotasResource profiles, quota ledgers, donation, reservation, and fail-closed accounting semantics.
  • Schema RegistryA SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
  • Service ArchitectureCapability-based service composition, authority-at-spawn, exports, and service graph policy.
  • Service Object Identity MigrationSuperseded large-chunk migration plan for service object identity, retained as historical context after the active direction changed to session-bound invocation context.
  • Session ContextCurrent session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
  • Session-Bound Invocation ContextImplementation plan for one-session-per-process invocation context and session-keyed shared services.
  • Session-Bound Invocation ContextSession-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
  • Spritely, OCapN, and CapTPSpritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
  • Stage 6 Capability SemanticsStage 6 capability work.
  • Standard App CapabilitiesPer-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
  • Superseded: Service Object CapabilitiesSuperseded service-minted object capability model that was replaced by session-bound invocation context.
  • System Info CapabilitySystemInfo capability for MOTD, hostname, host metadata, help topics, and shell bundle integration.
  • System Manual CapabilityA built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
  • Time and Clock AuthorityCapability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.
  • Userspace Authority BrokerUserspace shell-bundle broker and lifecycle-control authority model.
  • ZirconFuchsia Zircon kernel: handle-based capability model, channels, VMARs/VMOs, async ports, and FIDL – with lessons for capOS capability dispatch, IPC, and memory design.

Boot, Manifests, and Init

  • Boot FlowKernel boot, manifest handoff, init launch, and QEMU boot-proof flow.
  • Boot to ShellLogin, setup, session, credential, and broker path from boot into the native shell.
  • Cloud Image Import and Serial-Console BootCloud provider disk-image import and serial-console-boot notes.
  • Cloud MetadataCloud metadata and config-drive bootstrap through scoped configuration capabilities.
  • ConfigurationHow operators extend the default capOS boot manifest with a gitignored system.local.cue overlay and convert CUE-authored data to specified Cap’n Proto schemas.
  • Hardware, Boot, and StorageHardware bring-up backlog.
  • Installable SystemOrdered implementation track turning the installable-system proposal into work grounded in the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts.
  • Installable SystemDesign for an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.
  • Manifest and Service StartupManifest encoding, service graph validation, bootstrap grants, and init-side spawning.
  • Run Targets, Init Mandate, and Default-Run IntegrationRun-target governance.
  • Stateful Task and Job GraphsDurable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
  • System Configuration and Operator ExtensibilityLayered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.

Process Model, Threading, and Scheduling

Memory and Resource Accounting

  • Cloud DMA Provider Evidence InventoryOfficial AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
  • Cloud Driver Foundation Gap AnalysisGap analysis between the existing userspace virtio driver foundation and the blocked cloud NIC/storage driver tasks: what is already proven, the narrow per-task remaining work, and the superseded live-NIC runnable-now claim.
  • Device Manager RefactorRefactor direction for separating the kernel device authority ledger from QEMU proof scaffolding.
  • DMA Assurance ModelAssurance model for DMA authority, backend selection, and proof obligations.
  • DMA IsolationDMA isolation model for device memory, IOMMU policy, and capability-scoped hardware access.
  • DMA User-Space Driver IsolationDMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
  • Go VirtualMemory ContractVirtualMemory cap contract for Go.
  • IOMMU Remapping GroundingPrimary-source grounding for Intel VT-d (landed under cfg(qemu)), AMD-Vi, and QEMU IOMMU remapping work.
  • Memory Authority ModelMemory authority model backlog.
  • Memory Authority ModelMemory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
  • Memory ManagementPhysical frames, address spaces, user buffers, MemoryObject, and VirtualMemory contracts.
  • NVMe Model B Doorbell DMA ValidatorConditional DMA-address ownership model for the userspace NVMe storage provider: provider-written queue-base and PRP/SGL addresses require a non-host-physical device-visible namespace; no-IOMMU GCP planning must use brokered bounce address publication instead.
  • OOM Handling and SwapMemory-pressure, OOM, anonymous-memory budgeting, and optional encrypted swap policy.
  • Resource Accounting and QuotasResource profiles, quota ledgers, donation, reservation, and fail-closed accounting semantics.
  • virtio-rngProvenance map for the in-tree virtio-rng entropy device - spec basis, implemented wire-format subset, and its role as a QEMU-only DDF metadata and IOMMU-remapping hardware-DMA proof fixture (no userspace-facing capability, not a production driver).

Userspace Runtime, Languages, and Binaries

  • Browser Capability and Agent Web SessionsBrowser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
  • Browser Engines, Document Engines, and Agent BrowsersBrowser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
  • Browser/WASMBrowser-hosted capOS experiment using WebAssembly and worker-per-process isolation.
  • capOS SDK and Dual TransportcapOS front-door SDK crate with a transport abstraction for in-system and remote clients, plus crate-namespace publication.
  • capos-serviceUserspace service framework (Rust crate capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks.
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n WebCloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
  • Go RuntimeGo runtime plan for GOOS=capos, memory growth, TLS, scheduling, and networking.
  • IX-on-capOS HostingIX as a package corpus, content-addressed build/store model, and a capability-native build-service surface for capOS.
  • Language Support Status and PlansCurrent and planned programming-language support on capOS.
  • Linux Sandboxes and Virtualization for WorkloadsLinux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
  • LLVM TargetCustom LLVM target triple requirements: kernel on x86_64-unknown-none, userspace on x86_64-unknown-capos; calling conventions, TLS, relocations, and Go/C runtime porting.
  • Lua ScriptingCapability-scoped Lua runner with curated libraries and explicit grants.
  • POSIX AdapterPOSIX compatibility adapter (libcapos-posix) over the libcapos C-ABI substrate, with smallest-deps POSIX shell and DNS resolver as the first ports.
  • POSIX Adapter Dash PortPOSIX adapter Phase P1.4 (dash port) backlog – libcapos-posix file/dir/stdio/env/printf surface, dash vendoring + per-call-site patch, and the run-posix-shell-smoke harness.
  • Runtime, Networking, and ShellRuntime/network/shell backlog.
  • Scientific Agent-Lab Software StackScientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
  • Scientific Standard Package and Agent Lab CapabilitiesScientific standard package and agent-lab capability services for CAS, solvers, proof assistants, notebooks, and reproducible research environments.
  • Userspace BinariesNative userspace binary model, capos-rt authority handling, language runtimes, and compatibility adapters.
  • Userspace Runtimecapos-rt entry ABI, heap, CapSet lookup, ring client, and typed userspace capability clients.
  • WASI Host AdapterWASI host adapter as a userspace process whose imports are backed by typed capOS capabilities. Phase W.1 host-runtime scaffold landed 2026-05-05 19:12 UTC; Phase W.2 sub-slice 1 (wasm-host binary + empty-instantiation smoke + userspace-image budget bump) landed 2026-05-06 20:19 UTC; Phase W.2 sub-slice 2 (Preview 1 stdout-only imports plus probe-driven nosys=52 proof) landed 2026-05-07 08:03 UTC; Phase W.2 sub-slice 3 (Rust hello, wasi smoke + manifest-payload load path) landed 2026-05-07 09:36 UTC; Phase W.2 sub-slice 4 (C hello, wasi smoke) landed 2026-05-07 10:53 UTC and closes Phase W.2; Phase W.3 (per-instance CapSet plumbing + LaunchParameters bounded-text argv grant + wasi-cli-args smoke) landed 2026-05-07 18:25 UTC; Phase W.4 (random_get production-ready against the kernel EntropySource cap + wasi-random granted/ungranted smokes) landed 2026-05-07 20:09 UTC. A 2026-05-13 compatibility-import smoke promotes authority-free Preview 1 imports (clock_res_get(MONOTONIC), sched_yield, and stdio fd metadata/seek behavior); a 2026-05-13 bounded environment grant reflects initConfig.init.wasiEnv through environ_get / environ_sizes_get, with make wasi-env-negative-check covering count, per-entry, total-byte, and interior-NUL rejection; the refusal smoke (make run-wasi-preview1-refusals) proves nine representative blocked filesystem/socket imports fail closed with ERRNO_NOSYS = 52 (extended 2026-05-13 21:15 UTC to cover fd_pread, fd_pwrite, path_create_directory, sock_shutdown in addition to the original five). Open Questions §1 (per-instance vs per-process) and §3 (poll_oneoff semantics) resolved 2026-05-13 16:46 UTC; §6 (environ_get source) and §7 (args_get source) reclassified as resolved by Phase W.3 with the bounded manifest-text grants. W.5 (filesystem) closed 2026-05-17 05:42 UTC: the wasm-host installs the manifest-granted root Directory cap (CapSet slot root) as a single Preview 1 preopen at fd 3 (/preopen-0) and implements path_open, fd_read, fd_write, fd_seek, fd_close, fd_filestat_get, fd_prestat_get, and fd_prestat_dir_name against the kernel Directory / File cap interface in capos-wasm/src/wasi/fs.rs (POSIX P1.4 Slice 4 resolver shape); fd_readdir over the preopen Directory.list landed 2026-05-24 08:44 UTC; fd_tell (host-side position read) and fd_filestat_set_size (over File.truncate) landed 2026-05-24 09:34 UTC, completing the File-cap method triad with no schema change; path_create_directory and path_remove_directory (over Directory.mkdir/remove, same preopen sandbox, no schema change) landed 2026-05-24 10:09 UTC; fd_pread and fd_pwrite landed 2026-05-30 14:49 UTC as positional I/O over the host File cap (no schema change – File.read/File.write already carry an explicit offset), using the WASI-supplied offset and leaving the fd’s stream position untouched (the positional-I/O invariant). path_filestat_get and path_unlink_file landed 2026-05-30 as path-resolved metadata/removal over the host File.stat / Directory.remove caps (no schema change), leaving only path_filestat_set_times, path_rename, and the symlink/link family fail-closed. The make run-wasi-fs smoke (system-wasi-fs.cue, demos/wasi-fs/, tools/qemu-wasi-fs-smoke.sh) completes a full path_open(CREAT+TRUNC) / fd_write / fd_close / re-open / fd_filestat_get / fd_seek / fd_read round trip, asserts the preopen sandbox refuses absolute paths and .. segments with ERRNO_NOTCAPABLE = 76, proves the positional fd_pwrite/fd_pread round trip leaves the offset unchanged plus the negative-offset and stdio refusals, and stats smoke.txt by path (size 4, regular-file type) before unlinking it; the existing make run-wasi-preview1-refusals smoke continues to pass with W.5-split errnos (path_open / fd_prestat_get / fd_read / path_create_directory / fd_pread / fd_pwrite / path_filestat_get / path_unlink_file now return ERRNO_BADF = 8 against an absent preopen, only the socket imports stay at ERRNO_NOSYS = 52). Store / Namespace integration remains deferred. W.6 (sockets) remains blocked on the userspace network stack. W.7 (Component Model) and W.8 (TinyGo / Go-on-WASI CUE evaluator) remain blocked on the std-userspace decision.

Shells and Interactive Surfaces

  • Boot to ShellLogin, setup, session, credential, and broker path from boot into the native shell.
  • Browser Capability and Agent Web SessionsBrowser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
  • Browser Engines, Document Engines, and Agent BrowsersBrowser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
  • capOS-Hosted Agent SwarmscapOS-hosted OpenClaw-like personal agents, agent swarms, harness controls, memory, retrieval, and research agenda.
  • Chat As Multimedia SubstrateChat as unified text/audio/video multimedia transport across human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping.
  • Default User AvatarDeterministic default user avatar derived from a stable account identifier, with explicit user override.
  • Interactive Command SurfacesStructured command-session model for native interactive applications over typed invocations.
  • Language Models and Agent RuntimeLanguage-model, embedder, agent-runner, and browser-agent capability interfaces.
  • Realtime Voice Agent ShellRealtime audio agent shell model across browser media, provider sessions, and brokered tools.
  • Remote Session CapSet ClientsRemote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
  • Schema RegistryA SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
  • ShellNative, agent-oriented, and POSIX shell models over explicit capability grants.
  • SSH Shell GatewaySSH terminal gateway design preserving TerminalSession and broker-issued shell boundaries.
  • Stateful Task and Job GraphsDurable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
  • System Info CapabilitySystemInfo capability for MOTD, hostname, host metadata, help topics, and shell bundle integration.
  • System Manual CapabilityA built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
  • Telnet over TLS ShellOptional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.

Networking

  • Azure MANAProvenance map for the Azure MANA NIC / GDMA wire logic - spec basis, implemented host-conformance wire-format subset, and capOS authority mapping.
  • Browser Capability and Agent Web SessionsBrowser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
  • capOS SDK and Dual TransportcapOS front-door SDK crate with a transport abstraction for in-system and remote clients, plus crate-namespace publication.
  • capos-serviceUserspace service framework (Rust crate capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks.
  • Chat As Multimedia SubstrateChat as unified text/audio/video multimedia transport across human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping.
  • Cloud DMA Provider Evidence InventoryOfficial AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n WebCloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
  • GCE gVNICProvenance map for the GCE gVNIC (Google Virtual Ethernet) NIC - spec basis from the public gVNIC docs and the GVE Linux driver, the wire-format subset capOS exercises today, and the bounded Nic-cap adaptation proof. capOS has live-GCE inventory, admin-queue/register, raw-frame GQI/QPL TX/RX, and typed Nic-adaptation proofs, but no reusable gVNIC provider service or host conformance suite yet.
  • Google Drive Storage BackendUse a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
  • Network Usability and Post-smoltcpNetwork usability, resolver, diagnostics, and post-smoltcp backlog.
  • Network-Reachable Datapath Scope DecisionScope decision recording that the real-GCE-boot milestone’s reachable-network-stack requirement means raw-frame TX/RX (Option A), not L4 sockets, grounded in what the billable cloudboot harness actually gates on.
  • NetworkingNetwork capability architecture from virtio-net smoke to TCP sockets and terminal handoff.
  • Phase C Userspace NIC Driver RelocationPhase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-Data Nic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receive Nic ABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path.
  • PingoraProxy/server framework as a userspace runtime case study.
  • Remote Session CapSet ClientRemote session CapSet client backlog.
  • Remote Session CapSet ClientsRemote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
  • Spritely, OCapN, and CapTPSpritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
  • SSH Shell GatewaySSH terminal gateway design preserving TerminalSession and broker-issued shell boundaries.
  • Telnet over TLS ShellOptional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.
  • virtio-netProvenance map for the in-tree modern virtio-net PCI NIC - spec basis, implemented wire-format subset, and capOS authority binding.

Storage, Persistence, and Naming

  • Cloud DMA Provider Evidence InventoryOfficial AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
  • Google Drive Storage BackendUse a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
  • Hardware Audit Log PersistenceDurable, tamper-evident persistence and admission policy for the hardware audit log.
  • Hardware, Boot, and StorageHardware bring-up backlog.
  • Installable SystemOrdered implementation track turning the installable-system proposal into work grounded in the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts.
  • Installable SystemDesign for an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.
  • IX-on-capOS HostingIX as a package corpus, content-addressed build/store model, and a capability-native build-service surface for capOS.
  • Standard App CapabilitiesPer-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
  • Stateful Task and Job GraphsDurable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
  • Storage and NamingCapability-native storage, namespaces, boot packages, volumes, and persistence model.
  • Volume EncryptionEncryption-at-rest model for system and user volumes with recovery and KMS options.

Identity, Policy, and User Accounts

  • ConfigurationHow operators extend the default capOS boot manifest with a gitignored system.local.cue overlay and convert CUE-authored data to specified Cap’n Proto schemas.
  • Default User AvatarDeterministic default user avatar derived from a stable account identifier, with explicit user override.
  • Delegated Subject ContextFuture delegated-subject and act-on-behalf-of capability model.
  • Formal MAC/MICFormal mandatory access and integrity model for future policy and proof work.
  • Google Drive Storage BackendUse a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
  • Local Users, Storage, and PolicyIdentity/local-user backlog.
  • OIDC and OAuth2Federated login, OAuth2 clients, token capabilities, JWKS, DPoP, and broker integration.
  • Rejected: Endpoint Badges as Service IdentityPost-mortem of the rejected seL4-style endpoint badge service identity model.
  • Remote Session CapSet ClientRemote session CapSet client backlog.
  • Remote Session CapSet ClientsRemote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
  • Service Object Identity MigrationSuperseded large-chunk migration plan for service object identity, retained as historical context after the active direction changed to session-bound invocation context.
  • Session ContextCurrent session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
  • Session-Bound Invocation ContextImplementation plan for one-session-per-process invocation context and session-keyed shared services.
  • Session-Bound Invocation ContextSession-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
  • Standard App CapabilitiesPer-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
  • System Configuration and Operator ExtensibilityLayered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.
  • User Identity and PolicyUser, session, profile, RBAC/ABAC/MAC, and policy-layer model for capability grants.

Cryptography, Certificates, and Trust

  • Certificates / TLSBounded implementation slice chain for the certificates/TLS track, from vendored verifier crates to a capOS-terminated Web UI endpoint.
  • Certificates and TLSCapability-native X.509, trust store, ACME, pinning, and TLS configuration model.
  • Cryptography and Key ManagementCapability model for keys, signing, encryption, vaults, entropy, and cryptographic policy.
  • Google Drive Storage BackendUse a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
  • Hardware Audit Log PersistenceDurable, tamper-evident persistence and admission policy for the hardware audit log.
  • OIDC and OAuth2Federated login, OAuth2 clients, token capabilities, JWKS, DPoP, and broker integration.
  • Telnet over TLS ShellOptional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.
  • Time and Clock AuthorityCapability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.
  • Volume EncryptionEncryption-at-rest model for system and user volumes with recovery and KMS options.

Security and Verification

  • ABI Evolution PolicyCompatibility policy for capOS schema and ring ABIs.
  • AWS Nitro EBS (NVMe storage)Provenance map for the AWS Nitro EBS NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
  • Azure managed disk (NVMe storage)Provenance map for the Azure managed-disk NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, why the older-family virtio-scsi path is out of scope, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
  • Cloud DMA Provider Evidence InventoryOfficial AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
  • Cloud Driver Foundation Gap AnalysisGap analysis between the existing userspace virtio driver foundation and the blocked cloud NIC/storage driver tasks: what is already proven, the narrow per-task remaining work, and the superseded live-NIC runnable-now claim.
  • Debug and Trace AuthorityCapability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
  • Device Manager RefactorRefactor direction for separating the kernel device authority ledger from QEMU proof scaffolding.
  • DMA Assurance ModelAssurance model for DMA authority, backend selection, and proof obligations.
  • DMA IsolationDMA isolation model for device memory, IOMMU policy, and capability-scoped hardware access.
  • DMA User-Space Driver IsolationDMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
  • Error HandlingCurrent error model for capability ring CQE status, CapException payloads, endpoint RETURN exceptions, and ordinary schema result unions.
  • Formal MAC/MICFormal mandatory access and integrity model for future policy and proof work.
  • Full-Scope Review 2026-06-09Findings ledger and decomposition source for the 2026-06-09 full-scope review of the tree at 50e8eaba (review base bb776326e, 2026-05-23).
  • GCP Persistent Disk (storage)Provenance map for the GCP Persistent Disk storage shape - virtio-scsi vs NVMe families, the standard-NVMe wire subset it shares with docs/devices/nvme.md, the capOS cloud-shape classification, the DMA-backend policy on no-IOMMU GCE shapes, the local production brokered NVMe provider chain, and the bounded live-GCE NVMe Persistent Disk read proof.
  • IOMMU Remapping GroundingPrimary-source grounding for Intel VT-d (landed under cfg(qemu)), AMD-Vi, and QEMU IOMMU remapping work.
  • Memory Authority ModelMemory authority model backlog.
  • Memory Authority ModelMemory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
  • NVMeProvenance map for the NVMe controller wire subset capOS touches - conditional Model B validator scan targets, the read-only userspace bind, the reset-only CC selected-write claim, the no-IOMMU manager-op controller enable through the brokeredNvmeControllerEnable @6 verb, the no-IOMMU manager-op admin IDENTIFY through the brokeredNvmeAdminIdentify @7 verb, the brokered admin SQ/CQ doorbell + IDENTIFY command, the split admin SUBMIT @8 / COMPLETE @9 verbs whose completion handoff runs through a cap-waiter Interrupt.wait/acknowledge MSI-X route, the brokered I/O queue pair + bounded READ including one live-GCE Persistent Disk proof, and the dedicated BlockDevice data-completion Interrupt route - with spec basis and capOS authority mapping.
  • NVMe Model B Doorbell DMA ValidatorConditional DMA-address ownership model for the userspace NVMe storage provider: provider-written queue-base and PRP/SGL addresses require a non-host-physical device-visible namespace; no-IOMMU GCP planning must use brokered bounce address publication instead.
  • Panic Surface InventoryPanic/unwrap/expect inventory.
  • Public Release and Maintainer BoundariesPublic release posture, maintainer boundaries, issue intake, and repository hygiene gates.
  • Remote Session UI SecurityWeb-security hardening posture for the trusted local remote-session-ui bridge, the capOS-served Web UI, public-origin carry-over policy, and the Tauri desktop wrapper.
  • Repository CompositionRepository scope, sibling project split criteria, and cross-repository organization plan.
  • Security and VerificationSecurity/verification backlog.
  • Security and VerificationSecurity review vocabulary, trust-boundary checklist, and verification tracks for capOS.
  • Security Verification Track RegistryManual reference for Security Verification Track labels.
  • Session Archive & Gantt EffortA pipeline to collect, normalize, and archive per-task effort data from the run-telemetry log and agent session transcripts, enabling development timeline visualization and task-duration prediction.
  • Trust BoundariesThe reviewer’s authority-boundary inventory.
  • Trusted Build InputsTrusted toolchain inventory.
  • Verification WorkflowThe verification gates used by capOS.

Services, Operations, and Monitoring

  • BenchmarksCurrent benchmark policy and results.
  • Capability-Infrastructure ClusterDecomposition of the near-term capability-infrastructure cluster: matured proposals and Stage 6 remainder that share the schema serial surface.
  • capos-serviceUserspace service framework (Rust crate capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks.
  • Cloud DeploymentCloud VM deployment plan covering hardware abstraction, storage, networking, and aarch64.
  • Cloud MetadataCloud metadata and config-drive bootstrap through scoped configuration capabilities.
  • ConfigurationHow operators extend the default capOS boot manifest with a gitignored system.local.cue overlay and convert CUE-authored data to specified Cap’n Proto schemas.
  • Crash Recovery and SupervisionUnplanned-failure detection, stale-cap propagation, structured crash records, watchdog liveness, and bounded restart policy for capOS services.
  • Debug and Trace AuthorityCapability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
  • Hardware Audit Log PersistenceDurable, tamper-evident persistence and admission policy for the hardware audit log.
  • HPC Parallel Processing PatternsGeneric single-node and multi-node parallel processing patterns for HPC-style benchmark coverage.
  • Live UpgradeService replacement, capability retargeting, quiesce/resume, and in-flight call handling.
  • Rejected: Endpoint Badges as Service IdentityPost-mortem of the rejected seL4-style endpoint badge service identity model.
  • Scientific Standard Package and Agent Lab CapabilitiesScientific standard package and agent-lab capability services for CAS, solvers, proof assistants, notebooks, and reproducible research environments.
  • Service ArchitectureCapability-based service composition, authority-at-spawn, exports, and service graph policy.
  • Session ContextCurrent session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
  • Session-Bound Invocation ContextSession-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
  • Stateful Task and Job GraphsDurable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
  • Superseded: Service Object CapabilitiesSuperseded service-minted object capability model that was replaced by session-bound invocation context.
  • System Configuration and Operator ExtensibilityLayered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.
  • System MonitoringCapability-scoped logs, metrics, health checks, traces, crash records, and status views.
  • System Performance BenchmarksCorrectness-gated benchmark model for primitives, workloads, and user stories.
  • Time and Clock AuthorityCapability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.

AI, Agents, GPU, and Robotics

Demos, Onboarding, and Contributor Surfaces

Build, Tooling, and Documentation Site

  • ABI Evolution PolicyCompatibility policy for capOS schema and ring ABIs.
  • Build, Boot, and TestBuild, ISO, QEMU, host-test commands.
  • capOS Agentic Development ExperimentLongitudinal study design for using capOS development sessions, subagents, reviews, and recap tooling as an agentic software-engineering experiment.
  • capOS Repository Harness EngineeringRepository-local harness engineering for making capOS legible, checkable, and safer for long-running coding agents.
  • Current Design AuthorityCurrent-design authority map and proposal lifecycle rule for keeping implemented behavior out of archival proposal records.
  • Documentation WorkflowHow the mdBook site and generated PDF manual are positioned and built.
  • mdBook Documentation SiteDocumentation-site structure, metadata, status vocabulary, and curation workflow.
  • Repository CompositionRepository scope, sibling project split criteria, and cross-repository organization plan.
  • Repository MapSource-tree subsystem index.
  • Schema RegistryA SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
  • System Manual CapabilityA built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
  • Trusted Build InputsTrusted toolchain inventory.

Research and Papers

  • Crash Recovery and SupervisionPrior-art survey of crash recovery and supervision for the Crash Recovery proposal.
  • Debug, Trace, and Profiling AuthorityPrior-art survey of debug/trace/profile authority for the Debug and Trace proposal.
  • PapersLong-form research write-ups.
  • ResearchIndex of research deep-dive reports informing capOS design.
  • seL4 HAMREvaluation of seL4 HAMR (AADL/Slang/CAmkES) versus the capOS Cap’n Proto schema-as-contract model.
  • Time and Clock AuthorityPrior-art survey of OS time/clock authority for the Time and Clock proposal.

Prior Art and Comparative OS Research

  • Capability-Based and Microkernel Operating Systems SurveyDesign consequences pulled from the survey.
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n WebCloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
  • EROS, CapROS, CoyotosPersistent capability-system lineage.
  • Future Scheduler ArchitectureSurvey of modern scheduler algorithms and architectures for capOS scheduler evolution.
  • Game Mechanics Prior ArtGrounded mechanics research for Aurelian Frontier seasonal play, markets, construction, and tactical combat.
  • GenodeGenode OS Framework: capability-based component model, session routing, VFS plugin architecture, POSIX compatibility, and Sculpt OS – with lessons for capOS.
  • HPC Parallel PatternsHPC benchmark and programming-model grounding for generic parallel processing patterns.
  • Linux Sandboxes and Virtualization for WorkloadsLinux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
  • Out-of-Kernel SchedulingPrior art survey on kernel versus userspace CPU scheduling policy split, with capOS design implications.
  • Plan 9 and InfernoPlan 9 and Inferno: per-process namespaces, 9P protocol, file-server-as-service pattern, Dis VM, and Limbo concurrency — applied to capOS capability composition and IPC design.
  • Scientific Agent-Lab Software StackScientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
  • seL4Microkernel and capability reference.
  • Spritely, OCapN, and CapTPSpritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
  • ZirconFuchsia Zircon kernel: handle-based capability model, channels, VMARs/VMOs, async ports, and FIDL – with lessons for capOS capability dispatch, IPC, and memory design.

Stage Backlogs and Long-Form Planning

Capabilities And Security

  • POSIX fork/execve fd InheritanceTarget POSIX fork/execve full-fd-table inheritance for the recording shim, reconciled with the capability model, so unmodified POSIX software inherits stdio/cwd without bespoke per-app dup2 patches.

Hardware

  • Network-Reachable Datapath Scope DecisionScope decision recording that the real-GCE-boot milestone’s reachable-network-stack requirement means raw-frame TX/RX (Option A), not L4 sockets, grounded in what the billable cloudboot harness actually gates on.
  • Phase C Userspace NIC Driver RelocationPhase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-Data Nic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receive Nic ABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path.
  • Real-Filesystem DecisionReal-filesystem direction for capOS: a role-split between capnp-native managed state and read-only FAT32 for host-populated/interop images, with ext4-read deferred and FAT write rejected, grounded in the existing Directory/File/Store cap surface and the storage layouts already in tree.

Hardware And Drivers

  • ATAPI CD-ROM + ISO 9660Provenance map for the planned CD-ROM boot/install ATAPI PIO reader and read-only ISO 9660 driver - spec basis, implemented wire-format subset, and boot-only kernel-owned capOS mapping.
  • AWS Nitro EBS (NVMe storage)Provenance map for the AWS Nitro EBS NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
  • Azure MANAProvenance map for the Azure MANA NIC / GDMA wire logic - spec basis, implemented host-conformance wire-format subset, and capOS authority mapping.
  • Azure managed disk (NVMe storage)Provenance map for the Azure managed-disk NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, why the older-family virtio-scsi path is out of scope, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
  • Device Driver SpecificationsPer-device driver specs - cited authoritative spec, implemented wire-format subset, and capOS authority mapping.
  • Device Spec TemplateBlank three-part device-spec template - copy to docs/devices/.md when starting a driver.
  • DMA User-Space Driver IsolationDMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
  • FAT32 (read-only backer)Provenance map for the read-only FAT32 Directory/File backer over virtio-blk and NVMe - spec basis, the vendored fatfs read subset used, timestamp provenance limits, and the capOS cap mapping.
  • GCE gVNICProvenance map for the GCE gVNIC (Google Virtual Ethernet) NIC - spec basis from the public gVNIC docs and the GVE Linux driver, the wire-format subset capOS exercises today, and the bounded Nic-cap adaptation proof. capOS has live-GCE inventory, admin-queue/register, raw-frame GQI/QPL TX/RX, and typed Nic-adaptation proofs, but no reusable gVNIC provider service or host conformance suite yet.
  • GCP Persistent Disk (storage)Provenance map for the GCP Persistent Disk storage shape - virtio-scsi vs NVMe families, the standard-NVMe wire subset it shares with docs/devices/nvme.md, the capOS cloud-shape classification, the DMA-backend policy on no-IOMMU GCE shapes, the local production brokered NVMe provider chain, and the bounded live-GCE NVMe Persistent Disk read proof.
  • NVMeProvenance map for the NVMe controller wire subset capOS touches - conditional Model B validator scan targets, the read-only userspace bind, the reset-only CC selected-write claim, the no-IOMMU manager-op controller enable through the brokeredNvmeControllerEnable @6 verb, the no-IOMMU manager-op admin IDENTIFY through the brokeredNvmeAdminIdentify @7 verb, the brokered admin SQ/CQ doorbell + IDENTIFY command, the split admin SUBMIT @8 / COMPLETE @9 verbs whose completion handoff runs through a cap-waiter Interrupt.wait/acknowledge MSI-X route, the brokered I/O queue pair + bounded READ including one live-GCE Persistent Disk proof, and the dedicated BlockDevice data-completion Interrupt route - with spec basis and capOS authority mapping.
  • virtio-blkProvenance map for the QEMU-fixture virtio-blk BlockDevice driver - spec basis, implemented wire-format subset, capOS authority binding, and why it is a qemu-gated fixture rather than the production storage route.
  • virtio-netProvenance map for the in-tree modern virtio-net PCI NIC - spec basis, implemented wire-format subset, and capOS authority binding.
  • virtio-rngProvenance map for the in-tree virtio-rng entropy device - spec basis, implemented wire-format subset, and its role as a QEMU-only DDF metadata and IOMMU-remapping hardware-DMA proof fixture (no userspace-facing capability, not a production driver).

Programming Languages And Runtimes

  • POSIX fork/execve fd InheritanceTarget POSIX fork/execve full-fd-table inheritance for the recording shim, reconciled with the capability model, so unmodified POSIX software inherits stdio/cwd without bespoke per-app dup2 patches.

Remote Session

  • Remote Session CapSet ClientsRemote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
  • Remote Session UI SecurityWeb-security hardening posture for the trusted local remote-session-ui bridge, the capOS-served Web UI, public-origin carry-over policy, and the Tauri desktop wrapper.

Security

  • Phase C Userspace NIC Driver RelocationPhase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-Data Nic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receive Nic ABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path.

Storage

  • FAT32 (read-only backer)Provenance map for the read-only FAT32 Directory/File backer over virtio-blk and NVMe - spec basis, the vendored fatfs read subset used, timestamp provenance limits, and the capOS cap mapping.
  • Real-Filesystem DecisionReal-filesystem direction for capOS: a role-split between capnp-native managed state and read-only FAT32 for host-populated/interop images, with ext4-read deferred and FAT write rejected, grounded in the existing Directory/File/Store cap surface and the storage layouts already in tree.
  • virtio-blkProvenance map for the QEMU-fixture virtio-blk BlockDevice driver - spec basis, implemented wire-format subset, capOS authority binding, and why it is a qemu-gated fixture rather than the production storage route.

Roadmap

Long-term direction for capOS. Related material lives elsewhere: detailed task decomposition in docs/backlog/, selected-milestone state in docs/tasks/state.toml, current execution order in root task records under docs/tasks/, and shipped-milestone reports in docs/changelog.md.

Current Direction

Current selected milestone: GCE Self-Hosted Web UI.

The next visible goal is a self-hosted capOS Web UI reachable through the Phase C userspace network stack, then proved on private GCE reachability before any public endpoint. The userspace smoltcp-backed TcpListenAuthority local path is proved by cloud-prod-userspace-network-stack-smoltcp-local-proof. The local DHCP/IPv4 configuration proof is done by cloud-prod-network-stack-dhcp-ipv4-config-local-proof: the userspace stack acquires a QEMU SLIRP DHCPv4 lease, installs the default route, resolves gateway and same-subnet ARP neighbors, and serves NetworkManager.getConfig before public or live GCE exposure. The cloudboot-local Web UI authority inventory is done by remote-session-webui-cloudboot-authority-inventory: it records the required and forbidden remote-session-web-ui grants, trusted listener/source metadata, browser-visible forbidden markers, and local L4 proof markers for the completed cloudboot proof. Server-side session hardening is done by remote-session-web-ui-session-hardening (Review C high closed: unpredictable rotated server-side session ids, idle/absolute expiry enforced before dispatch, Host/Origin/double-submit-CSRF gates, and a Secure-when-HTTPS cookie posture). Web UI connection bounds are done by remote-session-web-ui-connection-bounds (per-connection request-read/response-send deadlines in the Web UI client over the bounded network-stack listener, with a drip-feed abandon proof). The legacy kernel socket-path retirement is done by cloud-prod-legacy-kernel-network-socket-path-retirement: non-qemu production manifests reject kernel network_manager / tcp_listen_authority grants, leaving those sources as qemu-only fixtures. The local cloud-prod-remote-session-web-ui-l4-local-proof is the done service-level L4 proof on top of the userspace L4 and DHCP/IPv4 substrate. The legacy-virtio serving gap is closed locally by cloud-gce-legacy-virtio-webui-serving-local-proof (2026-06-11): a kernel-brokered legacy virtio 0.9 runtime backs the typed Nic cap and a host HTTP peer fetches the byte-verified UI bundle under disable-modern=on. A public-ingress hardening set is done on the L4 gate (public-origin policy, IAP-aware SameSite cookie policy, JSON content-type guard, security response headers and strict CSP, GFE-range-pinned forwarded-scheme trust, the public /healthz contract, and in-guest login peer-gate/backoff hardening), and a no-spend provider-harness fixture set is done (private --preflight-only, private/public proof-evidence validators, public ingress plan gate, journal-driven teardown engine, provider-command allowlist gate) — all local QEMU/cloudboot or recording-stub fixture evidence with no real provider invocation or mutation; the current ladder summary lives in Current Status. cloud-gce-private-self-hosted-webui-proof remains on hold: the cloudtest credential lacks the firewall IAM a private same-VPC probe needs against GCE default-deny ingress, and the live run needs per-run billable authorization. Public GCE ingress and TLS remain under the explicit on-hold cloud-gce-public-self-hosted-webui-ingress-tls task and require separate authorization; the selected milestone does not grant public exposure, broad firewall changes, TLS key custody, or production release authority. The capOS-terminated TLS successor remains a separate later evidence class behind the provider-terminated first public proof.

The previous selected milestone, Installable System, is complete through commit 12b8334a (commit timestamp 2026-06-07 18:19 UTC; task closeout 2026-06-07 18:20 UTC) for the bounded local/QEMU contract: persistent data-region mount, config-overlay compose/merge fallback, generation/rollback machinery, integrated installable disk packaging, target-disk install (make run-installable-install), first-boot provision (make run-installable-provision), update/rollback (make run-installable-update), and structural proposal/body wording reconcile are landed. The closeout preserves the RAM-only Namespace caveat and does not claim secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, userspace smoltcp/L4 readiness, or full durable account policy. Detailed decomposition lives in docs/backlog/installable-system.md.

The preceding selected milestone, Device Driver Foundation, is complete by the 2026-06-07 08:23 UTC production-authority closeout recorded in ddf-production-authority-closeout. That closeout ties together the landed provider-driver, interrupt, audit, and DMA-policy prerequisites and preserves the runtime fail-closed DMA backend baseline: remapping only when capOS can validate it, otherwise brokered bounce buffers or unsupported. The related GCP-first provider NIC/storage rollup is also closed by cloud-usable-instance-provider-nic-storage (2026-06-07 05:26 UTC), but only for the recorded operator serial path, selected raw-frame NIC/storage evidence, and gVNIC portability evidence. Public L4 ingress, AWS/Azure live support, direct-remapping production hardware, device-autonomous MSI-X delivery, userspace smoltcp/L4 readiness, and high-throughput or multiqueue NIC readiness remain explicit future follow-ups, not part of the closed DDF selected milestone.

The previous selected milestone, In-Process Threading Scalability, is complete at commit 136b72de (2026-05-01 14:58 UTC) after repairing the benchmark validity issue found on 2026-05-01: the old 1 MiB/spinning-parent workload was not a valid four-core scaling reference because the matching Linux pthread baseline also stayed flat at four workers. The repaired shape now uses a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64. The controlled capOS/Linux pair on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556 (5 runs each, both pinned to physical-core logical CPUs 0,1,2,3) recorded capOS 1-to-2 work/total speedups 1.883x / 1.787x and matching Linux pthread baseline 1.988x/1.987x. Its 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy: capOS sat at 1.566x/1.538x while Linux scaled to 3.963x/3.858x on the same physical-core pin set. Phase D WFQ has now closed that diagnostic gap as a scheduler-evolution milestone, recording capOS 3.088x/2.700x and Linux 3.974x/3.850x on 2026-05-10. These rows are summarized in docs/benchmarks.md and docs/changelog.md. Historical pre-collapse 1-to-2 (1.828x/1.687x) and the post-collapse 3-run diagnostic remain in docs/benchmarks.md for reference. Ordinary -smp 2 regression coverage also passed.

The previous selected milestone, Multi-Process SMP Concurrency, is complete at commit 3fb89923 (2026-04-30 09:45 UTC): make run-smp-process-scale has repeated KVM-backed evidence for independent CPU-bound worker processes with 1.608x 1-to-2 speedup, and the ordinary run-smoke/run-spawn coverage passed under -smp 2.

The previous selected milestone, Session-Bound Invocation Context, is complete: normal workload processes have one immutable live session context, endpoint calls reveal only privacy-preserving caller-session metadata by default, explicit subject disclosure is gated by request and scope, and chat/adventure/terminal/stdio paths no longer derive ordinary caller identity from caller-selected service-visible metadata. Gate 4 verification is recorded at commit faeff80 (2026-04-29 21:39 UTC), and paper/status closeout is merged at commit 503abc9. Follow-up session lifecycle work remains outside that completed milestone: production interactive shells need mutable session liveness cells, explicit logout/close propagation, and renewal/recovery paths so fixed short expiry is not the only way to bound stale authority.

Username-aware local password login is prioritized ad-hoc implementation work, not the selected milestone, unless explicitly selected later.

Current priority ladder, reflecting user direction (2026-05-05 17:56 UTC redirect supersedes the earlier SMP/threading-first ladder; the previous ordering is retained as background only at the end of this section):

  1. Userspace driver transition prerequisites – the S.11.2 hostile-smoke gate items in docs/dma-isolation-design.md and the matching open items of docs/backlog/hardware-boot-storage.md Task 3 are now closed. S.11.2.7 stale IRQ after revoke/reset closed 2026-05-05 18:17 UTC via real-INT $vector cross-reset injection in make run-net. S.11.2.8 stale DMA completion after revoke/reset closed 2026-05-05 19:37 UTC via the device-manager prove_qemu_stale_dma_completion_handoff proof in make run-net: real virtio-net DMA page free + reallocate cycle bumps the live ledger’s page generation at three boundaries (after revoke, after detach, after reset/reuse), then a synthesized stale DeviceDmaAllocation is fed to the production device_dma::record_virtio_net_completion_for_allocation path and rejected as stale-dma-handle with side-effect blocking. S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC by aggregating every hostile-smoke acceptance matrix proof line into the make run-net -> tools/qemu-net-smoke.sh gate, including the newly wired device-manager: devicemmio driver crash hook proof and device-manager: interrupt driver crash hook proof assertions. The manifest-granted DMAPool path currently exposes eight fixed manager-owned bounce-buffer DMABuffer result caps with typed allocate/free/map/unmap/submit/complete surfaces; DMABuffer.unmap removes only the caller’s borrowed userspace VMA and preserves pool/page and descriptor accounting, and accepted submitDescriptor now writes a bounded provider-owned queue entry plus submit marker after authority validation and the submit scrub. The manifest-granted DeviceMmio path now exposes a read-only borrowed userspace VMA over boot-preseeded BAR pages, with explicit DeviceMmio.unmap, duplicate-map/no-op-unmap denials, revoke-before-detach cleanup, brokered read-only read32, and one bounded write32 effect for the provider-scoped PCI MSI-X metadata-derived virtio-rng vector-control mask dword, while arbitrary register writes, doorbells, host physical/IOVA exposure, and production provider-driver consumers remain blocked. The remaining gating prerequisites for moving NIC/block drivers out of the kernel are production userspace DMAPool/DeviceMmio/Interrupt handles, real device-manager page quiesce/scrub/release hooks, real userspace Interrupt waiter objects, and durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap. IOMMU domain programming has landed for the bounded QEMU Intel remapping path (umbrella closed 2026-05-23 23:35 UTC); production-hardware IOMMU programming, AMD-Vi, and trusted sharing groups remain future work. The device-manager refactor proposal is already on main at commit 77358400; treat its proof/handles/domain/transaction-helper splits as high-priority, behavior-preserving risk reduction only when they unblock or lower risk for those DDF authority gates. It remains subordinate to behavior-moving DDF slices and the scheduler SMP/nohz prerequisite chain.
  2. Scheduler evolution in docs/backlog/scheduler-evolution.md: Phase D best-effort fair scheduling closed at commit 77caafc0 (2026-05-10 19:39 UTC) and docs commit 1a08ec23 (2026-05-10 21:47 UTC). The WFQ slice uses per-thread vruntime accounting, SchedulingPolicyCap weight/latency-class authority, per-CPU WFQ run queues, and bounded steal/migration invariants. The controlled Task 6 benchmark pair materially closed the 1-to-4 thread-scale diagnostic gap: capOS recorded work/total speedups 3.088x / 2.700x versus the prior 1.566x / 1.538x baseline, while Linux on the same host/pin set recorded 3.974x / 3.850x. Phase E SchedulingContext capability follow-ups are now closed: endpoint donation/return and the scheduler-observable UserSession.logout() hook are merged; timeout/depletion notifications use fixed per-context cells plus drain observer results; ordinary non-donated session-logout stale-context coverage is proven; donated receiver logout keeps the conservative counted/skipped policy until endpoint return restores only reduced donor budget; and clean local owner-shell exit calls the same UserSession.logout() path before process exit. Phase F auto-nohz / SQPOLL / tickless idle follows Phase E; the one-SQ-consumer ring ownership prerequisite, CpuIsolationLease scaffold, nohz activation/deactivation telemetry child, and explicit housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded producer-wake SQPOLL progress are complete. The telemetry proof records accepted active candidates, rejected activation decisions, stale/revoked rollback labels, ready and selected housekeeping CPUs, selected deferred-work placement or fail-closed reasons, target runnable entity counts, monotonic clocksource/accounting readiness, and explicit disabled tick/SQPOLL/full-nohz guardrails. The first two automatic nohz activation increments have since landed: the CpuIsolationLease preflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window with fail-closed rollback (docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md), and a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression with the SQPOLL ring-state re-check as the decisive rollback gate (docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). Timeout-based auto-revoke, generic full-nohz for explicitly budgeted compute leases, and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed; production policy-service issuance and broader userspace-poller/device-queue admission remain future work. The future full-SMP hardware scalability milestone is now recorded in the existing SMP/scheduler/benchmark/HPC proposal set and docs/backlog/scheduler-evolution.md Phase F.5. It targets direct high-core hardware/perf-runner rows at 1/2/4/8/16/32 workers, with QEMU kept for boot/regression and virtualization context rather than as the primary performance source. Phase G realtime islands follows Phase F. EEVDF is retained as a follow-on policy evaluation, not a Phase D blocker; generic full-nohz is landed for explicitly budgeted compute leases, with policy-service issuance still future.
  3. Language-support tracks remain active high-priority parallel work alongside the kernel/scheduler focus. POSIX adapter v0 P1.2 (UDP cap + dns.c) and P1.3 (Pipe cap + fork-for-exec + recording-shim posix_spawn) landed; the remaining v0 phase is P1.4 (dash port
    • libcapos-posix file/dir/stdio/env/printf surface + the run-posix-shell-smoke harness), which is in flight against the Storage Phase 3 RAM-backed File/Directory/Store/Namespace caps. P1.4 Slice 3 (FdBacking File/Directory/Terminal variants + make run-posix-file-backing-smoke) landed at ae58f936, and Slice 4 (absolute-path resolver + functional open()/opendir() over the bootstrap-granted root Directory cap with per-fd file position + make run-posix-open-smoke) landed at 94b29177. The file/directory fd closeout landed at commit f97d9833 (2026-05-23 06:23 UTC): make run-posix-file proves open(), write(), lseek(), read(), opendir(), readdir(), and closedir() through a live POSIX C process. Together these bring POSIX file I/O to functional end-to-end parity as the first non-shell POSIX subsystem. Identity stubs landed at commit 1a8a9896 (2026-05-23 06:51 UTC): make run-posix-identity proves parent and fork/exec child getpid lines with hardcoded uid/gid 0. The printf/string subset now has make run-posix-printf, which proves formatted output plus string/mem, numeric conversion, and ctype behavior from a live capOS C process. The signal/time surface landed at commit 90e64011 (2026-05-23 08:11 UTC): make run-posix-signal-time proves Timer-backed time, nanosleep, and sleep plus fail-closed signal-delivery stubs from a live capOS C process. Remaining P1.4 work is dash vendoring + smoke (Slices 11-13). Long-form decomposition lives in docs/backlog/posix-adapter-dash-port.md. WASI host adapter v0 W.1/W.2, Lua iteration follow-ons, libcapos / libcapos-posix successor work, and Go runtime stay in the parallel pool when selectable.
  4. Storage capability interfaces, starting with RAM-backed Store/Namespace; proceed to local disk and a small read-only filesystem when the block path and the userspace-driver gate are ready. Phase 2 (schema-only BlockDevice/File/Directory interfaces), Phase 3 slice 1 (minimal RAM-backed File CapObject with the KernelCapSource::file grant source and the make run-file-server-smoke proof), Phase 3 slice 2 (minimal RAM-backed Directory CapObject with the KernelCapSource::directory grant source, result-cap transfer of File/Directory handles, and the make run-directory-server-smoke proof), and Phase 3 slice 3 (the Store/Namespace schema interfaces plus minimal RAM-backed Store/Namespace CapObjects with the KernelCapSource::store/KernelCapSource::namespace grant sources, content-addressed blob storage, Namespace.sub() result-cap transfer, and the make run-store-namespace-smoke proof) have landed. The local-disk path has also reached its first read-only milestone: the first virtio-blk BlockDevice CapObject (make run-virtio-blk) and a read-only filesystem service over BlockDevice (kernel/src/cap/readonly_fs.rs, parsing a fixed CAPOSRO1 on-disk layout and serving Directory.list/open + File.read; make run-storage-fs) now serve a known on-disk tree to a userspace consumer. The Local Disk Storage Milestone’s final gate has also landed: a disk-backed persistent Store (kernel/src/cap/persistent_store.rs, a CAPOSST1 on-disk layout written through the virtio-blk driver, granted via the persistent_store KernelCapSource) with a two-pass reboot proof (make run-storage-persist) that stores+commits a capnp object on the first boot and reads it back on a fresh boot of the same disk image. The Writable Local Storage Milestone has now landed: directory/file mutation, the fail-closed concurrent-writer policy, clean-reboot durability for both filesystem mutations and co-located Store objects on one disk (kernel/src/cap/writable_fs.rs, a CAPOSWF1 sub-volume; two-pass proof make run-storage-writable), and a bounded unclean-shutdown recovery proof (make run-storage-writable-recovery): an induced forced poweroff in the record-written / superblock-pending window proves the next mount recovers to a consistent tree with the interrupted allocation atomically absent. See docs/proposals/storage-and-naming-proposal.md.
  5. Keep serial diagnostics as the first remote troubleshooting path for cloud/hardware bring-up, then add SSH, Telnet development access, and basic WebShell access when network and identity prerequisites are credible. The host-served remote-session UI remains separate from the self-served capOS web UI path. The old self-served proof target is retired with the qemu-only kernel TCP listener; the replacement proof is the future Phase C Web UI L4 gate. Ordinary make run still starts the host-local remote-session CapSet path, and the full boot-resource UI bundle is served with fixed names and integrity labeling. The host-served make remote-session-ui bridge remains a separate trusted development path, not the self-hosted cloud Web UI proof.
  6. Boot on GCP/AWS in staged provider tracks. The first GCP serial-console boot proof landed as run 1778230874-715a (2026-05-08 09:06 UTC, source commit 3951e275). The GCP-first usable-instance provider rollup is also closed: serial-console operator access, live virtio-net raw-frame provider-nic-bound, live NVMe Persistent Disk brokered READ, and separate gVNIC raw-frame / typed-Nic portability evidence are recorded under cloud-usable-instance-provider-nic-storage. AWS/Azure providers, public L4 ingress, SSH/WebShell productization, broader storage variants, and cloud benchmark reruns remain future gates.

Game/demo plans (Paperclips, Aurelian Frontier) are deprioritized opportunistic-only per the same redirect; see docs/tasks/README.md Ad-Hoc Planning / Research Tasks for the High / Normal / Low / Closed bands and the dispatch ordering.

Earlier (pre-2026-05-05) priority ladder retained as background:

  1. Finish a reasonable SMP/threading milestone, including the current scheduler hot-lock bottleneck if the milestone still claims scalability.
  2. Build the device-driver foundation before cloud/network/storage expansion: ACPI/MADT/MCFG, PCI/PCIe, I/O APIC, MSI/MSI-X, DMA/MMIO/IRQ authority, and reusable virtio/device lifecycle code.
  3. Implement storage capability interfaces, starting with RAM-backed Store/Namespace; proceed to local disk and a small read-only filesystem when the block path is ready.
  4. Keep serial diagnostics as the first remote troubleshooting path for cloud/hardware bring-up, then add SSH, Telnet development access, and basic WebShell access when network and identity prerequisites are credible.
  5. Boot on GCP/AWS in two stages: first imported-image serial-console boot, then a usable cloud instance with provider storage/network drivers and network shell access.

The 2026-05-05 ladder above is the authoritative current ordering; the earlier ladder remains as background context only.

Details:

  • docs/tasks/README.md
  • docs/backlog/smp-phase-c.md
  • docs/backlog/session-bound-invocation-context.md
  • docs/proposals/session-bound-invocation-context-proposal.md
  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/backlog/local-users-management.md
  • docs/proposals/boot-to-shell-proposal.md
  • docs/proposals/oidc-and-oauth2-proposal.md

Whitepaper Track

A future capOS whitepaper / technical report consumes – not duplicates – work from the other tracks. The plan, outline, and live evidence-gap log remain in docs/paper/ (plan.md, outline.md, evidence-gaps.md). The paper itself is a Typst project at papers/schema-as-abi/ and is built via make paper.

The paper’s Tier-1 evidence requirements pull these existing items into explicit paper-supporting roles. They are not new tracks; they are the selection lens this track applies:

  • Stage 6 session-bound invocation context migration (closes the “interface IS the permission” claim).
  • A measurement harness over make run-measure producing reproducible ring throughput, cap_enter latency, IPC handoff, and schema-dispatch numbers (closes the ring-as-sufficient-boundary claim).
  • A paper-scoped persistence proof-of-concept narrower than the storage proposal (closes the wire-format-enables-persistence claim).
  • A paper-scoped network-transparency proof-of-concept narrower than the general networking proposal (closes the wire-format-enables-network-transparency claim).
  • At least one of {promise pipelining, notification objects} (closes capnp-rpc-shaped composition beyond CALL/RECV).

Tier-2 strengtheners: ring-protocol Kani proof, full concurrent SMP scheduling, end-to-end SSH Shell Gateway, one non-toy demo beyond Adventure or First Chat.

Out of scope for the first paper (acknowledge in Future Work only): aarch64, GPU, live upgrade, formal MAC/MIC, Go/WASI, cloud metadata, production volume encryption.

When workplan slices close a paper-evidence gap they should reference docs/paper/evidence-gaps.md and update it in the same task, including the matching #todo block in papers/schema-as-abi/main.typ. A structural pre-evidence draft already exists at papers/schema-as-abi/main.typ; the abstract, the Evaluation section, the Conclusion, and any contribution claim that depends on missing Tier-1 evidence stay deferred until that evidence lands. New paper content that does not depend on missing artifacts may be drafted at any time and lives next to the existing #todo blocks.

Completed Foundation

  • Stage 0: Foundations: bitmap physical frame allocator, heap for alloc, IDT exception handling, and initial Cap’n Proto schema scaffolding.
  • Stage 1: Virtual Memory: kernel and per-process address spaces, page table abstraction, HHDM preservation, and user-half cleanup.
  • Stage 2: User-Space Transition: GDT/TSS/syscall setup and Ring 3 round-trip path.
  • Stage 3: Process Abstraction: ELF loading, process ownership of address spaces and cap tables, process exit cleanup, and the current exit / cap_enter syscall surface.
  • Stage 4: Capability Syscalls / Ring Transport: Console capability, shared-memory submission/completion rings, cap_enter, CQE transport errors, and alloc-free dispatch paths.
  • Stage 5: Scheduling Core: PIT/PIC timer preemption, round-robin scheduler, context switching, generation-tagged caps, and VirtualMemory cap.
  • Kernel Networking Smoke: in-kernel QEMU virtio-net lower-layer fixture evidence for PCI/device discovery, descriptor-accounting guards, ARP, and ICMP. TCP/UDP socket proof has moved to the Phase C userspace network-stack gates.
  • Boot To Shell / Native Shell: shell-led boot flow, split debug/terminal UARTs, local setup/login, anonymous/operator sessions, and shell REPL.
  • Verified Core: bounded local/GitHub Kani model-checking gate plus high-memory proof gate for selected cap-table, frame-bitmap, transfer rollback, and resource accounting invariants. These are bounded model checks (small input sizes such as <=8 frames and 63 ELF bytes), not unbounded proofs; they hold within the harness bounds, not for all inputs.
  • Shared-Service Demo Base: chat, adventure, NPC-as-process, and shared service harness prototypes.

Historical completion reports live in docs/changelog.md.

Stage 6: IPC And Capability Transfer

Outcome: cross-process capability calls, capability transfer, revocation, and process spawning are capability-shaped and usable by init-owned service graphs. Caller-selected service-visible identity is being replaced by session-bound invocation context: each normal process has one immutable session context, endpoint calls expose privacy-preserving caller-session metadata, and broker-granted service roots/facets carry service access.

Implemented:

  • cap_enter blocking wait
  • Endpoint kernel object
  • RECV/RETURN ring opcodes
  • cross-process IPC
  • direct-switch IPC handoff
  • legacy endpoint receiver metadata as transitional IPC machinery
  • copy/move capability transfer
  • CAP_OP_RELEASE
  • runtime handle release integration
  • epoch revocation and Revocable Read proof
  • MemoryObject substrate – the kernel-level mapping mechanism that backs zero-copy IPC. Demonstrated end-to-end by make run-memoryobject-shared (single-shot transfer) and make run-ipc-zerocopy (multi-message shared point-to-point buffer with metadata-only endpoint CALLs). The typed SharedBuffer surface and service APIs that consume it (File.readBuf, BlockDevice.readBlocks, NIC RX/TX rings) are still pending.
  • ProcessSpawner / ProcessHandle
  • init-owned manifest execution and boot package boundary cleanup
  • immutable per-process SessionContext ownership, default child-session inheritance, and trusted broker-selected child sessions, demonstrated by make run-session-context

Remaining themes:

  • typed SharedBuffer capability and consuming service APIs (storage, block, network, GPU) on top of the existing MemoryObject substrate
  • notification objects (so zero-copy producers/consumers can signal each other without per-record endpoint CALLs)
  • promise pipelining
  • CapabilityManager list/grant interface
  • stable service-audit identity for endpoint caller-session references across intentional service replacement or upgrade
  • scheduling context and resource donation
  • init ELF embedding

Details:

  • docs/backlog/session-bound-invocation-context.md
  • docs/backlog/service-object-identity-migration.md (superseded)
  • docs/backlog/stage-6-capability-semantics.md
  • docs/proposals/service-architecture-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/error-handling-proposal.md

Stage 7: SMP, Runtime, Networking, And Shell

Outcome: capOS moves from single-CPU scheduling and local-only shell access to multi-CPU execution, thread-aware runtime behavior, socket-shaped network capabilities, and agent/web shell entry points.

SMP status:

  • Phase A complete: BSP per-CPU syscall stack/current-thread state and unified kernel-entry stack hook.
  • Phase B complete: APs start through Limine MP, switch to capOS kernel paging/stacks, initialize AP-local CPU state, and park.
  • Phase C selected AP scheduler-owner proof complete: GS/swapgs, LAPIC timer/IPI, TLB shootdown, and first AP scheduler-owner proof are complete. Commit d88bca7 at 2026-04-25 11:31 UTC proves AP cpu=1 can run scheduler-owned user contexts under -smp 2 while a scheduler-owner latch keeps the BSP in kernel idle. Per-CPU scheduler ownership, the narrow idle-to-runnable reschedule-IPI wake path, and the focused process-scale proof harness are now present.
  • Multi-Process SMP Concurrency is complete at commit 3fb89923 (2026-04-30 09:45 UTC). make run-smp-process-scale records repeated raw QEMU serial logs plus per-case medians and fails closed below the 1.6x speedup threshold. The accepted KVM-backed run recorded 1.608x 1-to-2 speedup, and ordinary run-smoke/run-spawn coverage passed under -smp 2.
  • In-Process Threading Scalability has the formal capOS+Linux thread-scale evidence pair on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556: capOS work 1.883x and total 1.787x clear the configured 1-to-2 gates against the then-current single-global-queue scheduler; matching Linux pthread baseline 1.988x/1.987x validates the workload shape. Its 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy (capOS 1.566x/1.538x vs Linux 3.963x/3.858x on the same physical-core pin set). Phase D WFQ later manually accepted the recorded 1-to-4 diagnostic with capOS 3.088x/2.700x and matching Linux 3.974x/3.850x.

Runtime/network/shell themes:

  • reconcile in-process threading implementation status and any follow-on work
  • scheduler evolution after the accepted Phase D WFQ closeout: Phase E SchedulingContext capability authority is closed; CPU isolation housekeeping/deferred-work placement is closed; bounded SQPOLL ring mode and the clockevent/deadline substrate are closed; bounded non-periodic SQPOLL producer-wake progress is closed. The narrow single-runnable-entity and SQPOLL-coupled automatic nohz activation increments are closed (scheduler-phase-f-auto-nohz-activation, scheduler-phase-f-auto-nohz-sqpoll under docs/tasks/done/2026/); generic full-nohz for explicitly budgeted compute leases and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed, while policy issuance remains future work. Keep EEVDF as a follow-on best-effort ordering evaluation and keep stateful task/job graph coordinators above CPU dispatch rather than turning them into global schedulers. Userspace policy-service AutoNoHz placement for ordinary “capable of saturating a CPU core” threads sits in Phase H of docs/backlog/scheduler-evolution.md and the “Policy-Service Userstories” section of docs/proposals/tickless-realtime-scheduling-proposal.md: the policy-service-issued CpuIsolationLease adds placement isolation only and never mints CPU-time authority, with bounded lifetime, revocation, accounting target, and operator-declared auto-claim pool
  • session lifecycle for production shell UX: mutable session liveness cells, UserSession.logout, owner-shell/gateway close propagation, and narrow renewal/recovery paths that mint fresh grants without reviving stale ordinary caps; clean local owner-shell exit now reaches the logout path, while renewal/recovery remains future work
  • Telnet Shell Demo as first TCP-backed TerminalSession proof. Plaintext, loopback-only research demo; not a shippable Telnet service.
  • Tickless idle as the near-term timer cleanup: split clocksource from clockevent, convert timeout waiters to absolute deadlines (done), migrate the scheduler idle path to a CPL0 per-CPU kernel idle thread (done), then stop the periodic tick only when no runnable work exists. After the one-SQ-consumer, CPU-isolation authority, nohz telemetry, and housekeeping placement prerequisites, bounded SQPOLL ring mode and the clockevent/deadline substrate closed, and bounded non-periodic SQPOLL progress was proven; the periodic tick is now suppressed for the narrow single-runnable-entity window and for the ring-coupled kernelSqpoll lease (scheduler-phase-f-auto-nohz-activation, scheduler-phase-f-auto-nohz-sqpoll), with the periodic tick as the fail-closed fallback everywhere else. Timeout-based auto-revoke, generic full-nohz for explicitly budgeted compute leases, and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed. See docs/proposals/tickless-realtime-scheduling-proposal.md and docs/research/nohz-sqpoll-realtime.md.
  • SSH Shell Gateway as the production remote CLI successor to plaintext Telnet after host-key, authorized-key, audit, and persistence prerequisites exist
  • remote session CapSet clients as the programmatic/UI counterpart to shells: regular host apps, desktop GUI/Tauri front ends, and server-side webapp gateways authenticate through the same session/admission path, receive broker-issued remote capability views, and call granted services over Cap’n Proto RPC without turning chat, Paperclips, agent tools, or future command surfaces into shell-only protocols. The first default-run development endpoint and focused interop harness now prove this shape with schema-framed Cap’n Proto DTOs; standard capnp-rpc proxy transport remains future work. Later UI-composition caps let capOS-side services or agents propose bounded session workspace changes without receiving arbitrary browser or desktop authority.
  • self-served capOS web UI has historical focused proof evidence, but the old make run-remote-session-self-served-web-ui target is retired with the qemu-only kernel TCP listener. The replacement proof belongs to the future Phase C Web UI L4 gate. make run forwarding the guest remote-session CapSet endpoint is still not the same as capOS serving the web UI, and make remote-session-ui remains the host-side trusted development bridge. The blocked remote-session-self-served-web-ui-default-run task records the future decision and wiring gate if self-served UI should become part of ordinary make run.
  • Telnet over TLS as an optional compatibility/service-terminal transport after certificate/TLS, durable identity, and session lifecycle work exists. It should not be a default main access interface ahead of SSH/WebShell.
  • decomposed userspace NIC/network-stack milestone after driver authority gates
  • native shell agent runner
  • WebShellGateway using the same broker-issued shell/agent authority model

Remote shell priority: do not treat Agent Shell or WebShellGateway as the next default visible milestone before the driver/storage foundation unless the user explicitly redirects. SSH/WebShell production access is more useful after session lifecycle, durable account/key material, network listener authority, and serial/cloud diagnostics have credible proofs. Plaintext Telnet remains a loopback/local development proof and a simple transport for exercising TerminalSession; it is not a production cloud access target. Telnet over TLS may remain as a later optional transport, but SSH and WebShell are the main production access tracks.

Details:

  • docs/backlog/smp-phase-c.md
  • docs/backlog/scheduler-evolution.md
  • docs/backlog/runtime-network-shell.md
  • docs/backlog/remote-session-capset-client.md
  • docs/proposals/smp-proposal.md
  • docs/proposals/scheduler-evolution-proposal.md
  • docs/research/future-scheduler-architecture.md
  • docs/proposals/tickless-realtime-scheduling-proposal.md
  • docs/proposals/networking-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/remote-session-capset-client-proposal.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/proposals/boot-to-shell-proposal.md

Hardware, Boot, And Storage

Outcome: capOS boots beyond the current ISO/QEMU manifest path, discovers real hardware, supports block devices, and exposes local persistent storage through typed capabilities.

Tracks:

  • hybrid BIOS+UEFI raw disk image and make run-disk
  • serial diagnostics console for cloud/hardware bring-up
  • ACPI/MADT/MCFG discovery
  • reusable interrupt and PCI/PCIe infrastructure
  • virtio-blk and NVMe block-device paths
  • boot binary ISO layout that moves ELF payloads out of the manifest blob
  • RAM-backed Store/Namespace
  • read-only local filesystem proof
  • writable local storage with recovery policy
  • installable system: boot from disk with persistent, mutable system configuration composed over the immutable boot manifest (own milestone, sequenced after the writable-local-storage milestone it builds on)
  • staged cloud boot: first serial-console boot, then provider block/NIC drivers and network shell access

Details:

  • docs/backlog/hardware-boot-storage.md
  • docs/proposals/cloud-deployment-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/installable-system-proposal.md
  • docs/dma-isolation-design.md

User Identity, Sessions, And Policy

Outcome: shell, service, and future web sessions receive narrow capability bundles based on explicit identity, freshness, policy, and audit context.

Implemented base:

  • anonymous/operator shell sessions
  • password setup/login proof
  • broker-issued shell bundles
  • redacted auth/session audit records

Remaining themes:

  • manifest-seeded local accounts, recovery identities, service identities, and initial role/resource profiles
  • disk-backed local account store over capability-native storage
  • default per-account, guest, anonymous, external, and service-account resource bundles
  • explicit external identity bindings for OIDC/passkey/cloud/certificate principals
  • durable verifier/passkey records
  • WebAuthn and passkey-only setup path
  • broader AuditLog completion
  • ABAC context such as auth freshness, session age, source, and claims
  • mandatory-policy labels and wrapper caps
  • guest and anonymous workload demos
  • POSIX profile adapter metadata
  • OIDC/OAuth2 integration

Details:

  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/backlog/local-users-management.md
  • docs/proposals/oidc-and-oauth2-proposal.md
  • docs/proposals/certificates-and-tls-proposal.md
  • docs/proposals/cryptography-and-key-management-proposal.md
  • docs/security/trust-boundaries.md

Security And Verification

Outcome: trust boundaries fail closed, proof gates stay practical, and trusted build inputs remain review-visible.

Implemented base:

  • host tests for pure logic
  • Loom ring model (a bounded concurrency model of the ring protocol, not the shipped kernel/src/cap/ring.rs)
  • Miri/proptest/bounded Kani model-checking paths
  • dependency policy checks
  • pinned Limine and Cap’n Proto tooling
  • DMA isolation design gate
  • panic-surface inventory

Remaining themes:

  • Stage-6 trust-boundary refresh
  • untrusted-service hardening and quota/exhaustion smokes
  • Kani harness bounds refresh when new proof obligations are concrete
  • DMA assurance model operationalization: turn the v0 TLA+/Alloy skeletons into checked run targets (make model-dma-tla / model-dma-alloy / kani-dma-authority + a DeferredCompletionQueue Loom) reconciled with landed DMA code and wired to CI
  • Scheduler & IRQ assurance models: first formal coverage for the densest unmodeled race surface – nohz activation/rollback (TLA+ + Loom), the LAPIC one-shot timer fix (Kani + TLA+), CpuIsolationLease authority (Alloy + TLA+), and the MSI-X waiter determinism ordering (TLA+)

Details:

  • docs/backlog/security-verification.md
  • REVIEW.md
  • docs/tasks/README.md
  • docs/proposals/security-and-verification-proposal.md
  • docs/security/verification-workflow.md
  • docs/trusted-build-inputs.md

Shared-Service Demos

Outcome: multi-process demos prove resident services, shell-spawned clients, session-bound invocation context, shared harnesses, and eventually network-transparent federation.

Implemented:

  • First Chat MVP
  • Local MUD/adventure prototype
  • NPC-as-process fleet
  • shared service harness extraction
  • session-bound chat/adventure state keyed by live caller-session metadata

Remaining themes:

  • per-principal chat state and audit
  • Aurelian Frontier game-depth work after the first deterministic mission slice
  • native command-surface replacement for prototype StdIO
  • federated chat after network transparency

Details:

  • docs/backlog/shared-service-demos.md
  • docs/backlog/aurelian-frontier.md
  • docs/demos/adventure.md
  • docs/proposals/aurelian-frontier-proposal.md
  • docs/proposals/interactive-command-surface-proposal.md

aarch64 Support

Outcome: port the architecture layer after x86_64 hardware abstraction stabilizes.

Shared code expected to carry over:

  • capability model and schema
  • ring structs and transport contracts
  • userspace runtime model
  • process/capability abstractions above arch/

Architecture-specific work:

  • EL0/EL1 syscall entry/exit
  • GICv3 interrupts
  • ARM generic timer
  • PL011 UART
  • TTBR0/TTBR1 MMU setup
  • TPIDR_EL1 per-CPU data
  • kernel/linker-aarch64.ld

Future Tracks

These are not selected unless docs/tasks/state.toml or explicit user direction pulls them into active selected-milestone scope. Add root task records and backlog/proposal decomposition only when one of these tracks becomes the selected visible outcome:

  • regular Rust runtime support
  • C libcapos
  • Go GOOS=capos
  • Python runtime adapters
  • Lua scripting (Phase 0 capability-aware Lua-subset interpreter shipped in demos/lua-smoke/; PUC Lua dialect compatibility remains future, awaiting C/libcapos)
  • POSIX compatibility adapters
  • WASI runtime
  • C++ experiments
  • GPU/CUDA capability integration
  • system monitoring
  • network transparency
  • process persistence/checkpoint-restore
  • live upgrade
  • cloud metadata
  • volume encryption
  • formal MAC/MIC modeling
  • browser/WASM support
  • robotics realtime control
  • trusted time and clock authority
  • crash recovery and supervision
  • debug and trace authority

Use proposal files under docs/proposals/ and research notes under docs/research/ before promoting any future track into docs/tasks/README.md. Lua scripting should arrive as an ordinary capability-scoped userspace runner, not as kernel scripting or ambient shell authority.

seL4 HAMR (model-based high-assurance engineering)

Evaluated HAMR (High Assurance Modeling and Rapid engineering): AADL component models, Slang/GUMBO contracts, and seL4/CAmkES backend generation, and how that model-to-capability-system pipeline compares with capOS’s “the Cap’n Proto schema is the contract” model, capability partitioning, and the schema-as-ABI story. Findings: docs/research/sel4-hamr.md (reference talk: https://youtu.be/gP1klZJi04U).

Crate publication

Publish capOS’s reusable no_std crates – capos-abi, capos-lib, capos-config, and the capos/capos-rt runtime/facade – to crates.io with stable versioning, rendered docs, and license/metadata, so the ELF parser, capability table, ring/SQE wire validation, manifest/CUE loader, and typed clients can be reused and cited independently of the kernel tree. The publish-set decision is pinned in docs/backlog/capos-sdk-dual-transport.md: publish capos-abi, the capos-capnp-build build helper, capos-config, and capos-lib first; publish capos-rt and the bare capos facade with the transport seam; ship the libcapos/libcapos-posix C substrate as release artifacts only (not crates.io – their consumers link .a archives, decision 2026-06-02 16:10 UTC); the publish-set MSRV is the stable Rust 1.88.0 proven by the slice-2 dry-run (the Rust 2024 floor 1.85.0 cannot build capos-config’s let chains); and keep generated Cap’n Proto bindings inside capos-config rather than publishing a separate bindings crate. The versioning policy (pre-1.0 SemVer, schema/ABI changes as breaking bumps, lockstep across the set) and the repeatable make sdk-publish-dry-run gate are recorded in docs/backlog/capos-sdk-dual-transport.md.

This track now also covers the front-door capos SDK crate: one published crate whose typed capability clients run unchanged against two transports – the in-process capability ring (an application running inside capOS) and a remote connection (a host-side RPC client) – behind a Transport seam. The bare capos name is the facade; capos-rt provides the ring transport and the remote feature provides the host transport. The seam and facade have landed: capos-rt defines the Transport trait and the in-system RingTransport, the typed clients are transport-generic, and the standalone capos facade crate re-exports the runtime, clients, and entry_point! macro behind the default ring feature (proved in-system by make run-spawn). The remote transport backend remains ahead. Crates.io remains a flat, first-come namespace; the exact crate names were verified free before the 2026-06-05 upload and are now claimed by the capOS 0.1.0 release, while the adjacent capos-bitstruct crate from an unrelated cap-os/rust-tools repository shows the namespace contention risk. The near-term reservation work is closed: existing reusable layers were published with real content, the bare capos facade was reserved with transport-seam content, and the seam landed early. The repository-wide license file required by the public-release boundary is recorded (LICENSE-APACHE / LICENSE-MIT, MIT OR Apache-2.0 on the SDK crates). The first six-crate 0.1.0 publish completed on 2026-06-05 after the final crates.io name re-check, the custom-target SDK gate, and the local Cargo API-token upload. The capos-config docs.rs accommodation is implemented through the packaged generated-binding fallback, and the GitHub Actions trusted-publishing workflow is present for subsequent releases from refs/heads/main after a current explicit user release instruction and crates.io trusted publishers are configured for the six crates. Decomposition and publication ordering are in docs/backlog/capos-sdk-dual-transport.md; the transitional host-backend remote transport (slice 4a) can ship now, while the live-proxy capnp-rpc upgrade (slice 4b) remains gated on the remote-session async-runtime rewrite.

Observable Milestones

Completed visible milestones:

  • 2026-04-22 16:35 UTC, commit d4016ab: Unprivileged Stranger
  • 2026-04-23 08:41 UTC, commit f554e88: Native Cap Shell
  • 2026-04-23 13:39 UTC, commit e5adafb: Boot to Shell
  • 2026-04-23 16:15 UTC, commit 7f19af2: Revocable Read
  • 2026-04-23 16:34 UTC, commit 8b66c13: split UART shell session
  • 2026-04-23 22:09 UTC, commit d43b691: Verified Core
  • 2026-04-24 00:13 UTC, commit 2cd85a8: First Chat MVP
  • 2026-04-24 01:40 UTC, commit add7f9b: Local MUD/adventure prototype
  • 2026-04-24 03:13 UTC, commit da5f5e9: Ring as Black Box
  • 2026-04-24 15:37 UTC, commit b56a5c1: First Packet
  • 2026-04-24 16:47 UTC, commit a4f1722: First HTTP
  • 2026-04-25 05:36 UTC, commit 0b79054: SMP Phase A: per-CPU data on BSP
  • 2026-04-25 06:59 UTC, commit d3c30c6: SMP Phase B: APs running
  • 2026-04-25 11:31 UTC, commit d88bca7: First AP Scheduler
  • 2026-04-25 20:25 UTC, commit 2834bfc: Telnet Shell Demo
  • 2026-04-30 09:45 UTC, commit 3fb89923: Multi-Process SMP Concurrency
  • 2026-05-01 14:23 UTC, commit fb102828: Remote Session CapSet Web UI Proof
  • 2026-05-11 14:38 UTC, branch commit 28db3277: Self-Served capOS Remote Session Web UI Proof. The now-retired make run-remote-session-self-served-web-ui target booted the focused manifest, loaded browser assets from the capOS remote-session-web-ui service over its scoped listener, denied no-cookie browser commands, called backend-held SystemInfo, logged out, and then attempted the retained backend-held SystemInfo capability to prove expired-session stale failure. The host make remote-session-ui bridge remains a development tool.
  • 2026-05-13 11:05 UTC, branch commit 5f5028e7: WASI bounded environment grant smoke. make run-wasi-env boots the focused wasm-host manifest, reads the bounded initConfig.init.wasiEnv text grant, reflects it through Preview 1 environ_get / environ_sizes_get, and the Rust wasm32-wasip1 payload prints [wasi-env] CAPOS_WASI_ENV_SENTINEL=capos-wasi-env-sentinel. Missing wasiEnv remains the empty-environment behavior.
  • 2026-05-01 16:13 UTC, commit 5198e255: Remote Session Adventure Launch
  • Cloudboot run 1778230874-715a (2026-05-08 09:06 UTC), source commit 3951e275 (2026-05-08 08:50 UTC): GCP Imported-Image Serial Boot. make cloudboot-test booted the GCE imported disk image to the capos kernel starting serial landmark on a temporary no-public-IP, no-service-account e2-small instance, captured serial output, and tore down the temporary cloud resources. This is a boot-path portability milestone, not provider NIC/storage driver readiness.
  • GCP-first usable-instance provider rollup, closed 2026-06-07 05:26 UTC by commit b5fdcc3e and cloud-usable-instance-provider-nic-storage: serial-console operator access run 1779868872-2424 (source commit c92c8bc1), live legacy virtio-net raw-frame provider-nic-bound run 1780412056-e1cb (source commit 1fb65683), live NVMe Persistent Disk brokered READ run 1780806087-bf69 (source commit 28518165), and separate live gVNIC raw-frame / typed-Nic portability runs 1780794927-1aa9 (source commit 3ef8997a) and 1780796615-decc (source commit 2a0857d). This closes the selected GCP provider NIC/storage bar while leaving public L4 ingress, SSH/WebShell productization, AWS/Azure providers, broader storage, high-throughput/multiqueue NIC, and direct-remapping DMA for future tracks.
  • Device Driver Foundation (DDF) bounded-authority proof series, 2026-05-08 through 2026-05-23: read-only hardware-audit snapshots (make run-hardware-audit*), bounded DMAPool/DMABuffer result caps with parent-first release and proof-slot reuse (make run-dmapool-grant), DeviceMmio brokered read/write and Interrupt wait/ack/mask/unmask grant proofs (make run-devicemmio-grant, make run-interrupt-grant, make run-hardware-grant-cycle), a device-manager-owned DMAPool budget ledger, and the userspace provider-consumer TX/RX path (make run-ddf-provider-consumer): bounded selected-route descriptor/avail/ doorbell/used-ring/CQ handoffs, full selected TX queue-depth CQ ownership, bounded RX synthetic-token CQ identity, selected TX/RX MSI-X/LAPIC wait/ack/EOI, selected-route reset/reassignment, and teardown/stale-handle blocking. These are bounded-proof milestones, not live hardware RX used-ring ownership, full virtio-net ownership, direct DMA/IOMMU, cloud NIC/storage readiness, or production userspace driver readiness. The provider virtio-net closeout slice is commit c86374f8 (2026-05-23 16:51 UTC); the executable decomposition and remaining gates live in docs/backlog/hardware-boot-storage.md and the DDF task files under docs/tasks/. Visible demo follow-ups:
  • Adventure/shared-service follow-ups after the Local MUD prototype: 73d83aa, da51dc7, 353c8bc, e20cf07, 948c96e, and ca6300c. These refine discoverability, room context, expedition map, relic custody, explicit resume, and chat-only named actors; detailed reports live in commit history.
  • 2026-04-26 04:10 UTC, commit 5480304: Scoped Telnet Gateway Authority. telnet-gateway now uses manifest-forwarded scoped listener authority plus RestrictedShellLauncher; detailed verification history lives in commit history.
  • 2026-04-26 23:12 EEST, commit 4304b0e: Default run Telnet wiring. The default manifest starts telnet-gateway, and make run attaches host-local 127.0.0.1:2323 -> guest :23 forwarding.
  • 2026-05-01 16:54 UTC, branch commit 367117be: Default run Telnet wiring retired. The default manifest no longer starts telnet-gateway, and make run now forwards only the remote-session CapSet endpoint. The plaintext Telnet research fixture was later retired with the qemu-only kernel TCP listener; make run-telnet now exits before QEMU with a retirement diagnostic.
  • 2026-05-02 02:24 UTC, branch commit 84f5ac61: Remote Session Gate 3 auth-denial proof. Focused backend/account-store coverage rejects inactive accounts, unknown principals, and missing or retired resource profiles before remote-client bundle authority exists. The live CLI/QEMU proof now drives bad password proof, unknown account, wrong requested profile, and anonymous profile mismatch denials before any session, CapSet, or service-launch activity; denied re-login clears prior gateway/client/UI session state.
  • 2026-05-02 06:23 UTC, branch commit 482e5e07: Remote Session Adventure mutable control proof. The remote Adventure fixture and trusted web bridge now call bounded Adventure.go(direction) through the same session-bound worker/client path as status, look, and inventory, then verify movement text, changed room state, redacted transcripts, and visible-button UI automation without exposing raw capOS authority.
  • 2026-04-27 00:02 EEST, commit 7a155f4: Telnet IAC handoff fix and repeat-connect support. Telnet handoff no longer consumes raw socket input before intoTerminalSession, repeated host connections succeed, and the harness drives two consecutive sessions.
  • 2026-04-28 17:46 UTC, commit d09243d: Aurelian Phase 9 competency gates. The adventure proof now has host-testable rank/star/circle policy, status output for rank marks and standing, signifer skill gates, first-mission spell gates, and QEMU assertions for rank denial plus debrief reward.
  • 2026-04-28 18:12 UTC, commit 47dbfc5: Aurelian Phase 10 market logistics. Adventure now has typed quote/buy/sell/trade/repair calls, bounded market roles, a deterministic Maro route purchase, and QEMU assertions for market quote, successful exchange, and clean-custody trade refusal.
  • 2026-04-28 19:36 UTC, commit e204454: Aurelian Phase 11a calendar foundation. Generated content now carries fixed-smoke season/day/weather and hazard state plus bounded seasonal resources, Adventure status prints that state, and the real scenario process asserts it through Adventure.status.
  • 2026-04-30 08:56 UTC, commit 4045576: Aurelian Phase 11a calendar event metadata. Generated content now carries a fixed-smoke active festival and later military event with pure Rust validation; Adventure status prints the active event metadata, and the real scenario process asserts it through Adventure.status. Actor movement, shop mutation, witness blocking, route mutation, debrief branching, quests, gifts, and affection remain future work.
  • 2026-04-30 13:09 UTC, commit 64933131: Aurelian Phase 11a seasonal shop-stock purchase. adventure-content owns the bounded active-stock, standing-gate, remaining-stock, and depletion decision for seasonal shop purchases. The quartermaster field-rations buy path now spends audited Aurelian standing, records service-owned per-expedition seasonal stock usage, adds the ration to inventory, and the real scenario process asserts both the pre-debrief refusal and post-debrief purchase through Adventure.buy. Broader seasonal economy mutation, persistence, seeded normal-play calendars, and automatic world advancement remain future work.
  • 2026-04-28 20:08 UTC, commit 48c62db: Aurelian Phase 11b regional foundation. Generated content now carries settlement, outpost, and route metadata with validation and stable ordering; Adventure status prints a regional summary, and the real scenario process asserts it through Adventure.status.
  • 2026-04-30 12:07 UTC, commit 6afd87aa: Aurelian Phase 11b regional market transaction proof. adventure-content owns bounded reserve, commit, cancel/release, stale-version rejection, idempotent replay from ordered receipt facts, and terminal-receipt-capacity checks for one generated order-book match at a time. adventure-server keeps transaction state inside each expedition PlayerState, so fresh and resumed expeditions do not share market idempotency history. The real scenario process asserts regional quote/reserve/retry/commit/stale/release/cancel flows through existing Adventure.quote, Adventure.buy, and Adventure.sell calls.
  • 2026-04-30 13:39 UTC, commit 6605ee6a: Aurelian Phase 11b regional market delivery proof. Fresh committed field-ration receipt facts now produce a bounded player-local supply delivery into expedition inventory, while commit replay and errors do not duplicate items. The real scenario process asserts delivery of the committed quantity and no replay duplication through existing Adventure.buy and Adventure.inventory calls. NPC stores, outpost stock, currency, durable ledgers, profile balances, and crash recovery remain future work.
  • 2026-04-30 14:15 UTC, commit b1c98eb1: Aurelian ordinary inventory capacity proof. adventure-content now owns a deterministic admission helper for bounded ordinary inventory, and adventure-server routes room takes, seasonal harvests, quartermaster field-ration purchases, and regional market delivery through one helper. Regional committed delivery fails closed when the full quantity cannot fit, avoids partial duplication, and remains replayable after items are dropped.
  • 2026-04-30 14:51 UTC, commit f06aa732: Aurelian capacity replay proof. The capacity-denial path now uses authored/generated resources only, keeps transfer on the same ordinary inventory admission helper, exposes bounded repair-material collection at resource sites, and proves through the real scenario process that held regional delivery mutates no partial items and later delivers the full quantity after buy commit-field-ration from regional-market is replayed.
  • 2026-04-30 15:14 UTC, commit fd432147: Aurelian regional market currency debit proof. Fresh committed regional field-ration buys now spend two player-local Aurelian chits exactly once, expose the balance in inventory, reject insufficient balances before transaction mutation, and keep held item delivery replay independent from debit replay. NPC stores, outpost stock, durable currency ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work.
  • 2026-04-30 15:53 UTC, commit 7a9a4af5: Aurelian regional outpost stock proof. Fresh committed regional field-ration buys now decrement seller ash_farm stock from six to two exactly once, expose that stock in status, reject insufficient seller stock before mutation, and keep committed replay plus held item delivery replay from decrementing again. NPC stores, broader outpost inventories, durable stock ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work.
  • 2026-04-30 16:23 UTC, commit 00b18598: Aurelian regional market fee accrual proof. Fresh committed regional field-ration buys now accrue the generated buy and sell order fees into a service-owned regional-market pool exactly once, expose that pool in status, ignore release/no-cross and non-ration facts, and keep committed replay plus held item delivery replay from accruing again. NPC stores, broader outpost inventories, durable stock and currency ledgers, profile balances, durable fee ledgers, expiry advancement, and crash recovery remain future work.
  • 2026-04-30 16:57 UTC, commit bdcc23ed: Aurelian regional seller proceeds proof. Fresh committed regional field-ration buys now credit the service-owned ash_farm proceeds pool two chits exactly once, expose that pool in status, ignore release/no-cross, stale, mismatched, and non-ration facts, and keep committed replay plus held item delivery replay from crediting proceeds again. NPC stores, broader outpost inventories, durable stock and currency ledgers, durable seller-proceeds ledgers, profile balances, durable fee ledgers, expiry advancement, and crash recovery remain future work.
  • 2026-04-30 17:41 UTC, commit 29c065a9: Aurelian regional market order expiry proof. adventure-content now has pure order activity and day-aware deterministic matching; adventure-server uses the fixed smoke day for live regional-market reserve and quote, and the scenario process proves a day-73 expired field-ration reserve releases without status, inventory, currency, outpost stock, fee, seller-proceeds, or delivery mutation. Durable calendar advancement, durable order books, profile ledgers, durable fee ledgers, and crash recovery remain future work.
  • 2026-04-30 18:40 UTC, commit 205fd6a0: Aurelian regional market fee withdrawal proof. adventure-content now has a pure resolver for bounded regional-market fee withdrawal from the current pool plus applied withdrawal ids; adventure-server owns the live fee pool, applied withdrawal ids, and service treasury balance; and the scenario process proves sell withdraw-fees to regional-market moves the two accrued fee chits exactly once without mutating inventory, currency, outpost stock, seller proceeds, or delivery state.
  • 2026-04-30 19:43 UTC, commit a547db3d: Aurelian regional market receipt snapshot proof. adventure-content reconstructs RegionalMarketTransactionState from ordered receipt facts with bounded validation, and adventure-server exposes buy receipt-snapshot from regional-market to prove the old field-ration commit still replays after reconstruction without mutating live market, inventory, fee, treasury, seller-proceeds, stock, or delivery state. Durable restart loading remains future work.
  • 2026-04-30 20:07 UTC, commit 4b44b32: Aurelian regional market settlement snapshot-view proof. adventure-content checks the settlement side-effect snapshot view from applied delivery, currency debit, outpost stock decrement, fee accrual, fee withdrawal, and seller proceeds ids plus the current balances, rejects over-capacity id snapshots, and proves the already committed field-ration fact plus fee withdrawal replay as already applied. adventure-server exposes buy settlement-snapshot from regional-market, and the real scenario process proves the command leaves live status and inventory unchanged. Durable restart loading remains future work.
  • 2026-04-28 21:08 UTC, commit 0b7db05: Aurelian Phase 11c construction foundation. Generated content now carries material, facility, blueprint, artifact, and enchantment-slot metadata with pure Rust validation and deterministic property derivation; Adventure status prints a construction summary, and the real scenario process asserts it through Adventure.status. Service-mediated construction jobs are tracked by the later Phase 11c construction-job proof; escrow, durable stock ledgers, output/currency inventory, and full artifact crafting gameplay remain future work.
  • 2026-04-30 13:01 UTC, commit 9f8cfb6c: Aurelian Phase 11c construction-job proof. adventure-content owns bounded reserve/start, completion, cancel/release, stale-version rejection, idempotent replay, service-owned material hold/release facts, older terminal replay, and fact capacity checks on top of existing construction metadata. adventure-server owns per-player construction material stock and applies holds/restores only for new successful repair outcomes; completion consumes the held materials, while replay and denial paths do not mutate stock. The real scenario process asserts denial, reserve/retry, open-reserve conflict, complete/replay, stale rejection, release/replay, and reserve-after-release through existing Adventure.repair calls. Durable persistence, broad stock ledgers, outpost replenishment, output/currency inventory, job-time advancement, and general crafting remain future work.
  • 2026-04-30 22:46 UTC, commit fd57de6b: the Aurelian construction receipt snapshot follow-on is scoped to pure Rust construction receipt snapshot semantics plus a size-constrained QEMU no-mutation probe. Pure adventure-content tests reconstruct a separate construction job state from ordered facts and reject malformed, over-capacity, and non-closed snapshot shapes. The QEMU scenario drives repair receipt-snapshot with field-engineer only to confirm status, inventory, live construction state, and material stock are not mutated. The runtime command is not a proof that receipts replay into the live service, and this is not durable restart loading or a general construction persistence layer.
  • 2026-04-28 21:36 UTC, commit f53d044: Aurelian Phase 11d agent NPC budget foundation. Generated content now carries disabled-by-default optional NPC agent budget metadata with model profiles, per-session/day input/output token limits, tool-call limits, cooldown, fatigue, sleep, refusal, and audit visibility. Pure Rust fake-model tests cover spending, refusals, disabled transcript stability, bounded output, and no authority mutation from model text; Adventure status prints an aggregate budget line asserted through Adventure.status. Live LLM integration, hosted-agent execution, durable memory, autonomous NPC actions, and authority mutation from model output remain future work.
  • 2026-04-30 08:22 UTC, commit c6d887: Aurelian Phase 11d fake-agent purpose expansion. Deterministic fake-agent responses now cover personal routines, nonbinding shop negotiation flavor, and festival reactions as dialogue/proposed-action data only. Pure Rust tests cover quota spending, quota refusal, bounded lines, and no authority mutation; Adventure status prints the supported purpose count and the real scenario process asserts it through Adventure.status.
  • 2026-04-28 22:22 UTC, commit 335a9ee: Aurelian Phase 12 party foundation. Adventure now has typed local party create/invite/accept/leave/delegate calls and assist, keyed by service-created local player labels derived from live caller-session keys. The server uses the unit-tested adventure-content party transition state for invite, accept, scoped delegation, assist, and leave cleanup; the scenario process asserts the one-client cap surface and party status line. Two-client QEMU proof, transfer escrow, duel/spar/contest authority, and cross-device multiplayer remain future work.
  • 2026-04-29 06:43 UTC, commit ac49375: Aurelian Phase 12 physical-item transfer foundation. Adventure adds typed transfer for same-party service-local player labels, with ordinary inventory mutation kept atomic inside the existing service and backed by pure Rust transfer tests. The scenario process asserts one-client refusal paths without faking a second live session. Currency escrow, broad market/trade coordination, and successful two-client QEMU transfer proof remain future work.
  • 2026-04-29 18:07 UTC, commit f4a7fdb: Aurelian authority-combat verb foundation. Adventure adds the bounded challenge-authority skill and challenge authority <target> text alias for the ward-wraith proof slice: accepted ward-writ attacks hostile ward authority instead of hp, records success-only evidence/effects, and QEMU coverage exercises wrong-target, missing-authority, success, and shell-alias paths. Broader authority-combat verbs, hostile authority enemy variants, writ affixes, and rank/base reach unlocks remain future work.
  • Merged on main at commit 6678d40 (2026-04-30 03:55 UTC): Paperclips Terminal Demo follow-up. The default manifest advertises the clean-room paperclips terminal game, and system-paperclips.cue plus make run-paperclips provide the focused QEMU proof for one-at-a-time manual production, representative refusal output, explicit sales, repeatable marketing, autoclipper unlock, real-time automation, generated Cap’n Proto content loading, scaled business-phase production, precision-rollers, design-search, forecast-engine, survey-drones, and the visible == autonomous phase == transition. The demo remains outside the current SMP process scaling milestone because it exercises a standalone StdIO plus Timer terminal process rather than SMP process-count or scheduler behavior.
  • Task branch commit 88536a9e (2026-04-30 17:38 UTC): Paperclips client/server showcase first slice. The focused manifest now boots Paperclips server services plus a terminal client; the server owns generated content, game state, regular timer cadence, unlock checks, game-rule mutation, and proof-command gating, while the client receives explicit StdIO plus a PaperclipsGame endpoint.
  • Task branch commit 532207c1 (2026-04-30 20:54 UTC): Paperclips structured command-list slice. The server exposes current command specs for terminal help without changing the raw text command execution path. Normal and proof sessions use separate server endpoints, preserving proof-only run <ms> and status --json authority.
  • Task branch commit e9ae4e97 (2026-04-30 22:02 UTC): Paperclips structured plain-status snapshot slice. The server exposes PaperclipsStatusSnapshot fields for terminal-rendered plain status, while status --json remains proof-only and server-gated.
  • Task branch commit 32462e9f (2026-04-30 22:32 UTC): Paperclips structured project-list slice. The server exposes unlocked project entries for terminal-rendered plain projects, while project <id> remains raw text execution against server-owned mutable state. Remaining Paperclips showcase work includes broader structured state/events, command facets, capability transfer/revocation ergonomics, and the later web-shell client path.
  • Commit 5ef16c3 (2026-04-30 04:17 UTC): Paperclips autonomous scaling follow-up. The CUE-authored generated content now owns millisecond drone matter-conversion, factory production, probe harvest, and probe replication caps; host tests cover the bounded transitions and completion gating. The focused QEMU proof continues after == autonomous phase == through material-harvesters and foundry-lines, then asserts lower local matter, increased autonomous production, and clean process exit.
  • Commit 65f9d2c (2026-04-30 07:36 UTC): Paperclips cosmic/completion transcript follow-up. The focused QEMU proof now continues through mesh-coordination, seed-probes, == cosmic phase ==, a bounded probe interval with visible replication, cosmic-matter conversion, and clip production, then final-conversion and == complete phase ==. That proof used compact clean-room values for the cosmic matter grant and terminal conversion clip cost so the run remained representative rather than an exhaustive full playthrough.
  • Commit 52d30d2b (2026-04-30 12:00 UTC): Paperclips completion rebalance. The late-game matter and final conversion costs now prevent normal play from reaching == complete phase == within one real-time hour. The focused QEMU proof stops at the cosmic production milestone with final-conversion still locked instead of scripting a compact full win.
  • Commit 9262938b (2026-04-30 12:26 UTC): Paperclips machine-readable status follow-up. The terminal demo now supports status --json as a stable compact state snapshot, and the focused QEMU proof asserts that late-game JSON line after the cosmic milestone while preserving the human transcript checks.
  • Commit 119acaad (2026-04-30 12:53 UTC): Paperclips review-fix follow-up. Active schema, CUE content, Rust rules, generated-content guardrails, and focused smoke assertions now use clean-room Strategy internals. Purchase parsing keeps omitted counts as one but rejects explicit zero counts without mutating game state.

Recently completed visible milestone:

  • Device Driver Foundation: the selected milestone is complete by the production-authority closeout task ddf-production-authority-closeout at commit ef8d98c2 (2026-06-07 08:15 UTC; task completion recorded 2026-06-07 08:23 UTC). The DDF closeout records the landed DeviceMmio/DMAPool/Interrupt lifecycle status, the provider-driver local authority evidence, hardware-audit consumption for abort-held DMA mapping records, and the runtime fail-closed DMA backend baseline. The related GCP-first usable-instance rollup cloud-usable-instance-provider-nic-storage (2026-06-07 05:26 UTC) records live operator serial access, selected raw-frame NIC/storage evidence, and gVNIC portability, without claiming public L4 ingress, AWS/Azure support, direct-remapping production hardware, device-autonomous MSI-X delivery, full userspace smoltcp/L4 readiness, or high-throughput/multiqueue NIC readiness.
  • POSIX Adapter v0 – File/Directory fd closeout: commit f97d9833 (2026-05-23 06:23 UTC) closes the P1.4 file/directory fd surface over the existing RAM-backed root Directory cap. libcapos-posix now exposes functional open, read, write, close, lseek, opendir, readdir, and closedir for the v0 Directory-backed path, with readdir backed by a lazy Directory.list snapshot and lseek backed by the fd-table file position plus File.stat for SEEK_END. make run-posix-file boots a C process that creates "/hostname", writes and seeks through it, reads the full payload and tail, lists the root directory to find the file, proves relative paths still fail closed, exits 0, and halts QEMU.
  • POSIX Adapter v0 – Identity stubs: commit 1a8a9896 (2026-05-23 06:51 UTC) closes the P1.4 identity-stub surface. libcapos-posix now exposes getpid, getuid, and getgid from the existing unistd-style header; getpid returns the stable capos-rt bootstrap pid for the current process, while getuid and getgid return the single-identity uid/gid 0. make run-posix-identity boots a C process that prints its identity, fork/execs the same binary through the recording shim, proves the child observes a distinct pid, exits both processes cleanly, and halts QEMU. The later make run-posix-printf proof closes the printf/string subset with live formatted output, string/mem, numeric conversion, and ctype markers. Commit 90e64011 (2026-05-23 08:11 UTC) closes the signal/time surface: make run-posix-signal-time proves Timer-backed time/sleep observations plus fail-closed kill/raise signal-delivery stubs. Remaining dash-port gates are dash vendoring/patching, the multi-translation-unit C build, and run-posix-shell-smoke.
  • POSIX Adapter v0 – Pipe + fork-for-exec plus direct posix_spawn Smoke: POSIX adapter Phase P1.3 first closed at commit ceaf5475 (2026-05-07 10:04 UTC) under an in-process x86_64 setjmp/longjmp recording-shim contract. A subsequent fix slice on top – spanning commits 44838ad7 (2026-05-07 11:07 UTC) through 7c08501c (2026-05-07 14:24 UTC) and integrated into mainline-tracking history via merge commit b8c7fb43 (2026-05-07 18:16 UTC) – replaced setjmp/longjmp with the return-the-pid contract because the longjmp re-entered fork()’s already-deallocated stack frame (undefined behaviour). An iter-15..iter-22 SMP-correctness hardening cycle followed, extending the fix slice through commit 05b52873 (2026-05-07 21:07 UTC); each iteration closed a distinct kernel pipe race surface (transport-error CQE on saturated waiter restore at iter-15, deferred-error retry queue + nested-fork reset at iter-16, write-overflow queue preserving partial-write CQE at iter-17, buffer-aware EOF + combined-cap waiters + child-order fd replay + EBADF on Moved at iter-18, close+write race + fd-recording precheck + Moved self-dup2 at iter-19, same-end waiter completion on close at iter-20, close_side publishing under the buffer lock at iter-21, and the matching in-lock close re-check in handle_write at iter-22). make run-posix-pipe-smoke boots the focused manifest, links the demos/posix-pipe-shim/main.c parent and demos/posix-pipe-child/main.c child against libcapos.a + libcapos_posix.a, drives pipe(); pid_t child = fork(); if (child == 0) { dup2(); close(); child = execve(...); } close(); read(); waitpid(child); end to end through the kernel Pipe capability and the recording-shim ProcessSpawner Move-grant path, and prints [posix-pipe] read 14 bytes: hello via pipe from the parent. The parent and child both exit 0 cleanly and the QEMU scheduler halts. fork() returns 0 unconditionally; dup2/close between fork and execve record into a TLS window without mutating the parent fd table; execve() drains the recording and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX). The direct public posix_spawn() successor proof landed at commit b8fb3131 (2026-05-13 10:15 UTC): libcapos-posix exposes posix_spawn() plus posix_spawn_file_actions_init/destroy/adddup2/addclose, and make run-posix-spawn-smoke creates a pipe, uses file actions to move the existing posix-pipe-child stdout onto the pipe, reads [posix-spawn] read 14 bytes: hello via pipe, waitpid()s the child, and halts after both processes exit 0. argv and envp are accepted for source compatibility but remain undelivered until LaunchParameters / environment support lands. The Console-backed stdio successor proof landed at commit aa6a56d7 (2026-05-13 11:03 UTC): libcapos-posix maps POSIX fd 1/2 to the granted Console cap when no stdio_<N> Pipe grant already occupies the slot, keeps fd 0 closed without stdin backing, and make run-posix-stdio-smoke prints distinct stdout/stderr markers through POSIX write before proving the no-stdin refusal path.
  • WASI Host Adapter Phase W.4 – random_get production wiring: Phase W.4 closed at commit b0f6939f (2026-05-07 20:09 UTC); Phase W.3 closed at commit ca41ecc1 (2026-05-07 18:29 UTC; the W.3 narrative stamps from 2026-05-07 18:25 UTC predate the feat commit by a few minutes); Phase W.2 closed at commit 7bfcb1d8 (2026-05-07 10:53 UTC) across four sub-slices. The bounded environment grant smoke landed at branch commit 5f5028e7 (2026-05-13 11:05 UTC). Sandboxed wasm32-wasi is now a booted language path on capOS; the W.2 slice delivered the first WASI-hosted, sandboxed portable-payload path (native C boots already existed via the libcapos C-substrate make run-c-hello and the historical POSIX-adapter DNS resolver); W.3 added the per-instance argv text grant; W.4 wires Preview 1 random_get through the kernel EntropySource cap; the 2026-05-13 follow-up adds the bounded initConfig.init.wasiEnv text grant as the v0 environment source. make run-wasi-hello-rust, make run-wasi-hello-c, make run-wasi-cli-args, make run-wasi-env, make run-wasi-random (granted), and make run-wasi-random-ungranted (refusal) are the regression, environment-grant, and W.4 gates; the environment smoke proves one granted value reaches a Rust wasm32-wasip1 payload through Preview 1 environ_get / environ_sizes_get; the random granted variant reads N=64 bytes through random_get and prints [wasi-random] entropy_bytes=64 entropy_bound_ok=true, and the ungranted variant observes ERRNO_NOSYS = 52 from the closed-fail refusal branch which never enters the kernel. Wall-clock support stays deferred: clock_time_get(CLOCKID_REALTIME) keeps the W.2 sentinel ERRNO_NOSYS until capOS has a typed WallClock/RealTimeClock cap. The next selectable WASI work is Phase W.5 (Preview 1 filesystem), blocked on the missing Namespace/File/Store cap surface.
  • POSIX Adapter v0 – DNS Resolver Smoke: POSIX adapter Phase P1.2 Phase B completed at commit b4f1a400 (2026-05-05 21:21 UTC). The now-retired make run-posix-dns-smoke booted the focused manifest, linked the demos/posix-dns-resolver/main.c C binary against libcapos.a + the new libcapos_posix.a, sent a DNS A query for example.com through the kernel UdpSocket capability to QEMU slirp’s resolver at 10.0.2.3:53, decoded the answer-section IN/A record, and printed [posix-dns-resolver] resolved example.com -> <ipv4> (e.g. 104.20.23.154; the upstream resolver picks the value, the harness grepped loosely). The target now exits before QEMU because the qemu-only kernel UdpSocket owner was removed; rebuild the resolver on the Phase C userspace network stack before using it as validation. The vendor/dns-c-wahern/ snapshot at rel-20160808 is in-tree as a structural reference but not yet compiled into the smoke; widening the POSIX surface so dns.c can build whole is follow-on work after P1.3.
  • In-Process Threading Scalability: completed at commit 136b72de (2026-05-01 14:58 UTC) after the benchmark repair replaced the invalid 1 MiB/spinning-parent four-worker shape with a blocking-parent 16 MiB/64-round shape. Reaffirmed against the then-current single-global-queue scheduler on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556 with the formal capOS+Linux 5-run pair pinned to physical-core logical CPUs 0,1,2,3: capOS work 1.883x and total 1.787x clear the configured 1-to-2 gates; matching Linux pthread baseline 1.988x/1.987x validates the shape. The 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy (capOS 1.566x/1.538x vs Linux 3.963x/3.858x); Phase D WFQ later manually accepted the recorded 1-to-4 diagnostic with capOS 3.088x/2.700x and matching Linux 3.974x/3.850x. Four-worker capOS speedup remains evidence of material improvement, not a completed linear-scaling claim.
  • Multi-Process SMP Concurrency: completed at commit 3fb89923 (2026-04-30 09:45 UTC), with repeated KVM-backed process-scale evidence in target/smp-process-scale/cycle-balanced-default/ (1.608x 1-to-2 speedup) and ordinary run-smoke/run-spawn coverage under -smp 2.
  • Session-Bound Invocation Context: completed at commit 503abc9 (2026-04-30 02:26 UTC), with Gate 4 implementation verification recorded at commit faeff80 (2026-04-29 21:39 UTC). The milestone includes one immutable process session, privacy-preserving endpoint caller metadata, explicit disclosure gating, session-aware transfer scopes, chat migration, terminal/stdio bridge liveness guards, adventure shared-service cleanup, and aligned paper evidence/status text.
  • Installable System: completed through commit 12b8334a (commit timestamp 2026-06-07 18:19 UTC; task closeout 2026-06-07 18:20 UTC) for the bounded local/QEMU contract. The milestone includes persistent data-region mount, config-overlay compose/merge fallback, generation/rollback machinery, integrated installable disk packaging, target-disk install, first-boot provision, update/rollback, and structural proposal/body wording reconcile. It preserves the RAM-only Namespace caveat and does not claim secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, full userspace smoltcp/L4 readiness, or full durable account policy.

Active visible milestone:

  • GCE Self-Hosted Web UI: serve the remote-session Web UI through the Phase C userspace network stack, prove the local cloudboot L4 path, and then prove private GCE reachability before any public endpoint. The selected milestone now has the userspace smoltcp-backed TcpListenAuthority local path proved by cloud-prod-userspace-network-stack-smoltcp-local-proof and local DHCP/IPv4 address/default-route/ARP configuration proved by cloud-prod-network-stack-dhcp-ipv4-config-local-proof; the cloudboot authority inventory (remote-session-webui-cloudboot-authority-inventory) is done and records the Web UI service authority boundary for the local L4 proof. The local Web UI L4 proof (cloud-prod-remote-session-web-ui-l4-local-proof) is done: the Phase C userspace network-stack process serves remote-session-web-ui on guest port 8080 with the full fixed-name bundle, login, a backend-held SystemInfo call, logout/stale failure, and the manual viewer under make run-cloud-prod-remote-session-web-ui-l4. Web UI session hardening (remote-session-web-ui-session-hardening) is done (2026-06-09), and Web UI connection bounds (remote-session-web-ui-connection-bounds) are done (2026-06-09): per-connection request-read/response-send deadlines in the Web UI client with a drip-feed abandon proof on the L4 gate. The narrow legacy kernel socket-path retirement is done; non-qemu manifests now reject kernel network_manager / tcp_listen_authority grants and leave those sources as qemu-only fixtures. The broader cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal cleanup is also done: the kernel no longer depends on smoltcp, qemu-only kernel TCP/UDP socket entry points fail closed, and the remaining virtio-net code is lower-layer QEMU fixture evidence rather than production cloud socket ownership. The local cloud-prod-remote-session-web-ui-l4-local-proof gate consumed the done DHCP/IPv4 task and landed. Legacy GCE virtio-net Web UI serving is done locally (cloud-gce-legacy-virtio-webui-serving-local-proof, 2026-06-11), the public-ingress browser hardening set (public-origin policy, SameSite policy, JSON content-type guard, headers/CSP, forwarded-scheme trust, /healthz, in-guest login hardening) is done on the L4 gate, and the no-spend provider-harness gates (private preflight, private/public evidence validators, ingress plan, teardown engine, provider-command allowlist) are done as stub-fixture evidence. cloud-gce-private-self-hosted-webui-proof remains on hold on missing firewall IAM and per-run billable authorization. Public GCE ingress and TLS remain under the separate on-hold cloud-gce-public-self-hosted-webui-ingress-tls task and require explicit authorization; the local fixture gates bound that future run but do not authorize exposure.

Paused visible milestone:

  • SSH Shell Gateway: ssh reaches the capOS login/native shell flow through an SSH-backed TerminalSession in QEMU, using host-local forwarding, public-key authentication, denied unsupported SSH features, and the same child shell capability boundary proven by Telnet. This remains planned Stage 7 work, but network-backed shell delegation should wait for durable remote-account/key prerequisites.

Candidate next visible milestones:

  • Storage Capability Substrate: add RAM-backed Store/Namespace first, then BlockDevice, local disk, and a read-only filesystem proof if the block path is ready.
  • Serial Diagnostics And AWS Serial Boot: extend the current bounded COM1 diagnostics console with richer device dumps and prove the same imported image path on AWS. GCP imported-image serial boot is already recorded.
  • Remote Shell Access: SSH, Telnet development access, and basic WebShell over the capability terminal model after session lifecycle, durable key/account, and network prerequisites are credible.
  • Cloud follow-ups after the GCP-first provider rollup: public L4 ingress and SSH/WebShell productization, AWS/Azure provider ports, broader storage variants, high-throughput/multiqueue NIC readiness, and separate cloud benchmark reruns. The completed GCP rollup record is cloud-usable-instance-provider-nic-storage.
  • Agent Shell and federated chat remain future candidates, not the default next milestones ahead of the driver/storage/cloud bring-up ladder.

Select the next milestone in docs/tasks/state.toml only after the current selected milestone is achieved and recorded, or when the user explicitly changes the selected milestone. Update or add task records and linked backlog/proposal decomposition in the same change when the new milestone needs different execution context.

Backlog

Detailed task decompositions for work that is not useful in mandatory agent context.

Start from docs/tasks/state.toml for the selected milestone, then use root-level task records under docs/tasks/ to choose dispatchable work and the source links in those records to reach the relevant long-form decomposition.

Archived

Retained for historical context only – do not select work from these.

Runtime, Networking, And Shell Backlog

Detailed decompositions for runtime, networking, shell, agent, and web shell work. docs/tasks/README.md links here but should not inline these subtasks.

Scheduler/Park Measurement

Pre-thread dispatch instrumentation and compact-vs-generic ParkBench comparison are historical context. In-process threading later closed the first blocked/resume measurement path with QEMU samples for private ParkSpace wait/wake. Future measurement work should be tied to a concrete runtime or SMP change, especially per-thread/per-CPU ring behavior.

In-Process Threading Implementation

Current implementation subgates recorded in the old workplan were all marked complete, but the parent task still appeared unchecked. Before starting follow-up work, reconcile this status against code, docs/roadmap.md, and docs/changelog.md.

Completed subgates retained for context:

  • Add Thread state with per-thread kernel stack, registers, and FS base.
  • Change scheduling from process-level to thread-level while preserving process-owned address spaces and cap tables.
  • Add ThreadSpawner/ThreadHandle and basic join/exit smoke.
  • Implement the first park authority capability and contended-path measurements.

Runtime Ring Reactor Bridge

The current kernel ABI still exposes one process-owned capability ring. A multithreaded runtime therefore needs a compatibility bridge until per-thread kernel rings land.

Ordered gates:

  • Add one runtime-owned process-ring CQ drainer.
  • Map user_data completions back to ParkSpace-backed per-thread wait records.
  • Prove sibling threads can issue ordinary calls and receive out-of-order completions without both draining the process CQ.
  • Retire the bridge when per-thread capability rings and completion routing by generation-checked ThreadRef become the kernel ABI.

Telnet Shell Demo

Historical, fully retired track (2026-06-10). The visible outcome below was delivered and later retired together with the qemu-only kernel TCP listener and socket owner: make run-telnet and its sibling kernel-socket smokes now exit with retirement diagnostics, the telnet-gateway / ssh-gateway-terminal-host / network-client demos and their manifests are removed, the kernel SocketTerminalSession is deleted, and TcpSocket.intoTerminalSession fails closed in every dispatch path. Remote shell access belongs to the in-guest login surface (web UI over the Phase C userspace network stack) and the future SSH Shell Gateway (docs/proposals/ssh-shell-proposal.md); a network-backed TerminalSession must be re-built as a userspace terminal-session service over the userspace TCP stack if a byte-stream terminal transport is needed again.

Original visible outcome: make run-telnet boots capOS in QEMU with hostfwd=tcp:127.0.0.1:2323-:23, a telnet-gateway boot service listens on guest port 23 through the kernel TCP capability surface, and a scripted host smoke runs telnet 127.0.0.1 2323, logs in through the existing credential flow, issues one shell command, and sees a clean disconnect.

Ordered gates:

  • Add the Phase B TCP interfaces to the canonical shared schema: NetworkManager, TcpListener, and TcpSocket. Keep this milestone TCP-only; UdpSocket, DeviceMmio, DMAPool, and Interrupt are decomposed-NIC / userspace-driver scope.
  • Replace the synthetic 10 ms smoltcp clock with scheduler-driven polling on real TICK_COUNT; the HTTP proof now persists as a retained smoltcp runtime polled from scheduler ticks. Depends on Timer.
  • Close the delegated endpoint relabeling gap before exposing shell launch over Telnet. A remote shell user must not be able to type an arbitrary endpoint identity such as badge 200 and spawn a child that acts as a different chat/adventure participant. Omitted shell syntax now preserves the delegated source identity, and the low-level spawn hardening proof keeps the legacy badge-zero encoding covered. The containment gates in docs/backlog/stage-6-capability-semantics.md are complete; do not expose Telnet shell launch to any future badge-selection regression. Normal shell help and smoke-help expectations no longer advertise badge syntax.
  • Implement NetworkManager, TcpListener, and TcpSocket as kernel CapObjects wrapping the existing smoltcp smoke path. Reuse ring dispatch; do not add syscalls. accept and recv may be blocking calls for this milestone, with bounded result buffers and explicit close behavior. Initial implementation landed in commit 7446e04 at 2026-04-25 14:48 UTC; follow-up review fixes removed timer-path allocation from deferred completion, hardened result-cap cleanup, and added make qemu-network-client-harness coverage for userspace NetworkManagerClient, TcpListenerClient.accept, and TcpSocketClient send/recv/close.
  • Complete the next endpoint-identity containment transition before unrelated Telnet gateway work: Gate 1 representation plus the minimum trusted mint path landed as the historical service-object routing proof. The selected follow-on is now Session-Bound Invocation Context: keep production remote shell launch blocked until one-session-per-process, privacy-preserving endpoint caller-session metadata, and shared-service migration settle.
  • Add the socket-backed terminal handoff needed by the demo. capos-shell must still receive a cap named terminal with TerminalSession interface id, backed by the accepted TCP socket. Do not pass raw TcpSocket, ByteStream, or StdIO as a replacement for the login terminal boundary. Satisfy this either by adding typed service-export / grant support so a userspace telnet-gateway endpoint can be presented as a TerminalSession, or by implementing a real kernel socket-backed TerminalSession CapObject. Implemented as TcpSocket.intoTerminalSession, which consumed a connected socket cap and returned a move-only TerminalSession result cap, backed by the kernel SocketTerminalSession cooked-mode line-discipline shim. Retired 2026-06-10: the kernel socket owner behind it was removed by the Phase C userspace network-stack migration, so the shim and its handoff were deleted; TcpSocket.intoTerminalSession now fails closed with a retirement error in every dispatch path, and the consumer smokes (qemu-network-client-harness, run-telnet, run-ssh-gateway-terminal-host) exit with retirement diagnostics.
  • Add a telnet-gateway demo binary and system-telnet.cue manifest. The trusted demo gateway gets bootstrap NetworkManager and ProcessSpawner authority, plus pass-through creds, sessions, audit, and broker caps needed to spawn capos-shell with the same login/session semantics as the UART shell. The spawned shell must not receive raw network or broad process-spawn authority.
  • Add make run-telnet and a scripted qemu-telnet-harness host smoke that drives the full login/command/exit sequence and requires a proof line.
  • Document in docs/proposals/networking-proposal.md and docs/proposals/shell-proposal.md that telnet is demo-only plaintext, binds only to host loopback in the QEMU harness, preserves the TerminalSession boundary, and will be replaced by the SSH gateway once host-key, user-key, account, audit, and persistence prerequisites land. Implemented by branch commit 5d11b12 at 2026-04-25 20:06 UTC. make qemu-telnet-harness proves 127.0.0.1:2323 -> guest :23, password login, caps, the session command, and clean exit with no password, raw NetworkManager, raw ProcessSpawner, raw TCP, or unknown-cap leakage in the host transcript. Replacing the gateway’s factory network/spawn authority with scoped listener and shell-launch caps is tracked in task records; it is not required for the host-local visible demo.

Telnet Over TLS Optional Track

Telnet over TLS is not a default main access interface. Keep it as an optional future transport for service terminals or certificate-heavy deployments after the certificate/TLS, durable identity, session lifecycle, audit, and scoped listener-authority prerequisites exist. SSH remains the main production operator CLI track, and WebShellGateway remains the main browser/agent access track.

Ordered gates before this can be considered production-shaped:

  • Certificate/TLS server configuration, private-key custody, trust-store, and rotation primitives exist outside the kernel TCB.
  • Client identity maps through durable account/session policy, preferably mTLS client certificates with password fallback only by explicit policy.
  • Session lifecycle close propagation exists for terminal disconnect, process-tree exit, explicit logout, and administrator revocation.
  • The gateway receives only scoped listener, TLS config, terminal-factory, session, broker, audit, and restricted-launch grants; no raw broad network or process-spawner authority.
  • QEMU and host-network harnesses prove TLS handshake, failed client-auth behavior, terminal login, disconnect cleanup, and transcript redaction.

Remote Session CapSet Clients

Programmatic and GUI remote clients are a sibling track to terminal shells. A regular host app – CLI, native GUI, Tauri backend, webapp gateway, desktop tool, service client, or agent runner – should authenticate through the capOS session/admission path, obtain a broker-issued remote CapSet view in its trusted backend, and call provided capabilities over Cap’n Proto RPC. It should not be forced to spawn capos-shell, and it should not be reduced to a special-purpose chat proxy.

The first implementation slice exposes this path as a host-local development endpoint. Default make run starts remote-session-capset-gateway and forwards guest port 2327 to a loopback host port, preferring 127.0.0.1:2327 while falling back to a free port when another QEMU run is already using it. The focused make run-remote-session-capset-interop harness runs the Linux Rust client, authenticates through SessionManager, lists a broker-shaped remote CapSet, calls session/system-info DTO operations, and proves denial/stale paths. This slice uses schema-framed Cap’n Proto DTOs; standard capnp-rpc proxy transport and endpoint-backed service calls remain the next gates.

Detailed decomposition lives in Remote Session CapSet Client. Keep this track coordinated with SSH/WebShell work:

  • SSH remains the production operator CLI terminal transport.
  • WebShellGateway remains the browser/agent terminal and tool-proxy surface.
  • Remote session CapSet clients are the programmatic and UI API surface for Linux host tools, desktop/Tauri apps, webapp gateways, service clients, and server-side agent runners.
  • Optional UI-composition caps let capOS-side services and agents propose bounded panes, command palettes, visualizations, layout hints, and theme tokens through host-validated surfaces instead of treating “remote GUI” as only a window or terminal frame.
  • All three paths consume the same SessionManager and AuthorityBroker model and must support non-password admission methods where policy enables them.
  • Browser JavaScript and model providers must not receive raw capOS caps; gateway-side workers hold the session CapSet and expose only terminal frames, command metadata, or bounded tool requests.

SSH Shell Gateway

Visible outcome: make run-ssh-shell boots capOS in QEMU with a host-local forward to guest SSH, an ssh-gateway service authenticates a normal OpenSSH client with a configured public key, launches capos-shell with an SSH-backed TerminalSession, runs one shell command, and disconnects cleanly. The shell must see the same terminal/session/broker boundary as the Telnet demo, not raw TCP or SSH protocol authority.

Blocked by: Telnet Shell Demo for socket-backed TerminalSession, cryptography/key-management for sign-only host keys, local account/key records for authorized SSH keys, audit records for remote authentication decisions, and persistent storage before production host or authorized keys are treated as durable.

Closeout prerequisite: before this milestone closes, reconcile its target name and host-harness placement with the run-target/init-mandate policy in docs/backlog/run-targets-and-init-policy.md (Gate A naming split, Gate B init mandate, Gate C test split, Gate D default-make run integration). The current make run-ssh-shell working name and any scripted host harness may need to become test-ssh-shell and be relocated, and default-run exposure has to be addressed there, not as another run-ssh-* recipe.

Ordered gates:

  • Document the first SSH gateway contract in docs/proposals/ssh-shell-proposal.md: gateway authority, host-key custody, authorized-key mapping, accepted channel set, denied SSH features, terminal handoff, audit, resource limits, and teardown.
  • Close or explicitly preserve the scoped gateway authority gap for SSH before implementation: the gateway must receive a manifest-declared scoped listener or listener factory for only the configured SSH port, and the spawned shell must receive no raw NetworkManager, TcpListener, TcpSocket, or transport protocol authority. A temporary host-local demo compromise must stay documented in a task record and the harness must prove the child boundary with caps. - [x] Scoped listener authority sub-slice: tcp_listen_authority manifest grants use the cap badge as a validated TCP port and mint a one-shot TcpListenAuthority that can create only that listener; make run-tcp-listen-authority proves generic init can forward the scoped cap to a child without raw NetworkManager.
  • Terminal-host wiring sub-slice: ssh-gateway-terminal-host used manifest-scoped TcpListenAuthority on the SSH development port and RestrictedShellLauncher to hand a socket-backed TerminalSession to capos-shell while proving the child lacked raw network, TCP, spawn, key-store, host-key, SSH gateway, terminal-factory, and launcher authority. This closed the scoped gateway authority gap for the bounded host-local proof. The demo and its smoke were retired 2026-06-10 with the kernel socket owner and SocketTerminalSession; the final OpenSSH transport must be rebuilt as a terminal host over the userspace network stack.
  • Add manifest-declared shell launch authority for the gateway. Prefer a shell-only launcher or supervisor grant that can start only capos-shell with reviewed pass-through caps; do not grant broad ProcessSpawner authority to the SSH gateway unless it is explicitly recorded as a host-local development compromise. - [x] Restricted shell launcher sub-slice: restricted_shell_launcher manifest grants forward an init-held RestrictedShellLauncher cap to a child service. make run-restricted-shell-launcher proves the child service has no raw ProcessSpawner, launchShell has no binary selector and launches only capos-shell, session/profile mismatch and dangerous grant attempts fail closed, and the spawned shell uses the supplied session while lacking raw network, TCP, host-key, authorized-key-store, SSH gateway, and restricted-shell-launcher authority.
  • Add schema/design stubs for the minimum SSH support objects: SshGateway or equivalent service contract, sign-only SshHostKey wrapper around a KeyVault/PrivateKey, AuthorizedKeyStore, and SSH-backed TerminalSession construction. Do not expose private-key bytes, raw authorized-key storage, or vault administration to the spawned shell. Implemented as schema/type-surface stubs for SshGateway, SshHostKey, AuthorizedKeyStore, SshTerminalFactory, TcpListenAuthority, and RestrictedShellLauncher; no bootable kernel or userspace implementation is implied by this gate.
  • Add a development host-key path. Manifest-seeded keys may be used only for QEMU proof and must be labeled non-production; production host keys require the key-management and storage path. Implemented as kernelParams.sshDevelopmentHostKey plus the narrow ssh_development_host_key kernel source. The focused proof is make run-ssh-host-key; the development cap signs bounded ssh-ed25519 exchange hashes from the manifest seed, verifies against the configured public key in QEMU, denies wrong algorithms, and remains explicitly non-production. Persistent production host-key storage, rotation, and key management remain future work.
  • Add public-key user authentication. Accepted SSH keys map to principals and allowed shell profiles; SessionManager mints the session only after signature verification, and AuthorityBroker still decides the actual shell bundle. - [x] Public-key session bridge sub-slice: SessionManager.sshPublicKey checks a configured AuthorizedKeyStore record plus bounded fixture auth bytes/signature, mints a UserSession with the accepted principal/profile and publicKey auth strength, and make run-ssh-public-key-auth proves unknown, disabled, unsupported, and bad-signature paths fail closed before broker bundle minting. This is not full SSH transport authentication or shell launch wiring. - [x] AccountStore-bound session sub-slice: SessionManager.sshPublicKey consults the bootstrap RamAccountStore after signature verification (lookup_by_principal), so non-Active account statuses (Disabled, Locked, RecoveryOnly) and missing principals fail closed before a session is minted. Each denial cause maps to a stable, principal-blanked auth= audit code (ssh-key-unknown, ssh-key-disabled, ssh-key-profile-not-allowed, ssh-bad-signature, ssh-account-missing, ssh-account-disabled, ssh-account-locked, ssh-account-recovery-only, ssh-account-lookup-failed, ssh-profile-kind-invalid, ssh-profile-not-interactive, ssh-auth-bytes-invalid). make run-ssh-public-key-auth covers the non-account-status codes; the ssh-account-* codes need an AccountStoreManagerCap kernel cap source for runtime-mutated QEMU proofs (tracked in docs/backlog/local-users-management.md Gate 2).
  • Reject unsupported SSH features with protocol failures and audit reason codes: password auth when disabled, exec, SFTP/subsystems, port forwarding, agent forwarding, X11 forwarding, arbitrary environment import, and multiple active shell channels. - [x] Policy-surface sub-slice: capos-config::ssh_policy returns allowed/denied decisions, SSH protocol failure classes, and stable audit reason codes for the narrow allowed path and the denied feature set, including second session-channel opens before any shell request. Password auth remains fail-closed until a real verifier/backoff path is part of the gateway policy. make run-ssh-feature-policy proves the table in QEMU. The full gateway item remains open until this policy is invoked by ssh-gateway.
  • Implement the gateway as a terminal host. It owns SSH packet/channel state and gives capos-shell only a cap named terminal plus the normal scoped launch grants. The child must not receive raw network, host-key, authorized-key-store, key-vault, or broad spawn authority. - [x] Bounded terminal-host wiring sub-slice (retired 2026-06-10 with the kernel socket owner and SocketTerminalSession; the smoke now exits with a retirement diagnostic and a future terminal host must target the userspace network stack): make run-ssh-gateway-terminal-host proved a generic-init child service can combine scoped TcpListenAuthority, AuthorizedKeyStore, SessionManager, AuthorityBroker, and RestrictedShellLauncher grants to deny an unknown key, mint a publicKey session from a configured key, reject a mismatched broker profile, accept the matching broker profile, convert one host-local TCP socket into a TerminalSession, and launch capos-shell without giving the shell raw network, process-spawner, TCP listener/socket, host-key, authorized-key-store, SSH gateway, SSH terminal-factory, or restricted-shell-launcher authority. The proof keeps the listener service-live across shell exits, proves a second host TCP connection succeeds, and externally stops QEMU through the harness pidfile instead of treating service exit as success. This remains a bounded plain-TCP proof and does not complete full SSH packet/channel ownership or the OpenSSH harness gate.
  • Add system-ssh-shell.cue, make run-ssh-shell, and a host harness using ssh against the forwarded port. The harness must prove one successful public-key login, one shell command, clean exit, unknown-key denial, disabled-password denial, denied forwarding/subsystem requests, and cleanup after client disconnect. - [ ] OpenSSH version-exchange slice: add a real ssh-gateway service and system-ssh-shell.cue skeleton that accepts one host-local OpenSSH TCP connection, exchanges RFC 4253 identification strings, records the client software/version in bounded audit/proof output, and disconnects before key exchange without launching a shell. The normal compatibility harness should use /usr/bin/ssh; a separate low-level hostile TCP/banner fixture should prove malformed banners plus overlong identification strings fail closed. - [ ] KEXINIT and algorithm-selection slice: parse the unencrypted KEXINIT binary-packet exchange far enough to negotiate a pinned development algorithm set, reject unsupported algorithms with SSH disconnects, and keep the negotiated algorithm names out of any authority decision. The initial reviewed set should be exactly one modern KEX, ssh-ed25519 host keys, one AEAD cipher/MAC pair, and none compression until rekey and broader algorithm policy exist. - [ ] Development key-exchange slice: complete the negotiated KEX, derive traffic keys from the shared secret, exchange hash, and session id per RFC 4253, call SshHostKey.signExchangeHash for the SSH exchange hash, and complete the OpenSSH handshake without exposing private host-key bytes or raw entropy to the gateway’s child shell. Entropy is input for ephemeral KEX material, padding, and challenges; this remains non-production until host keys are durable and the entropy source has a reviewed production-quality policy. - [ ] OpenSSH public-key userauth slice: bind the OpenSSH userauth transcript to SessionManager.sshPublicKey so the accepted key maps to the configured principal/profile, unknown keys are denied generically, and disabled password auth returns the expected SSH failure without invoking CredentialStore. - [ ] Channel policy slice: invoke capos-config::ssh_policy for session-channel open, PTY, window-change, shell, exec, subsystem, forwarding, agent, X11, environment, and second-channel requests. The harness must prove the allowed shell path plus the denied feature requests with protocol-visible failures and sanitized audit reason codes. - [ ] SSH terminal launch slice: replace the plain-TCP terminal-host driver with the SSH channel-backed terminal path, launch capos-shell through RestrictedShellLauncher, run session, caps, and exit over OpenSSH, and prove disconnect cleanup for both client-close-before-shell and shell-exit-before-client-close.
  • Update docs/proposals/shell-proposal.md, docs/proposals/boot-to-shell-proposal.md, docs/security/trust-boundaries.md, and docs/proposals/index.md when implementation begins so remote SSH login policy, terminal authority, and audit records stay aligned with the code.

Decomposed NIC Milestone

Move the NIC driver and TCP/IP stack out of the kernel into dedicated userspace processes after the Telnet Shell Demo made the socket interfaces capability-shaped. The Phase C userspace NIC driver and smoltcp network-stack process have since landed and own the production socket path; make run-telnet and the other kernel-socket consumer smokes are retired rather than preserved end-to-end, because the qemu-only kernel TCP listener and socket owner were removed with that migration.

  • Define first DeviceMmio, DMAPool, and Interrupt schemas (landed with the DDF capability surface).
  • Move virtio-net ownership into a userspace driver process holding only DeviceMmio, Interrupt, and DMAPool caps (Phase C userspace NIC driver slices).
  • Split smoltcp into a separate userspace network-stack process that holds the Nic cap from the driver and re-exports the Phase B socket interfaces (Phase C userspace network-stack process).
  • The kernel no longer depends on smoltcp, and the userspace network-stack process re-exports the socket interfaces. The make run-telnet end-to-end confirmation was retired instead of re-proven: the gateway demos sat on the removed kernel socket owner, and remote-shell coverage moved to the in-guest login surface and the future SSH gateway over the userspace stack.

Agent Shell / Agent Runner

The native shell’s agent mode must land before exposing the shell through a browser. The shell remains the trusted runner and session-cap holder. The model service receives prompts and returns structured tool calls, but never receives session caps, terminal caps, launcher authority, raw tokens, or secrets. Use a deterministic test model for the first proof.

Visible outcome: make run-agent-shell boots capOS in QEMU, grants capos-shell a broker-issued LanguageModel cap plus per-tool permission map, enters agent mode, exposes the current session bundle as typed tool descriptors, executes one read-only tool call automatically, requires consent or step-up for a mutating/admin-shaped call, handles user cancellation, and records redacted audit output.

Ordered gates:

  • Add the first agent-runner schema/interfaces: LanguageModel, ModelInfo, ToolDescriptor, ToolCall, ToolResult, permission mode metadata, and bounded streaming/cancel semantics. Keep tool calls structured; do not parse model text as shell commands.
  • Extend AuthorityBroker session profiles so an operator shell can receive a LanguageModel cap and a per-tool permission map without receiving model-admin, model-catalog, or provider-token authority.
  • Add a deterministic in-tree LanguageModel test service that emits scripted tool calls for QEMU proofs. Do not block this milestone on large local model weights, remote providers, GPU, or storage.
  • Implement native shell agent mode: build the tool table from granted session caps and schema metadata, stream model turns, gate each tool call through auto / consent / stepUp / forbidden, invoke only the capabilities held by the shell runner, and feed outcomes back into the loop.
  • Wire consent, step-up, cancellation, timeout, quota, and audit behavior. User interrupts beat model momentum; denied or cancelled tool calls become ordinary tool outcomes instead of hidden control flow.
  • Add make run-agent-shell and a scripted QEMU harness that proves read-only auto execution, denied forbidden/admin tool exposure, one consent or step-up prompt, cancellation, and redacted audit records.
  • Update docs/proposals/llm-and-agent-proposal.md, docs/proposals/shell-proposal.md, and docs/tasks/README.md to record that WebShellGateway hosts this agent-capable shell/runner instead of defining a separate browser-side agent authority model.

WebShellGateway

Add the browser-hosted terminal and authentication gateway after both remote TerminalSession proof and agent shell are in place. The gateway owns HTTP/WebSocket or equivalent transport, TLS/origin/RP-ID validation, WebAuthn challenge/response, terminal rendering, and session teardown. It launches the same agent-capable native shell with the same broker-issued session profile.

Blocked by: Telnet Shell Demo for socket-backed TerminalSession, Agent Shell / Agent Runner, passkey challenge/credential support in auth/session services, and TLS/origin/RP-ID policy. OIDC is a follow-up path on the same gateway, not a prerequisite for the first WebAuthn shell.

Visible outcome: make run-webshell boots capOS in QEMU with host-local forwarding to the web gateway, a headless browser harness opens the terminal UI with a virtual WebAuthn authenticator, authenticates, runs one shell or agent command, logs out or closes the tab, and verifies clean shell/process/session teardown plus a recorded transcript/proof line.

Ordered gates:

  • Define the web terminal stream protocol over WebSocket or an equivalent browser transport: input, output, resize, paste, close, cancellation, flow control, session IDs, and bounded buffering.
  • Add WebAuthn/passkey credential and challenge support: public-credential records, single-use bounded challenges, entropy fail-closed behavior, origin/RP-ID binding, user-presence/user-verification policy, sign-count handling, rate limiting, and redacted audit events.
  • Add TLS and browser origin policy for QEMU and deployment modes. The first harness may use a local development trust path, but the gateway must have explicit Host/Origin/RP-ID checks and no production plaintext mode.
  • Implement WebShellGateway as a terminal host service: accept browser sessions, authenticate, request the narrow shell/agent bundle from AuthorityBroker, create or wrap a web-backed TerminalSession, spawn capos-shell, proxy terminal events, and release all session resources on logout, tab close, timeout, or shell exit.
  • Add system-webshell.cue and manifest/grant wiring. The gateway gets only listen/TLS/auth/session/broker/restricted-launch grants needed for the job; the spawned shell does not receive raw network, raw auth material, model-provider tokens, or broad process-spawn authority.
  • Add make run-webshell and qemu-webshell-harness with a headless browser virtual authenticator, transcript capture, login/command proof, logout/close proof, and assertions that failed auth and stale browser sessions do not leave a live shell.
  • Add optional OIDC authorization-code + PKCE login on the same gateway after the OAuth/OIDC service exists. ID-token verification and acr/amr mapping feed SessionManager/AuthorityBroker; raw tokens do not enter the shell or browser terminal transcript.
  • Update docs/proposals/boot-to-shell-proposal.md, docs/proposals/shell-proposal.md, docs/proposals/llm-and-agent-proposal.md, and security trust-boundary docs with WebShellGateway authority, auth, terminal, audit, and teardown rules.

Network Usability And Post-smoltcp Backlog

This page decomposes the work that makes capOS networking usable after the Phase C userspace L4 stack exists. It deliberately sits beside the lower-layer Phase C track in Phase C Userspace NIC Driver Relocation and the cloud/Web UI chain in Hardware, Boot, and Storage.

The first public GCE Web UI path remains IPv4-first. Its network blockers are Phase C userspace L4, DHCP/IPv4 configuration, ARP/default-route reachability, private GCE proof, and the reviewed public HTTPS ingress posture. DNS, ping, IPv6, packet tracing, and advanced transport policy improve usability and diagnostics, but they do not block first public self-hosted Web UI unless a later ingress policy explicitly chooses them as health or routing requirements.

Current State Boundaries

  • Production non-qemu L4 has a local Phase C 7c-ii(b) serve-from-userspace proof: cloud-prod-userspace-network-stack-smoltcp-local-proof boots the non-qemu cloudboot manifest, grants an application client only a userspace-served TcpListenAuthority, and completes one hostfwd TCP request/response through served TcpListener/TcpSocket caps. The qemu-only kernel smoltcp / virtio-net path still exists for local fixtures and transitional TCP/UDP caps; the legacy kernel socket owner is cleanup-only after the served-socket proof.
  • The current Nic cap is raw-frame oriented. It copies frames as inline Data through manager-owned buffers and exposes no host-physical or device-usable address to userspace.
  • The landed Nic.receive @1 is single-frame per call: it posts one RX buffer, drains one frame (or resets the device on an empty poll), and frees the buffer – it keeps no pool of RX buffers armed between calls and has no non-resetting “no frame yet” path. Multi-frame asynchronous TCP needs a sustained, keep-armed receive, designed as the receivePoll @4 bounce-RX-pool primitive in Phase C Userspace NIC Driver Relocation and landed by cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof. That slice is the prerequisite for Phase C 7c-iii (TcpListener/TcpSocket).
  • The first local DHCP IPv4 configuration proof is done: cloud-prod-network-stack-dhcp-ipv4-config-local-proof follows the served userspace smoltcp/socket proof, acquires a DHCPv4 lease over the Nic cap, installs IPv4 address/default-route state, resolves gateway and same-subnet ARP neighbors, and feeds userspace-served NetworkManager.getConfig. Renewal/rebind/expiry lifecycle, DNS option publication, and operator-visible lease status remain follow-up work.
  • A POSIX DNS smoke exists: demos/posix-dns-resolver/. It manually builds one DNS A query and sends it through the kernel UdpSocket cap to QEMU slirp DNS at 10.0.2.3. It is not a system resolver service, not a typed DnsResolver cap, and not a getaddrinfo / /etc/resolv.conf bridge.
  • IPv6 is already decomposed as a separate lane in Hardware, Boot, and Storage. Do not duplicate that lane here; link to it when diagnostics or resolver work needs dual-stack behavior.

User-facing Stories

Usable networking means operators and ordinary services can answer concrete questions without reading QEMU logs or proof tokens. Each story below maps to the task record that owns it and is classified against the first public GCE Web UI critical path stated above: Critical path items block the first public self-hosted Web UI proof; Diagnostics and Completeness items improve usability but do not block it unless a later ingress policy explicitly promotes one (see the IPv4-first scoping in the page header and DHCP Plan below).

Two of these stories are satisfied by configuration proofs that already live in the Current State Boundaries and DHCP Plan sections rather than by a usability tool: an operator gets a non-fixture address, default route, and userspace-served config status from cloud-prod-network-stack-dhcp-ipv4-config-local-proof (the first local DHCPv4/IPv4 config proof, critical path), and the basic socket substrate a server binds against comes from the Phase C socket-cap and TcpListener/TcpSocket proofs, with production manifest wiring owned by cloud-prod-userspace-network-stack-smoltcp-local-proof (critical path). The usability tasks below layer status, resolution, diagnostics, and server semantics on top of those.

Operator stories

What an operator needs to observe and diagnose the running network without holding raw NIC, DMA, or NetworkManager authority:

Operator storyOwning task recordWeb UI critical path
What interfaces exist, is link up, what MAC/address/prefix/default route/DNS config is active, and did it come from DHCP, static manifest, or a test fixture?network-operator-status-tool-local-proofDiagnostics (non-blocking)
Which sockets/listeners are active, which authority granted them, what peer/port is bound, and are calls blocked on accept/recv/send/backpressure?network-operator-status-tool-local-proof over network-transport-status-cap-local-proof (done)Diagnostics (non-blocking)
Does the stack publish DHCP-derived IPv4 address, default route, and gateway-neighbor state instead of a static fixture?cloud-prod-network-stack-dhcp-ipv4-config-local-proofCritical path
Can a service bind a listener after boot without depending on the static QEMU 10.0.2.15 assumption?cloud-prod-remote-session-web-ui-l4-local-proof over the done DHCP config proof and Phase C socket capsCritical path
Is a DHCP lease active, and what are its renewal/rebind/expiry state and operator-visible status?network-dhcpv4-lease-lifecycle-local-proof (done)Completeness (non-blocking)
Can an operator run bounded ping / route / DNS-lookup / socket-status checks?network-ping-diagnostics-tool-local-proof (done), network-operator-status-tool-local-proofDiagnostics (non-blocking)
Can an operator run bounded IPv6 ping6?network-ping6-diagnostics-tool-local-proof (over the IPv6 lane)Diagnostics (non-blocking)
Can a debugging authority capture bounded per-interface packets/summaries without arbitrary NIC, DMA, or raw network-manager authority?network-packet-trace-authority-local-proofDiagnostics (non-blocking)

Application stories

What an ordinary service or POSIX program needs to use the network through narrowly-scoped capabilities instead of raw socket/manager authority:

Application storyOwning task recordWeb UI critical path
Can a process resolve a hostname through a typed resolver capability instead of holding raw UDP socket authority?network-system-dnsresolver-cap-local-proof (done)Completeness (non-blocking)
Can POSIX software call getaddrinfo and read resolver config through the adapter without owning a broader NetworkManager?posix-getaddrinfo-system-resolver-bridge-local-proofCompleteness (non-blocking)
Can a long-lived server rely on readiness, cancellation, and backpressure instead of assuming every socket call eventually completes?network-socket-readiness-poll-cancel-backpressure-local-proof (done)Completeness (non-blocking)
Can POSIX software wait for socket readiness through poll/select over the settled readiness model?posix-socket-poll-select-bridge-local-proof (done)Completeness (non-blocking)
Can a server set keepalive and connect/accept/recv timeouts?network-transport-keepalive-timeout-policy-local-proof (done)Completeness (non-blocking)
Can a server read connection state, backpressure depth, active keepalive/timeout, congestion controller, and interface MTU/MSS?network-transport-status-cap-local-proof (done)Completeness (non-blocking)

DNS resolution is listed as Completeness rather than Critical path because the selected public ingress can route to a backend by configured address/load-balancer target; it becomes a deployment-policy dependency only under the conditions in System Resolver Plan below.

DHCP Plan

DHCP belongs in the userspace network-stack process or a narrowly-authorized userspace configuration service, not in the kernel. The kernel should stage only the minimal capabilities needed to start the network stack and deliver socket/result caps. Lease parsing, renewal timers, rebind behavior, expiry, DNS/search-domain extraction, and status reporting are policy/state-machine work and should not be added to the qemu-only kernel smoltcp path.

The ordering is:

  1. Phase C slice 7a proves smoltcp can run in a userspace process over the Nic cap.
  2. Phase C 7b, 7c-i, 7c-ii(a), and 7c-iii prove the socket-cap and TcpListener/TcpSocket substrate; 7c-ii(b) locally proves the production manifest through the selected serve-from-userspace path.
  3. cloud-prod-network-stack-dhcp-ipv4-config-local-proof is done. It implements the first local DHCPv4 lease/configuration proof: lease acquisition, IPv4 address, prefix/netmask, default gateway, and ARP neighbor proof.
  4. network-dhcpv4-lease-lifecycle-local-proof is done. It extends that first proof into the full DHCPv4 lease lifecycle. A deterministic in-process fixture DHCP/ARP responder drives the real userspace smoltcp DHCPv4 client under a harness-controlled synthetic clock through initial lease acquisition, T1 unicast renewal, T2 broadcast rebind, and lease expiry; the served NetworkManager.getConfig status surface reports a fail-closed zero state on expiry (never stale lease data) and resolves static-config precedence over a live DHCP lease; DNS server and search-domain options are extracted from the wire and held as resolver inputs without being exposed through getConfig. Proof: make run-network-dhcpv4-lease-lifecycle. The real-network initial acquisition over the Nic cap stays proven by make run-cloud-prod-network-stack-dhcp-ipv4-config.

System Resolver Plan

capOS should expose DNS through a typed resolver capability, not by making every consumer hold NetworkManager or raw UDP authority. The first resolver should be a stub resolver service, not a recursive resolver:

  • Inputs: DHCP-provided nameserver/search-domain options from the IPv4 config path and optional static manifest resolver config.
  • Authority: one narrowly-scoped UDP socket or resolver-upstream authority plus Timer; no broad NetworkManager unless the slice explicitly justifies it.
  • Output: a typed DnsResolver cap with bounded query names, record types, timeouts, response-size limits, negative/error mapping, and observable configuration provenance.
  • POSIX bridge: getaddrinfo and a bounded /etc/resolv.conf projection call into DnsResolver; POSIX callers should not parse raw DHCP state or own upstream sockets.

The typed resolver capability landed as network-system-dnsresolver-cap-local-proof. The POSIX bridge landed as posix-getaddrinfo-system-resolver-bridge-local-proof: libcapos-posix now implements getaddrinfo / freeaddrinfo / gai_strerror over a granted dns_resolver endpoint (resolver status -> typed addrinfo/EAI_*; no ambient UDP fallback), plus a read-only /etc/resolv.conf projection derived from the resolver status (writes fail-closed EACCES, absent without the cap). Proof: make run-posix-getaddrinfo. AAAA / sockaddr_in6, AI_* flags, and an /etc/services table remain follow-ups (getaddrinfo fails closed on each: EAI_FAMILY / EAI_BADFLAGS / EAI_SERVICE).

DNS does not normally block the first GCE Web UI proof because the selected public ingress path can route to a backend by configured address/load-balancer target. DNS becomes a deployment-policy dependency when capOS itself must resolve outbound names, when the public proof asserts a DNS hostname end to end, or when IPv6 ingress adds AAAA/certificate policy.

Beyond smoltcp

The near-term plan is not to replace smoltcp or hand-roll TCP algorithms. Phase C should first move smoltcp out of the kernel, preserve the existing socket contract, and make its behavior observable. The distinction this lane keeps is between relocation (Phase C slices 7a-7c: run the selected smoltcp build in userspace and preserve the socket contract) and transport policy/status (the capOS control plane around that stack, decomposed below). Relocation does not require any new transport mechanic; the policy/status work starts only after the stack is observable.

What the selected smoltcp build actually exposes

smoltcp is pinned at version 0.13.0 (Cargo.lock). capOS does not build the crate’s default feature set; it enables narrow per-proof subsets:

  • The qemu-only kernel fixture (kernel/Cargo.toml) enables alloc, medium-ethernet, proto-ipv4, socket-tcp, and socket-udp.
  • The early Phase C userspace 7a/7b demos demos/cloud-prod-network-stack-process-smoltcp-skeleton-smoke and demos/cloud-prod-network-stack-smoltcp-socket-caps-smoke enable alloc, medium-ethernet, proto-ipv4, and socket-udp only. Those early demos are UDP-only and should not be read as the current full Phase C L4 status.
  • The later Phase C TCP proofs demos/cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-smoke and demos/cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-smoke enable alloc, medium-ethernet, proto-ipv4, and socket-tcp. The completed cloud-prod-userspace-network-stack-smoltcp-local-proof builds on that substrate and proves a local served TcpListenAuthority / TcpListener / TcpSocket request/response through the userspace network stack.
  • The selected IPv4 Web UI path now has a local DHCP/IPv4 configuration proof over smoltcp’s socket-dhcpv4 path in the Phase C userspace stack. Landed proof stops at config/status, route, and ARP neighbor evidence; the local bounded ICMPv4 Echo Reply proof is also done for diagnostics. socket-dns, the operator IPv4 ping tool, the local remote-session-web-ui L4 proof, private GCE reachability, and public ingress/TLS remain separate gates.

None of the IPv4 TCP builds cited above enables socket-tcp-reno or socket-tcp-cubic. Those features are what compile smoltcp’s Reno and CUBIC controllers into the CongestionControl enum; without them the only available variant is CongestionControl::None, which is also smoltcp’s default. capOS therefore runs with no congestion control today as a consequence of its build configuration, not as a reviewed policy choice. Selecting Reno (or CUBIC, which uses f64) is a build-feature flip plus a set_congestion_control call, not a custom algorithm.

For read-only status, smoltcp’s TCP socket already exposes the introspection capOS would surface: connection state (state() over the TCP state machine), local_endpoint()/remote_endpoint(), liveness predicates (is_open/is_active/is_listening, may_send/may_recv, can_send/can_recv), buffer sizes (send_capacity/recv_capacity), and the current backpressure depth (send_queue/recv_queue bytes). Keepalive and idle timeout are policy setters with matching getters (keep_alive/set_keep_alive, timeout/set_timeout). There is no per-socket getter for negotiated MSS, RTT, or retransmission counts in 0.13.0; MTU is an Interface/phy::Device property, so MTU/MSS status must be sourced from the interface and device capabilities, not from the TCP socket.

Status capOS must surface

Read-only transport status the socket/listener caps should expose, each backed by an existing smoltcp getter (or interface property) so it records selected behavior rather than asserting new mechanics:

Statussmoltcp / interface source
Connection statetcp::Socket::state()
Local / remote endpointlocal_endpoint() / remote_endpoint()
Send/receive backpressure depthsend_queue() / recv_queue() vs send_capacity() / recv_capacity()
Readiness / livenessmay_send/may_recv, can_send/can_recv, is_active/is_listening
Active keepalive / idle timeoutkeep_alive() / timeout()
Active congestion controllercongestion_control() (today always None)
Interface MTU and configured-MTU sourceInterface/phy::Device capabilities, manifest config
Listener backlog pressureaccepted-socket count vs configured backlog
Close / error / reset reasonsocket close transition plus the cap/network.rs error mapping

v0 classification

  • v0 policy inputs (operator/service-settable): per-socket keepalive interval and connect/recv/idle timeout (smoltcp set_keep_alive / set_timeout plus connect/accept/recv deadlines), and listener backlog bound. These map to existing smoltcp setters and to call-level deadlines.
  • v0 read-only status: the status table above — exposed through the socket, listener, and NetworkManager-side status surface without letting callers mutate stack internals.
  • Deferred until workload evidence: congestion-control algorithm selection, path-MTU discovery, TCP-mechanic tuning (window scaling, Nagle/quickack policy), and any stack replacement. The default is to observe and surface the selected stack’s behavior first.

Decomposed follow-ups

  • Cancellation, readiness, close, and backpressure semantics are settled by network-socket-readiness-poll-cancel-backpressure-local-proof (done); the POSIX poll/select bridge over that model is settled by posix-socket-poll-select-bridge-local-proof (done). The settled readiness states map to POSIX event bits in the shared capos-rt::pollselect core (POLLIN/POLLOUT/POLLHUP/POLLERR/POLLNVAL, no stale readable/writable after close/release); the libcapos-posix C poll()/select() surface and <poll.h>/<sys/select.h> headers delegate to it and fail closed on unsupported event bits / bad nfds / closed fds. The proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture, posix_surface=demo-local-model) plus the c-libc-surface C-surface checks; make run-posix-socket-poll-select. Blocking readiness (a Pollable cap) is the follow-up lane, since the v0 UdpSocket/Pipe caps expose no non-blocking readiness method.
  • Read-only transport status (the table above, including congestion-control reporting and interface MTU/MSS reporting) is settled by network-transport-status-cap-local-proof (done). The local proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture, status_surface=demo-local-model); the production cap/schema wiring of the status surface is the follow-up lane.
  • Keepalive and connect/accept/recv timeout policy inputs are owned by network-transport-keepalive-timeout-policy-local-proof (done). The local proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture, policy_surface=demo-local-model); the production cap/schema wiring of these inputs is the follow-up lane. That lane should model connection-refused as its own terminal call outcome (the v0 demo’s DeadlineWaiter only distinguishes timeout from a still-parked call, proving refused-vs-timeout distinctness at the socket layer rather than in the waiter abstraction).
  • Congestion-control evaluation is a deliberately deferred lane, not a runnable task. It may only open after the read-only transport-status proof lands and a workload produces evidence (loss/throughput/latency under a real capOS network server) that the default CongestionControl::None is inadequate. Its entry criteria are: a reproducible workload, a recorded baseline under None, and a decision to flip the socket-tcp-reno/ socket-tcp-cubic build feature (configuration, still not a custom algorithm) before any hand-rolled TCP mechanic is even considered. Replacing smoltcp’s TCP mechanics remains speculative until that evidence exists.

Task Lanes

Docs/status lanes (both done 2026-06-03):

Blocked behavior/read-side lanes:

  • network-operator-status-tool-local-proof is done: it adds the operator-visible ip addr / route / DNS / link / socket-state equivalent over the Phase C userspace stack. A network-stack server acquires a real IPv4 DHCP lease, snapshots link/MAC/address/prefix/ route/gateway-neighbour/DNS/search-domain/lease-state/socket state, and serves it over a read-only network_status endpoint to a separately spawned status tool. The tool holds no NetworkManager cap, prints a bounded status table reflecting the live stack state (distinguishing available DNS from the unavailable search domain), and observes the fail-closed rejection of a forged socket-creation call. Proof: make run-network-status-tool. Promoting the demo-local status surface to a first-class NetworkStatus schema interface is deferred (it would cross the schema/generated-bindings conflict domain).
  • network-dhcpv4-lease-lifecycle-local-proof is done: it extends the first DHCP config proof into a real lease lifecycle (renewal, rebind, expiry/fail-closed, static precedence, DNS option publication) via make run-network-dhcpv4-lease-lifecycle.
  • network-system-dnsresolver-cap-local-proof is done: it adds a typed DnsResolver capability with a strict cross-process authority split. A resolver server owns the upstream-DNS authority (it runs the query over a real smoltcp UDP socket against a configured upstream, with the upstream isolated in-process as a deterministic DNS responder under a synthetic clock), sources resolver config from a static-manifest entry plus a modelled DHCP option-6 entry with observable provenance, and serves a read-only DnsResolver endpoint. A separately spawned resolver tool holds only the endpoint – no NetworkManager, Nic, or UDP socket authority – so it resolves bounded A/AAAA hostnames through the cap and cannot resolve names by ambient network authority. The proof exercises a resolved A record, a resolved AAAA record (A/AAAA-capable API shape), NXDOMAIN -> not-found, a silent upstream -> typed timeout, fail-closed unavailable with no upstream config, a status surface reporting config source/active upstreams/last error (no packet payloads or raw DHCP leases), and the fail-closed rejection of a forged raw-upstream call on the read-only endpoint. No schema, kernel, or capos-rt change: like the operator status surface, the resolver endpoint is an interface-agnostic protocol local to the demo, and promoting it to a first-class DnsResolver schema interface is deferred to avoid the schema/generated-bindings conflict domain. Proof: make run-network-system-dnsresolver.
  • posix-getaddrinfo-system-resolver-bridge-local-proof bridges POSIX getaddrinfo / resolver configuration to DnsResolver.
  • network-ping-diagnostics-tool-local-proof is done: it adds the bounded local IPv4 ping diagnostics tool over the done ICMPv4 Echo Reply lane, proving same-subnet and gateway-routed echo success, malformed-reply drop, timeout/unreachable classification, and retry/payload bounds. Proof: make run-network-ping-tool.
  • network-ping6-diagnostics-tool-local-proof is done: it adds the bounded local IPv6 ping diagnostics tool over the existing IPv6/ICMPv6 lane without changing the IPv4-first Web UI critical path.
  • network-socket-readiness-poll-cancel-backpressure-local-proof is done: it settles usable server semantics for readiness (accept/read/write/closed/reset/config-unavailable), parked-call cancellation, close/stale-waiter rejection, and send/receive backpressure. A single proof process drives two real userspace smoltcp interfaces wired by an in-process frame shuttle under a synthetic clock and asserts each case straight from real smoltcp getters (state, may_*/can_*, *_queue vs *_capacity). Proof: make run-network-socket-readiness. The POSIX poll/select bridge landed in posix-socket-poll-select-bridge-local-proof (done), which exposes the surface only once implemented and proven.
  • network-transport-status-cap-local-proof is done: it surfaces read-only transport status (connection state, endpoints, send/recv backpressure depth, active keepalive/timeout, active congestion controller, interface link/IP MTU with MSS marked derived/not-exposed, listener backlog pressure, and the close/reset reason mapped onto the cap/network.rs NetworkError vocabulary) over the userspace stack, and proves the status read is strictly read-only (fingerprint unchanged, zero frames emitted). A single proof process drives two real userspace smoltcp interfaces wired by an in-process frame shuttle under a synthetic clock (harness=in-process-smoltcp-fixture, status_surface=demo-local-model). Proof: make run-network-transport-status; the production cap/schema wiring of the status surface is the follow-up lane.
  • network-transport-keepalive-timeout-policy-local-proof (done) adds keepalive and connect/accept/recv timeout policy inputs over the userspace stack.
  • network-packet-trace-authority-local-proof adds bounded per-interface packet/debug trace authority for diagnostics. The local proof (make run-network-packet-trace) feeds every transmit/receive frame of one real userspace-smoltcp DHCP bring-up path through a bounded PacketTrace: a fixed capture capacity (so the drop counter is exercised), a fixed per-packet header-only byte cap (payload_policy=header-only-no-body – at most the leading L2/L3/L4 header bytes of any frame are recorded, packet bodies are never captured), a direction filter, an expiry deadline, and capture/drop/admission counters with grant provenance (which authority enabled the trace and why). The captured trace is served over a read-only endpoint to a reader granted only a console plus that endpoint – no Nic, DMAPool, DeviceMmio, Interrupt, or NetworkManager – so the diagnostic authority is strictly observe-only: forged transmit/reconfigure/open-socket calls are rejected fail-closed, and a sibling probe holding no trace cap cannot observe any packet. Payload-visibility policy: the trace exposes only bounded headers for protocol diagnosis (DHCP/ARP/IPv4-UDP classification), never application payload, and it transfers no device or socket authority – this is why the authority is diagnostic-only and is grounded in Debug, Trace, and Profiling Authority (the read-only sampler authority class, not the read/write DebugSession class). Promotion to a first-class PacketTrace schema interface is deferred to avoid the schema/generated-bindings conflict domain, matching the sibling status/DNS diagnostic proofs.

POSIX Adapter Phase P1.4: Running dash

Long-form decomposition for the POSIX adapter Phase P1.4 dash port. Root task records under docs/tasks/ select dispatchable POSIX work and link here; the executable per-step checklist is in docs/proposals/posix-adapter-proposal.md Task 4; the design rationale and validation smoke contract are in docs/proposals/posix-adapter-proposal.md Phase P1.4 and Open Questions §1 (shell candidate) + §7 (fd 0 backing). Open Question §6 (fork policy = Variant A recording shim) is already a final decision in the proposal and does not gate P1.4.

What “Running dash” Means in v0

The validation smoke is make run-posix-shell-smoke. It boots a focused manifest that grants:

  • a TerminalSession cap for stdio,
  • a read-only bootstrap-granted Directory cap rooted at a tiny in-rodata pseudo-fs (the resolver remains Namespace-shaped for forward parity; the v0 manifest grants a Directory because that is what Storage Phase 3 slice 2 ships as a kernel CapObject),
  • a ProcessSpawner narrowed to one allowed binary (ls-shim),
  • and a Timer cap.

tools/qemu-posix-shell-smoke.sh pipes the heredoc ls; echo done into the shell’s fd 0, asserts done on the kernel log, asserts two clean-exit log entries (shell + ls-shim), and asserts clean QEMU exit. Stretch goal: cat foo | grep bar end-to-end against demos/cat-shim/ and demos/grep-shim/, exercising the P1.3 Pipe primitive through a shell pipeline.

This is intentionally narrow: no job control, no signal delivery, no real filesystem persistence, no ulimit (a v0 chdir / cwd string with cwd-relative resolution has since landed – see Slice 4 below). The point is to prove that a real POSIX C program (not a capOS-native shell) boots, parses scripts, dispatches subprocesses through fork+execve, reads stdin, writes stdout, and exits cleanly under QEMU.

Prerequisites Already Landed

  • P1.1 libcapos C substrate (fe5f5208, 2026-05-05 13:28 UTC): Rust staticlib mirror of capos-rt, _start shim, fixed heap, malloc / free / calloc / realloc, console_write_line.
  • P1.2 UDP + DNS resolver smoke (2026-05-05 21:21 UTC): libcapos-posix errno TLS cell, clock_gettime / gettimeofday over Timer, fd-table dispatch shape, __errno_location().
  • P1.3 Pipe + recording-shim fork-for-exec (2026-05-07 09:55 UTC, fix-slice through 05b52873 2026-05-07 21:07 UTC): kernel Pipe cap, ProcessSpawner.createPipe, fd-table FdBacking::Pipe, recording-shim fork / execve / waitpid / _exit, direct posix_spawn / posix_spawn_file_actions_*. The Variant A contract: execve() returns the synthetic child pid on success.
  • Storage Phase 3 slices 1-3 (slice 1 d06dff6b at 2026-05-14 19:31 UTC, slice 2 b11ec9e4 at 2026-05-14 22:30 UTC, slice 3 804a3f41 at 2026-05-14 23:23 UTC): RAM-backed File / Directory / Store / Namespace CapObjects with KernelCapSource::file / directory / store / namespace grant sources. These are the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs.
  • WASI bounded env grant (5f5028e7, 2026-05-13 11:05 UTC): reference shape for a bounded text env grant on initConfig.init (wasiEnv :Text). The dash port mirrors this for its env vector.
  • setjmp / longjmp precursor (libc-setjmp-longjmp, 2026-05-25 21:11 UTC): the x86_64 SysV setjmp / longjmp C-ABI primitive plus jmp_buf and a <setjmp.h> header. This was absent from the original P1.4 surface table, but dash’s exception/interpreter control flow is built on setjmp / longjmp over a real jmp_buf (pervasive in error.h / main.c / eval.c / parser.c / trap.c / …), so it is a hard precursor for the dash build pipeline and shell smoke. Implemented in libcapos/src/setjmp.rs (global_asm), exposed through libcapos-posix/include/capos/posix/setjmp.h, and proven in QEMU via make run-posix-setjmp (direct call returns 0, a longjmp from a deep recursion resumes setjmp with the passed value, and longjmp(env, 0) returns 1). No sigsetjmp / siglongjmp: dash uses only the plain primitive and the v0 signal layer has no asynchronous delivery.

The Phase 2 Open Question §1 (dash candidate) and §7 (fd 0 backing = TerminalSession) are now promoted from working answers to final decisions (docs/proposals/posix-adapter-proposal.md “## Open Questions” §1/§7, Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC)); that promotion was the first dispatch slice of P1.4 (Slice 1 below).

Decomposition

Slice 1: open-question closures (docs-only)

Status: closed (P1.4 Slice 1, 2026-05-24 00:53 UTC). Open Questions §1 (dash 0.5.13.x) and §7 (fd 0 backing = TerminalSession) are now Decided in docs/proposals/posix-adapter-proposal.md; the §1 candidate-survey cross-reference and the Phase P1.4 “Open question closures” bullet are reconciled to match.

Two open questions in docs/proposals/posix-adapter-proposal.md must become final decisions before any code lands:

  • §1: confirm dash 0.5.13.x as the v0 candidate. Alternatives surveyed: busybox ash, oksh, toysh, custom Rust shell. dash wins on size, POSIX strictness, and single-purpose /bin/sh posture.
  • §7: confirm TerminalSession as the canonical fd 0 / 1 / 2 backing for the v0 smoke. An FdBacking::Terminal variant in libcapos-posix/src/fd.rs plus posix_inherit_stdio() adoption is the implementation shape.

Promotion = strike the “Working answer” phrasing in the proposal, replace with “Decided (P1.4 Slice 1, )” and the rationale.

Slice 2: typed clients in capos-rt

Status: closed. The typed clients and interface-ID re-exports are available from capos-rt; make run-posix-file now exercises them through libcapos-posix.

TerminalSessionClient and the TERMINAL_SESSION_INTERFACE_ID re-export already ship from capos-rt/src/client.rs and capos-rt/src/lib.rs; no work there. The net-new wrappers, mirroring the existing PipeClient / UdpSocketClient shape, are:

  • FileClient: read, write, stat, truncate, sync, close over the File interface methods.
  • DirectoryClient: open, list, mkdir, remove, sub over the Directory interface methods, returning typed FileClient / DirectoryClient projections of the transferred result caps.

Add re-exports for the existing FILE_INTERFACE_ID / DIRECTORY_INTERFACE_ID constants (already defined in capos-config/src/lib.rs) from the pub use capos_config::{...} block in capos-rt/src/lib.rs.

Slice 3: fd backing for File / Directory / Terminal

Status: closed. FdBacking::File, FdBacking::Directory, and FdBacking::Terminal are present; read, write, close, lseek, opendir, readdir, and closedir are wired for the RAM-backed root Directory path. The stat / fstat / access / unlink metadata/remove follow-up is closed by the posix-p1-4-file-metadata slice: stat / fstat fill a struct stat (sys/stat.h) from File.stat, access is an existence check (single-identity v0, mode ignored), and unlink resolves the parent Directory and calls Directory.remove. Proven by the extended make run-posix-file smoke. The file-resize follow-up is partially closed by the posix-ftruncate-truncate-file-resize slice: ftruncate(fd, length) and truncate(path, length) drive File.truncate @3 over the RAM-backed root and are proven by the same make run-posix-file smoke (ftruncate shrink ok / truncate by-path ok markers). The fsync(2) / fdatasync(2) C shims over FileClient::sync_wait are implemented in libcapos-posix/src/file.rs; writable-disk (writable_fs) truncate beyond the RAM-backed root remains open.

Extend libcapos-posix/src/fd.rs with three new FdBacking variants:

  • FdBacking::File { client: FileClient, pos: u64 } – the seek position lives in the fd table, not the kernel File cap (the schema-level read/write take an explicit offset).
  • FdBacking::Directory { client: DirectoryClient, iter: ... } – iteration state for readdir.
  • FdBacking::Terminal { client: TerminalSessionClient }.

Route the existing read / write / close C entry points through these variants. Add file-path-only C entry points (open, lseek, stat, fstat, access, unlink, opendir, readdir, closedir) in libcapos-posix/src/file.rs and libcapos-posix/src/directory.rs.

Slice 4: path resolver over root Directory

Status: closed for the bootstrap root Directory shape. A read-only absolute-path resolver in libcapos-posix/src/path.rs:

  • Input: an absolute UTF-8 path and a bootstrap-granted root Directory cap.
  • Walk Directory.sub() for each prefix segment; mint a leaf File / Directory cap directly with Directory.open() / Directory.sub().
  • A v0 per-process cwd string landed (libcapos-posix/src/cwd.rs, make run-posix-cwd): getcwd / chdir plus cwd-relative resolution for open / opendir / stat / access / unlink / mkdir. chdir validates the target directory through the resolver, stores the normalized absolute string, and drops the cap; cwd inheritance across spawn is still deferred. No .. collapsing – escape is prevented by the kernel Directory cap’s lack of a parent edge, not a resolver clamp.
  • Returns typed File / Directory result caps that flow into the fd-table backing.

The future Namespace.resolve + Store.get shape remains planned for a real filesystem service; the v0 dash smoke uses the bootstrap-granted root Directory, no Store / content-addressed hashes.

Slice 5: stdio over TerminalSession

Status: closed. posix_inherit_stdio() adopts a bootstrap-granted TerminalSession cap as fds 0 / 1 / 2 (FdBacking::Terminal), with the pipe-backed inheritance path retained for posix_spawn-driven pipeline children; proven by make run-posix-stdio-terminal-smoke.

posix_inherit_stdio() already adopts pipe-backed fds 0 / 1 / 2 from the recording-shim execve path. Extend it to also adopt a bootstrap-granted TerminalSession cap as fd 0 / 1 / 2 when the manifest supplies one (the posix-pipe pipeline children stay on the existing pipe path). The shell binary calls posix_inherit_stdio() once from main() before reading the heredoc.

Slice 6: env vector + getenv / setenv / putenv

Mirror the WASI host adapter’s wasiEnv :Text shape:

  • Add a bounded posixEnv :Text (or per-key `posixEnvEntries
    List(Text)) grant on initConfig.initinschema/capos.capnp. This is the only P1.4 schema touch; queue on the shared schema serial surface per docs/backlog/index.mdConcurrency Notes when selected. Regenerate the checked-in capnp bindings;make generated-code-check` must pass.
  • Read the grant from the bootstrap CapSet at startup; populate a per-process env vector in libcapos-posix/src/env.rs.
  • C entry points: getenv, setenv, putenv, unsetenv.
  • LaunchParameters remains a follow-on for non-v0 callers.

Slice 7: printf / string subset

Status: closed. The focused C library subset now ships from libcapos-posix, and make run-posix-printf proves formatted output plus string/mem, numeric conversion, and ctype behavior from a live capOS C process.

A focused C library subset shipped from libcapos-posix (not a full libc, not a musl port):

  • stdio.h subset: printf, fprintf (fd 1 / fd 2 only), vprintf, vfprintf, snprintf, vsnprintf, putchar, puts, fputs, fputc. No fopen / FILE * – those route through the fd-table surface.
  • string.h subset: memcpy, memmove, memset, memcmp, strlen, strcmp, strncmp, strchr, strrchr, strcpy, strncpy, strcat, strncat, strdup.
  • stdlib.h subset: atoi, strtol, strtoul. Process termination still uses the existing libcapos _exit path; exit / abort stay outside this focused printf/string slice.
  • ctype.h subset: isspace, isdigit, isalpha, isalnum, isupper, islower, tolower, toupper.

malloc / free / calloc / realloc already ship from libcapos.

Slice 8: signal stubs

Status: closed for the v0 dash-port surface. signal() and sigaction() validate and store handlers in a per-process table, but handlers are never delivered. kill() fails closed with EPERM because this POSIX layer has no target ProcessHandle authority, and raise() fails closed with ENOSYS because self-delivery is not implemented. make run-posix-signal-time proves the documented behavior from a live capOS C process.

Header-and-stub-only signal, kill, sigaction, plus a TLS-stored handler table that accepts handler registration but never delivers a signal. dash registers a SIGCHLD handler at startup; the stub records the handler pointer and returns 0. Documented out of scope: real SIGCHLD / SIGTSTP delivery, job control, controlling terminals.

Slice 9: time additions

Status: closed. time(2), nanosleep, and sleep reuse the existing Timer cap path already used by clock_gettime / gettimeofday. make run-posix-signal-time proves monotonic-since-boot time() output, bounded nanosleep(), and one-second sleep() from a live capOS C process.

time(2), nanosleep, sleep over the existing Timer cap; clock_gettime / gettimeofday already landed under P1.2 Phase B.

Slice 10: identity stubs

Status: closed for the ready-task surface (getpid, getuid, getgid) at commit 1a8a9896 (2026-05-23 06:51 UTC). getpid returns the stable capos-rt bootstrap pid for the current process, and the recording-shim child-pid allocator avoids colliding with the caller’s pid. getuid / getgid return the hardcoded single-identity uid/gid 0. The geteuid / getegid alias follow-up is closed (task posix-geteuid-getegid): both delegate to getuid / getgid since the effective ids equal the real ids under the v0 single-identity model, declared in unistd.h, and asserted by run-posix-identity via the printed euid=0 egid=0 fields.

Slice 11: dash vendoring + Variant A patch

Status: closed (posix-p1-4-dash-vendor, 2026-05-24 19:40 UTC). dash v0.5.13.4 is vendored mirror-as-is under vendor/dash/ (full upstream tree, byte-identical) with vendor/dash/VENDORED_FROM.md. The Variant A fork-exec patch set lives under vendor/dash/patches/ as two .patch files (0001-execve-return-synthetic-pid.patch over src/exec.c/src/exec.h; 0002-vforkexec-adopt-synthetic-pid.patch over src/jobs.c), cumulative diff 45 changed lines (< 50). Design evidence only – nothing compiles or runs at this slice; the C-build slice (posix-p1-4-c-multifile-build) and shell smoke (posix-p1-4-dash-shell-smoke) prove the behavior end-to-end.

  • Vendor dash 0.5.13.x under vendor/dash/ at a pinned tag, mirror-as-is. Add vendor/dash/VENDORED_FROM.md recording the upstream URL, commit, tag, and refresh procedure (mirror the existing vendor/dns-c-wahern/VENDORED_FROM.md shape).
  • Apply the Variant A per-call-site patch: at each fork-exec site, capture execve()’s synthetic pid return value, bail on -1, and assign back to child. Patches live under vendor/dash/patches/ with one .patch per call site; the cumulative diff against upstream is < 50 lines.
  • Inter-call dup2 / close between fork and execve already records through libcapos-posix and needs no per-call patching.
  • Carried into Slices 12-13: patch 0001 de-noreturns shellexec(), so the two no-fork exec-replace callers (src/eval.c evalcommand() EV_EXIT path and execcmd(), each with /* NOTREACHED */) now fall through under the recording shim. A single non-interactive command (dash -c '/bin/echo hi') takes the EV_EXIT path, not vforkexec(). Slice 12/13 must disable the EV_EXIT in-place-exec optimization under the recording shim (fork-exec-then-exit) or add an exec-replace-then-exit patch before the binary runs. Details in vendor/dash/VENDORED_FROM.md.

Slice 12: C-build pipeline for vendored multi-file C sources — CLOSED

The existing c-build helper compiles single-file demos/*/main.c smokes against libcapos.a + libcapos_posix.a. dash is a multi-translation-unit C codebase (main.c, eval.c, exec.c, expand.c, input.c, jobs.c, mail.c, memalloc.c, miscbltin.c, mystring.c, nodes.c, options.c, output.c, parser.c, redir.c, show.c, trap.c, var.c, plus generated tables).

Closed by posix-p1-4-c-multifile-build: the Makefile gained the reusable capos-c-multitu-elf define (instantiated via $(eval $(call ...))) that

  • accepts a list of .c files,
  • compiles each to an object with clang --target=x86_64-unknown-none-elf -nostdlib -static -I libcapos/include -I libcapos-posix/include,
  • links the objects with libcapos_posix.a + libcapos.a,
  • produces a userspace ELF without dragging in an external libc.

The proof demo demos/c-multifile/ (main.c + greet.c + greet.h) builds through the rule; greet.c uses libcapos-posix strlen/memcpy, so the link resolves symbols from both archives. make run-c-multifile boots the two-TU ELF and asserts the greet=/checksum= line computed in the helper TU, proving the cross-TU call executed. The rule is reusable for future C ports (busybox utilities, dash).

Slice 12.5: dash build pipeline — LANDED

Status: landed (2026-05-26 05:11 UTC, task posix-p1-4-dash-build-pipeline). The sysroot/libc precursor landed first (2026-05-25 22:23 UTC, task libc-dash-sysroot-surface).

The build pipeline lives under vendor/dash/capos/ (outside the mirror-as-is src/): config.h (pinned autotools config) and gen-tables.sh (stages a patched source copy under target/dash/src and runs dash’s six host generators into target/dash/gen). The Makefile dash target funnels dash_CFILES + the five generated tables through capos-c-multitu-elf against libcapos_posix.a + libcapos.a in the -nostdinc sysroot include mode, producing target/dash/dash.elf (statically linked, 0 undefined symbols, _start from capos-rt, the two Variant A fork-exec patches compiled in). A clean tree (rm -rf target/dash && make dash) regenerates deterministically; the mksignames signal-name table is the one host-<signal.h>-derived table (cosmetic on capOS v0). Runtime behavior (including the EV_EXIT residual) is the dependent posix-p1-4-dash-shell-smoke. Config derivation + host-table caveat: vendor/dash/VENDORED_FROM.md.

Original precursor notes (posix-p1-4-dash-build-pipeline is now ready): The build-pipeline mechanics were validated by a -nostdinc compile/link probe over the full vendored dash TU set (branch posix-p1-4-dash-build-pipeline, gitignored target/dash-probe/):

  • A pinned capOS config.h (SMALL=1, JOBS=0, HAVE_* mostly undefined, _PATH_* literals, PRIdMAX "lld", USE_TEE/USE_MEMFD_CREATE 0) drives the preprocessor and gates.
  • All six generators run deterministically and emit the tables: mktokens (token.h/token_vars.h), mksyntax (syntax.c/syntax.h, needs token.h on its compile include path), mknodes (nodes.c/nodes.h), mksignames (signames.c), mkbuiltins over a preprocessed builtins.def (builtins.c/builtins.h), and mkinit over the 27-file TU list (init.c). mkinit and mkbuiltins take their inputs as separate arguments — mind shell word-splitting in the Makefile recipe.

The blocker was that dash includes bare POSIX headers (<unistd.h>, <fcntl.h>, <signal.h>, …), which resolved to the host /usr/include under the existing flags, plus a broad missing libc surface. libc-dash-sysroot-surface closed both:

  • Sysroot. libcapos-posix/sysroot/include/ holds bare-name headers (stdio.h, unistd.h, sys/types.h, termios.h, wchar.h, …) that forward to the capos/posix/* source of truth. Consumed with -nostdinc
    • four -isystem roots (clang freestanding builtins, the sysroot, and the two capOS namespaces) via the existing capos-c-multitu-elf rule (CAPOS_C_SYSROOT_INCLUDE in the Makefile). The focused proof is make run-c-libc-surface (qsort / strerror / umask / strtoll / strstr / S_IS*, all through bare includes).
  • Surface. The inventory plus several items the original table understated: the C/POSIX-locale multibyte layer expand.c needs unconditionally (mbrtowc/mbrlen/mbsrtowcs/wcschr/wctype/iswctype + the isw* family, <wchar.h>/<wctype.h>), strpbrk, lstat, getgroups, wait3, vfork, htonl/htons/ntohl/ntohs (used by bltin/printf.c), the S_IS* file-type macros, DT_LNK/…, environ, and the sys_siglist array dash’s own strsignal reads.

A -nostdinc compile of the full vendored TU set (27 hand-written + 3 bltin + 5 generated, using the probe’s generated headers) against the real sysroot now reports 0 errors, and a symbol audit shows 0 unresolved libc symbols once the dash objects and the two capOS archives are combined (evidence: ~/capos-evidence/libc-dash-sysroot-surface/).

Config.h the pipeline slice must pin (these are autotools/feature flags, not libc surface — they belong to posix-p1-4-dash-build-pipeline): the probe set already documented (SMALL=1, JOBS=0, _PATH_*, PRIdMAX "lld", USE_TEE/USE_MEMFD_CREATE 0) plus HAVE_ALLOCA_H 1 (so <alloca.h> is included; the sysroot provides it as a __builtin_alloca alias), HAVE_WAIT3 1 (dash’s #else branch is a non-compiling 4-arg waitpid; with the flag it uses the wait3 symbol the surface provides), HAVE_ISALPHA 1 (capOS <ctype.h> declares the classifiers, so dash uses them directly instead of its _isXXX rename shims), and the stat64/lstat64/fstat64/open64/readdir64/ dirent64/glob64* → unsuffixed #define fallbacks. Do not define HAVE_STRSIGNAL or HAVE_SYSCONF — dash provides those itself (and consumes sys_siglist/its noreturn sysconf).

Slice 13: ls-shim + manifest + smoke harness

Status: closed (2026-05-27 09:36 UTC) by make run-posix-shell-smoke: a real vendored dash boots as PID 1, reads the heredoc off its fd 0 TerminalSession, creates two entries in its bootstrap RAM root Directory (> /alpha, > /beta), opens that directory as fd 3 (exec 3< /), dispatches /ls-shim through fork/execve, prints done, and exits; ls-shim lists the inherited directory (alpha, beta) over the shared terminal and both processes exit cleanly. The earlier block (2026-05-27 00:46 UTC) was the fd-inheritance premise conflict (vanilla dash forwarded no capability to ls-shim); it was resolved by the posix-recording-shim-full-fd-inherit + posix-terminal-session-forwardable + posix-open-directory-fd precursors, so assembly needed only three minimal additions: a vendor/dash/patches/ runtime bootstrap (synthesize argv[0] + posix_inherit_stdio(); the runtime entry passes argv=NULL and wires no POSIX stdio), a basename map in the recording-shim spawn (the kernel matches the manifest binary name, so /ls-shim resolves to ls-shim; a no-op for the bare-name smokes), and the ls-shim / manifest / harness assembly. No dash EV_EXIT patch was needed: the heredoc never makes an external command the last command, so /ls-shim takes vforkexec(). Full historical analysis: see the completed task record docs/tasks/done/2026-05-27/posix-p1-4-dash-shell-smoke.md.

Finding 1 (vanilla dash forwards no cap) resolved 2026-05-27 by posix-recording-shim-full-fd-inherit (done): the recording shim now inherits the parent’s full live fd table by default (POSIX fork+execve), so dash’s open stdio flows to ls-shim with no dup2, and a held read-only Directory fd inherits as the child’s cwd source. Its kernel precursor posix-terminal-session-forwardable (done) lets the terminal forward non-destructively (Raw), so dash keeps its own terminal. Close-on-exec is enforced and an aliased non-destructive backing Copy-shares; proof make run-posix-fd-inherit-default. The remaining Slice 13 items are the secondary gaps: posix-open-directory-fd (open(dir, O_RDONLY) -> FdBacking::Directory, done 2026-05-27, proof make run-posix-open-dir-fd; needed only if a N</ redirection is used – a dirfd(opendir()) forward also works), the slash-bearing /ls-shim PATH-stat workaround, and the dash EV_EXIT in-place exec-replace residual (see Slice 11).

  • demos/ls-shim/main.c: open a hardcoded in-rodata directory path, iterate with opendir / readdir / closedir, print each entry name, exit cleanly. This is the only allowed spawn target in the smoke.
  • system-posix-shell.cue: a focused-proof manifest (own CUE package, imports capos.local/cue/defaults) granting TerminalSession, a read-only Directory over an in-rodata pseudo- fs containing exactly the entries the heredoc references, a ProcessSpawner narrowed to ls-shim, and a Timer.
  • Makefile vendor-dash, libcapos-posix-shell, manifest-posix-shell.bin, capos-posix-shell.iso, and run-posix-shell-smoke targets.
  • tools/qemu-posix-shell-smoke.sh host harness: pipe ls; echo done heredoc into fd 0, assert done, two clean-exit log entries (shell + ls-shim), the scheduler halt line, and QEMU exit status 1 (isa-debug-exit).

Slice 14 (stretch): cat | grep pipeline — DONE (2026-05-27)

Drives cat foo | grep bar end-to-end through dash’s pipeline parser:

  • demos/cat-shim/main.c: writes an in-rodata three-line corpus to stdout (only the middle line contains bar).
  • demos/grep-shim/main.c: reads stdin line by line, writes lines containing argv[1] to stdout. The initial Slice 14 proof baked bar as a compile-time fallback because child argv did not yet cross the recording-shim execve boundary; Slice 20 below now seeds grep-shim through the private posix_argv pipe, and the fallback no longer matches the corpus. The shell smoke therefore fails if /grep-shim bar does not deliver bar as argv.
  • system-posix-shell.cue: both shims as ProcessSpawner targets.
  • tools/qemu-posix-shell-smoke.sh: asserts match bar here reaches the terminal, the two non-matching corpus lines do not, and ≥4 clean child exits (dash + ls-shim + cat-shim + grep-shim).

This proves the P1.3 Pipe primitive end-to-end through dash’s own pipeline parser, not just the recording-shim posix_spawn_file_actions path.

Reconciliation (posix-dash-pipeline-exec-reconcile, DONE 2026-05-27). The premise conflict the blocked attempt found – a real cat foo | grep bar page-faulted dash after the first element because evalpipe sets EV_EXIT on every element and the recording-shim patch set had only reconciled vforkexec – is resolved by dash patch 0004-pipeline-evexit-recording-shim.patch plus a libcapos-posix wildcard reap:

  • 0004 (eval.c/jobs.c/jobs.h). evalcommand()’s EV_EXIT in-place shellexec() stashes the synthetic pid (capos_exec_pid) and breaks instead of faulting through case CMDBUILTIN. evaltree() suppresses its EV_EXIT exraise(EXEND) while that pid is pending. evalpipe()’s child arm calls evaltree() (not the __noreturn__ evaltreenr()) so it returns under the no-separate-address-space recording fork, then adopts the pid into the pipeline job via the now-exported forkparent(), re-suppressing interrupts to balance the child-arm INTON. forkshell() sets vforked around fork()/forkchild() so the recording-shim “child” does not freejob() the pipeline job. The cumulative dash patch budget was raised past 50 lines for this; see vendor/dash/VENDORED_FROM.md for the decision and the remaining residuals (the single trailing external command residual is closed by Slice 16 below, and the exec foo builtin residual by Slice 17; compound pipeline elements remain unsupported under the recording shim).
  • libcapos-posix wildcard waitpid(-1)/wait3. dash reaps pipeline children with wait3 -> waitpid(-1); the v0 surface now reaps any tracked child (blocking) and honors WNOHANG as “no child ready”, which is all waitforjob -> dowait needs.

Proof: make run-posix-shell-smoke (extended with the pipeline line). Regression-clean: make run-posix-pipe-smoke, make run-posix-execve-inherit-smoke.

Slice 15: PID-1 argv channel (posixArgs) — DONE (2026-05-30)

Closes the “capOS delivers no argv” gap for the manifest-launched binary, without a schema or kernel change:

  • libcapos-posix/src/args.rs: posix_args(int *argc) returns a process-lifetime, NUL-terminated char ** built from initConfig.init.posixArgs (a CueValue text list), read off the granted boot BootPackage cap. It mirrors capos-wasm/src/payload.rs::read_wasi_args and reuses the posixEnv blob-streaming/CueValue helpers (Slice 6, promoted to pub(crate)). Bounded by 32 entries / 4096 bytes-per-entry / 8192 bytes-total, fail-open to empty argv on any malformed or absent grant. Delivery is opt-in: the C main(argc, argv) trampoline in libcapos/src/entry.rs is untouched (still 0, NULL), so bare-name demos are unaffected.
  • dash patch 0003: when argv == 0, pull posix_args() and use it for procargs() when non-empty, keeping the {"sh", 0} fallback. A manifest seeding posixArgs: ["dash", "-c", "CMD"] now reaches evalstring(minusc, ...), so dash -c is invokable.
  • Proof: make run-posix-args-smoke (manifest seeds ["posix-args-smoke", "alpha", "beta"]; the C process prints argc=3 and each argv[i]). Regression: make run-posix-shell-smoke (the argv==0 fallback path is unchanged).

Follow-up closed by Slice 20 below: cross-execve argv inheritance to recording-shim children, needed before a spawned grep-shim could receive argv[1].

Slice 16: trailing top-level external command waits and exits (sh -c) — DONE (2026-05-30)

Closes the largest posix-dash-pipeline-exec-reconcile runtime residual: a single trailing top-level external command (the sh -c 'cmd' shape), unblocked by the Slice 15 argv channel.

  • Premise correction. The blocked task assumed 0004’s capos_exec_pid guard left the trailing child orphaned-but-spawned. In fact dash’s EV_EXIT optimization execs in place without forking, and the recording shim only spawns from an execve() inside a fork()-opened record window — so an unforked top-level in-place shellexec() returns ENOSYS (process.rs::execve with recording.active == false) and dash exits 126, never spawning the child. The fix is therefore to fork, not to wait on a pid that was never produced.
  • dash patch 0005-evexit-trailing-extcmd-wait.patch (eval.c): a capos_pipe_arm counter, raised by evalpipe() around its child-arm evaltree() call, gates the in-place EV_EXIT shellexec() to pipeline arms only (capos_pipe_arm > 0, where forkshell() already opened the window). A top-level EV_EXIT command (capos_pipe_arm == 0) takes the forking vforkexec() path, whose existing waitforjob() blocks for the child and returns its status; evaltree() then exraise(EXEND)s and exits with it. Substantive change: four lines; the rest are explanatory comments.
  • system-posix-shell.cue + tools/qemu-posix-shell-smoke.sh: the proof moves off the fd-0 heredoc onto a manifest posixArgs dash -c script (granted the boot BootPackage cap). The script keeps the directory-setup, pipeline-parser (/cat-shim foo | /grep-shim bar), and successful-listing (/ls-shim 3< /) proofs, then ends with a trailing /ls-shim that lacks fd 3 and exits 31. dash (the last process to exit, after waiting for that child) exits 31 — a value it has no other code path to produce, so it is a non-tautological wait-and-exit discriminator (an orphan-and-continue would have exited 0).
  • Proof: make run-posix-shell-smoke. Regression-clean: make run-posix-args-smoke, make run-posix-pipe-smoke, make run-posix-execve-inherit-smoke.

Remaining residual at the time: the execcmd() / exec foo command (process-image replace) path, closed by Slice 17 below. The later cross-execve argv inheritance gap is closed by Slice 20 below.

Slice 17: exec foo command builtin forks, waits, and exits (execcmd) — DONE (2026-05-31)

Closes the last posix-dash-pipeline-exec-reconcile in-place-shellexec() residual: the exec builtin’s command form (exec foo command), which must replace the shell with foo and exit with foo’s status.

  • Why 0005 did not cover it. execcmd() (eval.c) runs as a CMDBUILTIN (EXECCMD) through evalbltin(), which is dispatched from evalcommand()’s case CMDBUILTINbypassing the default:-case EV_EXIT fork gate 0005 added. So 0005’s capos_pipe_arm gate never sees the exec builtin; its in-place shellexec()->execve() has no fork()-opened record window and returns ENOSYS, and execcmd() ignores it and return 0s, continuing the script.
  • dash patch 0006-execcmd-fork-replace-wait.patch (eval.c): the argc > 1 form of execcmd() now forks via vforkexec(NULL, argv + 1, pathval(), 0) (which opens the record window), waitforjob()s for the replacement child, sets savestatus to its status, and exraise(EXEXIT)s — the exact shell-exit channel the exit builtin (exitcmd) uses, so EXITRESET copies savestatus into exitstatus and the shell exits with the replacement command’s status. n == NULL is safe (forkchild() early-returns under vforked; forkparent() reads n only when jobctl is set, never in a non-interactive dash -c). The no-command form (exec 3< /, argc == 1) keeps its return 0 and popredir() redirection-permanence path unchanged. Substantive change: five lines; the rest are explanatory comments.
  • system-posix-shell.cue + tools/qemu-posix-shell-smoke.sh: the proof replaces the trailing bare /ls-shim with exec /ls-shim (no fd 3, exits 31) followed by a poison tail > /gamma; /ls-shim 3< /. Correct exec-replace exits dash with 31 and the poison tail never runs (its [ls-shim] listed 3 entries / entry: gamma markers are absent); a buggy ignore-and-continue would instead create /gamma, list 3 entries, and exit 0. The directory-setup, pipeline-parser, and successful-listing proofs are kept intact.
  • Proof: make run-posix-shell-smoke. Regression-clean: make run-posix-args-smoke, make run-posix-pipe-smoke, make run-posix-execve-inherit-smoke.

Remaining residual closed by Slice 20 below: cross-execve argv inheritance.

Slice 18: read VAR builtin reads a line off fd 0 TerminalSession — DONE (2026-05-31)

Closes the one interactive-stdin path every prior P1.4 smoke skipped: dash’s read builtin (miscbltin.c readcmd()) consuming a line off its fd 0 TerminalSession cooked-mode line discipline and binding it to a shell variable. run-posix-shell-smoke drives a dash -c script and feeds no stdin (its harness only sleeps); the fd-0 -> TerminalSession.readLine read path was fully wired but never exercised through dash’s own read.

  • No dash patch and no libcapos-posix change needed. readcmd() reads via the buffered pgetc() -> preadbuffer() -> preadfd() -> read(0, buf, BUFSIZ) path. input_init() keys the buffering mode off tcgetattr(0), which libcapos-posix synthesizes as canonical (c_lflag & ICANON), so stdin_bufferable() is true and dash takes the plain read() branch (not the Linux tee()/splice history path, which would otherwise return a non-EINVAL error and be misread as EOF). libcapos-posix read() over FdBacking::Terminal returns exactly one line plus a synthesized \n per call — the canonical-tty contract dash expects — so the two sequential reads each consume one line.
  • system-posix-read-builtin.cue: a focused manifest granting terminal (TerminalSession), timer, and boot (for posixArgs), with the dash -c script printf 'rb-ready\n'; read NAME; printf 'got=[%s]\n' "$NAME"; read -r RAW; printf 'raw=[%s]\n' "$RAW". printf %s (not echo) echoes the bound values so the asserted bytes bypass echo’s backslash-escape interpretation.
  • tools/qemu-posix-read-builtin-smoke.sh + run-posix-read-builtin: the drive step handshakes rather than blind-sleeps. The kernel line discipline has no inter-read input buffer (it consumes UART bytes only while a readLine is pending) and the UART carries no EOF, so a line fed before userspace is draining is lost and the blocked read hangs to the QEMU timeout. drive tails the terminal-UART log: feed line 1 after the rb-ready banner, feed line 2 after the got=[ echo (so the two distinct lines never collide in the single shared UART FIFO), then hold stdin open until the raw=[ echo. A byte arriving just before its readLine is posted is still caught by handle_read_line()’s synchronous FIFO drain, so the banner/echo gates are a sufficient ordering guarantee.
  • Proof: make run-posix-read-builtin — observed got=[hello world] and the byte-preserved raw=[raw\back\slash] on the terminal UART, dash exit code 0, scheduler halt. Non-tautological: the echoed values are the harness-fed fd-0 bytes, which the script has no other source for. Regression-clean: make run-posix-shell-smoke (the exec/pipeline/$? paths are untouched; it still feeds no stdin).

Remaining residual closed by Slice 20 below: cross-execve argv inheritance.

Slice 19: test/[ file-test builtin stats the root Directory — DONE (2026-05-31)

Closes the last unexercised reachable-cap dash builtin path: test -e/-f/-d/-r FILE (and [ ... ]), the single most common shell file-predicate, reaching libcapos-posix stat/lstat over the bootstrap root Directory and discriminating file vs directory vs absent. Every prior P1.4 smoke exercises stdio, pipelines, exec-replace, argv, or interactive read, but none drives the test/[ builtin against the filesystem.

  • No dash patch and no libcapos-posix change needed. dash’s src/bltin/test.c filstat() calls stat64(nm, &s) / lstat64(nm, &s) (capos/config.h maps stat64->stat, lstat64->lstat) and switches on S_ISREG/S_ISDIR/FILEXIST; testcmd() registers both test and [ (src/builtins.def.in). HAVE_FACCESSAT is unset in capos/config.h, so -r/-w/-x route through filstat()->test_access() (a dash-internal check on the stat result, no extra libc call); under capOS’s single-identity euid=0 test_access(R_OK) short-circuits to true on any successful stat (before the st_mode read), so r=yes proves -r reached a real stat of /alpha, distinct from the absent-path miss, not the mode-bit comparison itself. libcapos-posix stat() (src/file.rs) resolves the path against the bootstrap root Directory, filling S_IFREG|0644 for files (write_file_stat) or S_IFDIR|0755 for directories (write_dir_stat, incl. the root via is_root_path); lstat() and access() are landed too, and the S_IS* macros ship from libc-dash-sysroot-surface.
  • system-posix-test-builtin.cue: a focused manifest granting terminal (TerminalSession), timer, boot (posixArgs), and root (source: {kernel: "directory"}). No process_spawnertest/[ are in-process builtins, no fork/exec; the > /alpha redirect is dash’s own open(O_CREAT) over the root Directory (already proven by system-posix-shell.cue). The dash -c script creates /alpha, then runs the six predicates with printf markers, using the && ... || ... form for -d /alpha and -e /nope so the negative-branch markers prove real discrimination.
  • tools/qemu-posix-test-builtin-smoke.sh + run-posix-test-builtin: a blind boot+sleep harness (dash feeds no fd 0, so no stdin handshake — same shape as run-posix-shell-smoke). The assert checks all six markers, the absence of the true-branch d=alpha-dir (a blanket-true test would emit it), the clean exit, and the scheduler halt.
  • Proof: make run-posix-test-builtine=yes, f=yes, d=alpha-notdir, root=dir, absent=yes, r=yes on the terminal UART, dash exit 0, scheduler halt. Non-tautological: each marker gates on a distinct stat result the script has no other source for; the d=alpha-notdir / absent=yes else-branches and the absent d=alpha-dir prove file/dir/absent discrimination, not a blanket-true builtin. Regression-clean: make run-posix-shell-smoke.

Remaining residual closed by Slice 20 below: cross-execve argv inheritance.

Slice 20: recording-shim execve argv inheritance — DONE (2026-06-07)

Closes the remaining recording-shim child-argv gap without changing the generated ProcessSpawner.spawn(name, binaryName, grants) surface:

  • libcapos-posix/src/process.rs: execve(path, argv, envp) snapshots the C argv vector before consuming the fork-recording window, rejects over-budget or malformed vectors before fd-action replay, writes a bounded binary argv record into a private kernel Pipe, and grants only the read end to the child as posix_argv. The existing full-fd-table inheritance path is unchanged: stdio_<N> grants still carry inherited fd backings, and direct posix_spawn() continues to accept but ignore argv/envp until a broader LaunchParameters design lands.
  • libcapos-posix/src/args.rs: posix_args() first looks for the posix_argv pipe grant and decodes it into the same process-lifetime char ** store used by manifest posixArgs; when the grant is absent it falls back to the manifest boot BootPackage path. The recording-shim payload is capped by the existing 4 KiB Pipe transport, so it is narrower than the manifest 8 KiB-total posixArgs channel but still uses the same 32-entry / bounded C-string shape.
  • demos/posix-execve-inherit-* + tools/qemu-posix-execve-inherit-smoke.sh: the focused smoke now proves both sides: an over-budget argv vector is rejected with E2BIG before the recorded dup2 mutates the parent fd table, and the successful child prints inherited argv[0..2] before listing the inherited Directory entries.
  • demos/grep-shim/main.c + run-posix-shell-smoke: grep-shim now calls posix_args() and uses argv[1] as the filter pattern. Its fallback does not match the corpus, so the existing cat foo | grep bar shell proof now depends on /grep-shim bar crossing the recording-shim execve boundary.

Proofs: cargo build --features qemu, make run-posix-execve-inherit-smoke, make run-posix-shell-smoke.

Conflict Surface Coordination

P1.4 does not touch kernel/src/cap/, kernel/src/sched.rs, or any device-driver foundation file. The schema half is limited to the optional posixEnv bounded text grant on initConfig.init (Slice 6); queue on the shared schema serial surface per docs/backlog/index.md Concurrency Notes when that slice dispatches. Every other slice is parallel-safe with the current selected milestone and DDF follow-up kernel surfaces because it avoids the kernel-core device-driver files.

Out of Scope for P1.4

  • Job control, real signal delivery, controlling terminals.
  • ulimit.
  • A userspace Store / Namespace service over a real backing store – that remains the next Phase 3 item in the storage proposal and is not required for the v0 dash smoke.
  • Real filesystem persistence (block device, virtio-blk, FAT).
  • A POSIX terminal line discipline owned by libcapos-posix – cooked-mode line discipline still lives kernel-side until networking proposal Phase C.
  • Hosted C++. Tracked separately in docs/proposals/userspace-binaries-proposal.md.

Success Criteria

  • make run-posix-shell-smoke exits cleanly under QEMU. A real dash runs a manifest dash -c script that drives the directory listing, the cat | grep pipeline, and an exec /ls-shim whose status (31) dash replaces-and-waits with (the poison tail after it never runs); the harness asserts the listing, the pipeline filter, dash’s exec-replace-and-wait status, the absent poison-tail markers, the clean-exit children, and the scheduler halt line.
  • The vendored dash source under vendor/dash/ is mirror-as-is at a pinned tag with a VENDORED_FROM.md and a patches/ directory whose cumulative diff vs upstream is < 50 lines.
  • libcapos-posix exposes the file / dir / stdio / env / printf / string / signal / time / identity surface listed above; the surface ships from headers under libcapos-posix/include/capos/posix/ with no dependency on an external libc.
  • make workflow-check, make fmt-check, make generated-code-check, cargo test-config, cargo test-lib, cargo build-demos-capos, make capos-rt-check, make run-smoke, make run-c-hello, make run-posix-dns-smoke, and make run-posix-pipe-smoke all remain green.
  • The proposal stamps the phase closeout with merge SHA and a minute-precision timestamp.

Go VirtualMemory Contract

Design slice for the review finding “Go-style VirtualMemory reserve/commit/decommit semantics are missing.” This file does not change the selected milestone; it records the contract that the Go/runtime allocator implementation must satisfy.

Implementation status as of 2026-04-26 18:51 EEST: the kernel, schema, generated bindings, capos-config, capos-rt, host tests, and QEMU proof coverage implement this contract. The closure summary and verification gates are recorded in the done task records and commit history.

Design Context

Grounding

Project docs and code read for this slice:

  • docs/tasks/README.md
  • docs/roadmap.md
  • docs/proposals/go-runtime-proposal.md
  • docs/architecture/memory.md
  • docs/architecture/userspace-runtime.md
  • docs/proposals/oom-and-swap-proposal.md
  • schema/capos.capnp
  • capos-config/src/lib.rs
  • capos-rt/src/client.rs
  • kernel/src/cap/virtual_memory.rs
  • kernel/src/mem/paging.rs

Relevant research files:

  • docs/research/llvm-target.md
  • docs/research/zircon.md

docs/research/llvm-target.md records that the Go runtime path depends on mapping sysAlloc, sysReserve, sysMap, and sysUnused/madvise-like behavior onto VirtualMemory. docs/research/zircon.md is relevant prior art because Zircon separates virtual address regions from memory objects and names commit/decommit as range operations on VMOs; capOS should keep the same separation of virtual reservation authority from physical backing.

Current Gap

The current schema exposes only:

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

The current implementation allocates zeroed physical frames during map, records ownership per committed anonymous page, charges the caller’s frame_grant_pages ledger immediately, rejects non-readable protection, and frees frames during unmap. This is a useful baseline, but it is not the contract Go expects:

  • sysReserve needs address-space reservation without physical frames.
  • sysMap/sysUsed need explicit physical commit inside a prior reservation.
  • sysUnused needs decommit that releases frames while preserving the virtual reservation.
  • sysFree needs unmap-style reservation release so returned arenas do not leak virtual quota or address-space ranges.
  • Stack and arena guard pages need PROT_NONE semantics that reliably fault or fail validation without implying the reservation is gone.
  • Virtual reservation pressure and physical commit pressure need separate quotas and separate auditability.

Contract

The future schema should preserve existing method ids and add explicit reservation operations:

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();

    reserve @3 (hint :UInt64, size :UInt64) -> (addr :UInt64);
    commit @4 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
    decommit @5 (addr :UInt64, size :UInt64) -> ();
}

Protection constants become:

#![allow(unused)]
fn main() {
pub const VM_PROT_NONE: u32 = 0x0;
pub const VM_PROT_READ: u32 = 0x1;
pub const VM_PROT_WRITE: u32 = 0x2;
pub const VM_PROT_EXEC: u32 = 0x4;
}

VM_PROT_NONE is the only valid zero-bit protection value. Unknown bits are rejected. Any non-NONE protection must include VM_PROT_READ; write-only or execute-only user mappings are rejected rather than silently upgraded, because x86_64 cannot represent a present user page that lacks read access. Writable and executable mappings remain rejected. VM_PROT_NONE must be represented by ledger state plus a non-present user PTE, not by relying on hardware “no read” permission.

map remains as a compatibility/convenience operation equivalent to reserve(hint, size) followed by commit(addr, size, prot), with atomic rollback if the commit or result serialization fails. Existing runtime clients can keep using it until the Go allocator switches to the explicit reserve path.

Semantics

All sizes are rounded up to 4 KiB pages after rejecting zero-size ranges and overflow. All non-zero addresses must be page-aligned, and the entire rounded range [addr, addr + size) must fit at or below USER_ADDR_LIMIT without overflow. Ranges overlapping the capability ring or CapSet page remain invalid for reserve, map, commit, decommit, unmap, and protect.

reserve(hint, size):

  • Reserves a contiguous virtual range in the caller’s address space.
  • Allocates no physical frames and installs no user-accessible PTEs.
  • Charges only the virtual reservation ledger.
  • With hint == 0, chooses a free range in the user address space.
  • With hint != 0, acts as fixed no-replace placement: overlap with any live reservation, committed page, object mapping, ring page, or CapSet page fails.
  • Returns the base address of the reservation.

commit(addr, size, prot):

  • Requires the whole range to lie inside existing anonymous reservations owned by the same address-space-bound VirtualMemory cap.
  • Requires every page in the range to be currently uncommitted.
  • Allocates zeroed physical frames, charges the physical commit ledger, and records the committed state per page.
  • Installs present user PTEs for readable/writable/executable protections.
  • For VM_PROT_NONE, allocates and charges frames but leaves user PTEs non-present while retaining the frames in the reservation ledger.
  • This is for committed inaccessible memory whose contents must survive a later protection restore. Pure stack or arena guard pages should stay reserved but uncommitted so they consume virtual quota without consuming physical commit budget.
  • Is all-or-nothing: allocation, page-table updates, ledger charge, and TLB completion reservation must either all become visible or all roll back.

decommit(addr, size):

  • Requires the whole range to lie inside existing anonymous reservations owned by the same cap.
  • Allows committed and already-uncommitted pages in the range.
  • Removes any present PTEs, releases frames for committed pages, drops physical commit charges, and preserves the virtual reservation.
  • Leaves every page in the range in the uncommitted reserved state.
  • Must perform the same local flush and remote shootdown discipline as unmap and protect before a frame can return to the allocator.

protect(addr, size, prot):

  • Requires the whole range to be committed anonymous pages owned by the same cap. It does not commit uncommitted reserved pages.
  • May set VM_PROT_NONE; the kernel keeps the committed frames charged and associated with the pages, removes present user PTEs, and denies user access until a later protect restores readable permissions or decommit releases the frames.
  • Preserves existing zeroed/data contents when moving between VM_PROT_NONE and accessible protections.
  • Keeps W^X enforcement and rejects unknown bits.

unmap(addr, size):

  • Releases the reservation for the whole range.
  • Frees committed frames and physical commit charges for committed pages.
  • Releases virtual reservation charges for every page.
  • Fails if the range is not wholly covered by anonymous reservations owned by the same cap.

Page faults and validation:

  • Access to an unreserved page is an ordinary unmapped access.
  • Access to a reserved uncommitted page is a reservation fault. The initial Go contract should fail closed; demand commit on fault is a later policy choice, not implicit behavior in this slice.
  • Access to a committed VM_PROT_NONE page is a protection fault and must not release the reservation or physical frame.
  • A pure guard page is a reserved uncommitted page, not a committed VM_PROT_NONE page, unless the runtime deliberately needs hidden retained contents.
  • Kernel user-buffer validation and copy helpers must treat reserved uncommitted pages and committed VM_PROT_NONE pages as inaccessible.

Ledgers

The implementation needs two ledgers of record:

  • Virtual reservation pages: charged by reserve, released by unmap, and unchanged by commit, decommit, or protect. Compatibility map charges this ledger because it creates an implicit reservation.
  • Physical commit pages: charged by commit or map, released by decommit or unmap, and unchanged by protect.

The current ResourceLedger::frame_grant_pages can continue to represent physical commit pressure if the implementation gives anonymous committed pages, held MemoryObject caps, and borrowed object mappings one shared physical-page budget. Virtual reservations need a separate process-owned quota; do not hide virtual reservation pages in the physical frame ledger.

Address-space ownership tracking must become reservation-based instead of a flat list of committed anonymous pages. The reservation ledger must be sparse: Go-scale reservations can be terabytes, so reserve must not allocate one metadata entry per reserved page. A minimal host-testable model should track non-overlapping reservation intervals and sparse committed state inside those intervals, such as committed subranges or a committed-page map keyed only by pages that currently hold frames.

  • Reserved
  • Committed { frame, prot }

MemoryObject borrowed mappings stay outside anonymous reservations for this slice. Any future design that allows object mappings inside sub-reservations must explicitly define ownership and teardown interaction.

Implementation Gates

  1. Add VM_PROT_NONE, reserve, commit, and decommit to schema, generated bindings, capos-config, and capos-rt clients while preserving current method ids.
  2. Replace committed-page-only anonymous ownership tracking with a sparse reservation ledger that can represent large uncommitted intervals plus committed accessible and committed VM_PROT_NONE pages without allocating per-page metadata for every reserved page.
  3. Add a virtual reservation quota separate from the physical frame-grant ledger and make quota errors distinguish virtual exhaustion from physical commit exhaustion.
  4. Rework VirtualMemoryCap map/unmap/protect around the reservation ledger, including rollback paths, TLB shootdown completion, and process-exit cleanup.
  5. Keep ring and CapSet virtual pages reserved outside caller control.
  6. Update capos-rt so allocator paths can use caller-owned scratch buffers for reserve, commit, decommit, protect, and unmap without allocating during heap growth.
  7. Add host tests for overlap rejection, fixed no-replace hints, partial decommit, recommit zero-fill, VM_PROT_NONE protect/restore, quota accounting, rollback, and process teardown.
  8. Add QEMU proof coverage before closing the review finding: reserve-without-frame-commit, commit and write, protect to VM_PROT_NONE, restore and preserve contents, decommit and recommit zero-fill, unmap reservation release, virtual-quota exhaustion, and physical-commit quota release after decommit.

Non-Goals

  • Demand paging on first access.
  • Swap or overcommit policy.
  • File-backed mappings.
  • Copy-on-write snapshots.
  • Hierarchical VMAR/sub-address-space capabilities.
  • Sharing anonymous reservations across processes.

Those can build on the reservation ledger later, but the Go allocator contract must not depend on them.

Memory Authority Model Backlog

This backlog turns Memory Authority Model into reviewable work. It does not replace the selected milestone in docs/tasks/state.toml. Use it when a task touches memory authority, VirtualMemory, MemoryObject, SharedBuffer, pins, DMA, swap, OOM, or page-table mutation semantics.

Grounding

Project files read while creating this backlog:

  • docs/architecture/memory.md
  • docs/backlog/go-virtual-memory-contract.md
  • docs/proposals/oom-and-swap-proposal.md
  • docs/proposals/resource-accounting-proposal.md
  • docs/dma-isolation-design.md
  • docs/architecture/park.md
  • docs/architecture/scheduling.md
  • docs/architecture/userspace-runtime.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/go-runtime-proposal.md
  • docs/security/verification-workflow.md
  • docs/research/capability-systems-survey.md
  • REVIEW.md
  • docs/tasks/README.md

Relevant research grounding:

  • docs/research/zircon.md
  • docs/research/genode.md
  • docs/research/sel4.md
  • docs/research/eros-capros-coyotos.md
  • docs/research/llvm-target.md

Validation Expectations

  • For docs-only slices, run a documentation build or the narrowest available link/check command; QEMU is not required unless behavior changes.
  • For implementation slices, add host tests, Kani, QEMU, or targeted instrumentation according to the proof table in the proposal.
  • Behavior changes should record concrete design grounding and verification evidence in the changed proposal, backlog, review note, or workplan entry.

Slice A: Memory-State Inventory

Goal: make current memory transitions auditable before changing behavior.

  • Inventory anonymous VM operations in kernel/src/cap/virtual_memory.rs and kernel/src/mem/paging.rs: reserve, commit, protect, decommit, unmap, address-space drop, and rollback.
  • Inventory MemoryObject operations in kernel/src/cap/frame_alloc.rs: allocation, result-cap publication, map, unmap, protect, cap release, borrowed mapping teardown, and result serialization rollback.
  • Inventory page-table mutation and TLB shootdown paths in kernel/src/mem/paging.rs, kernel/src/arch/, and scheduler residency tracking.
  • Inventory user-buffer validation/copy/read paths and classify which ones already hold the address-space stability guarantee.
  • Inventory ParkSpace cleanup interactions with VirtualMemory.unmap, VirtualMemory.decommit, MemoryObject.unmap, process exit, and future shared waiters.
  • Record a compact state-transition table in docs/architecture/memory.md or a follow-up design note.

Exit criteria:

  • The inventory names every current state transition, authority object, ledger, lock, and cleanup path relevant to user memory.
  • Any missing proof becomes a concrete backlog item rather than a vague TODO.

Slice B: Host-Testable VM Ownership Model

Goal: move the parts of memory ownership that are pure logic into stronger host-test coverage where practical.

  • Decide whether sparse anonymous reservation interval logic should live in capos-lib or stay kernel-local with mirrored tests.
  • Add tests for fixed no-replace hints, overlap rejection, middle reservation split, tail split, adjacent split behavior, and full-range release.
  • Add tests for committed-page bookkeeping under partial decommit, VM_PROT_NONE protect/restore, and recommit zero-fill assumptions.
  • Add tests for borrowed mapping provenance: anonymous reservations and object-backed mappings must not overlap, and object-specific unmap must reject a different backing object.
  • Add ledger tests that virtual reservation, physical commit, held object backing, and borrowed mapping charges release exactly once on success, error, rollback, and process exit.

Exit criteria:

  • Pure memory ownership rules are tested without QEMU when they do not need hardware page tables.
  • Any remaining kernel-only rule is documented with the reason it cannot be moved into host-testable logic.

Slice C: Shared Mapping Identity and Pins

Goal: unblock future shared park words and real SharedBuffer APIs without using raw virtual addresses as authority.

  • Define the mapping identity record for MemoryObject-backed user pages: object id, object generation or backing epoch, page offset, mapping generation, address-space id, and address-space generation.
  • Decide whether shared waiters need explicit object pins, mapping pins, or a validation/use critical section around key derivation and wait registration.
  • Define how object pins are charged, released, and revoked, and which ledger owns the pin count or pinned page count.
  • Extend ParkSpace design only after shared key derivation can prove object identity and stale mappings cannot wake new owners.
  • Define service-owned SharedBuffer metadata for producer/consumer rings, notification, bounds, and role-specific permissions before file/network APIs consume it.

Exit criteria:

  • Shared wait/wake and service-owned shared buffers have an object-identity rule that survives unmap, remap, transfer, release, and reuse.
  • Reviewers can reject any future shared-memory API that relies only on a raw user virtual address.

Slice D: TLB and Frame-Reuse Proof

Goal: make stale CPU observers part of the proof, not a local implementation assumption.

  • Identify all paths that remove or weaken PTEs and later free or reuse frames.
  • Add targeted counters or QEMU diagnostics showing local flush and remote generation completion before frame return on an address space resident on multiple CPUs.
  • Exercise VirtualMemory.decommit, VirtualMemory.unmap, VirtualMemory.protect, MemoryObject.unmap, process exit, and failed rollback under SMP where possible.
  • Record which paths only need local flush because the address space cannot be resident remotely.
  • Cover huge-page (1 GiB / 2 MiB) frame teardown when huge mappings are eventually introduced. Today the Drop for AddressSpace walk in kernel/src/mem/paging.rs (huge-page branches at lines 450 and 462) skips HUGE_PAGE PTEs with a TODO pass-through, so once huge pages are mapped into a user address space the backing 1 GiB / 2 MiB frames would leak on process exit. The work is blocked until huge-page support is added but must be filed against any branch that introduces huge user mappings.

Exit criteria:

  • A branch that changes page-table mutation can cite a proof that frames are not reused while stale TLB entries can still access them.

Slice E: OOM Boundary Normalization

Goal: make memory failures distinguish validation, quota, global pressure, and fatal execution failure.

  • Audit VirtualMemory, MemoryObject, FrameAllocator, and ProcessSpawner allocation failures for inconsistent failed vs overloaded behavior.
  • Define typed result or exception mapping for virtual quota exhaustion, physical commit exhaustion, global frame pressure, and result-cap publication failure.
  • Add hostile exhaustion tests for each allocation boundary that can be reached by an untrusted process.
  • Add process-exit status design for future OOM page-fault termination.

Exit criteria:

  • Capability calls return predictable typed memory failures, and execution faults have an explicit lifecycle path rather than generic panic text.

Slice F: DMA and Swap Preconditions

Goal: keep later device and swap work blocked on the memory model pieces they actually require.

  • Before userspace DMA drivers, implement or prove device-owner states, generation-checked handles, stale interrupt/completion handling, resident unswappable DMA pages, and scrub-before-reuse.
  • Before swap, define page eligibility bits, slot metadata, encrypted and authenticated page storage, per-boot keying, and faulting-process termination on restore failure.
  • Keep MemoryObject, shared IPC pages, ring/CapSet pages, secret pages, and DMA pages out of phase-1 swap unless a later proposal explicitly changes the model and adds proofs.

Exit criteria:

  • DMA and swap implementation branches have explicit prerequisite checklists and cannot merge by relying on generic frame ownership alone.

Session-Bound Invocation Context

Selected milestone backlog for replacing caller-selected endpoint identity without continuing the Service Object Identity Migration.

The detailed design lives in Session-Bound Invocation Context.

Design Target

The final model has one live session context per process:

  • Process.session_context is immutable after spawn.
  • Endpoint calls deliver privacy-preserving caller session metadata to the server. Subject details are not disclosed unless the caller explicitly asks for disclosure through the service call and a broker/service disclosure scope allows the requested fields.
  • Broker-granted capabilities decide which service roots/facets a process may invoke.
  • Services key user-facing state by caller session plus service-local records.
  • Request payload fields are data and cannot select authority.
  • Cross-session raw transfer is governed by cap transfer scope: same_session, cross_session_shareable, or service_regrant_only. If a cap crosses sessions, the receiver session supplies the future invocation subject context.

The existing service-object routing proof remains historical coverage for receiver-cookie spoofing, lifecycle, and transfer behavior. It is not the application authority model.

Gate 1: Process Session Invariant

Visible proof: a focused QEMU session/process smoke shows a spawned shell and child process have exactly one immutable session context, inherited by default, while an attempt to inject or use a second independent invocation subject fails.

Implementation scope:

  • Add process-owned session context metadata with explicit system/service session support.
  • Make ProcessSpawner select the child session context through inherit or a trusted broker/session-manager path.
  • Prevent ordinary processes from holding or using multiple independent UserSession values as ambient invocation subjects.
  • Keep SessionContext internal to process/session mechanics. Do not expose principal, profile, account, role, tenant, auth-factor, external-claim, or display fields through endpoint defaults or proof-only shortcuts.
  • Add host tests for spawn/session validation and QEMU proof output for child inheritance.
  • Hostile proof cases must show that copied UserSession caps, payload data, shell strings, and manifest grant data cannot install a second process session or select another child session outside the trusted broker/session path.
  • Define the fail-closed freshness rule used by later endpoint work: normal endpoint calls from dead, revoked, or stale workload sessions fail except explicit recovery, logout, or renewal caps.
  • Preserve existing anonymous/operator shell behavior while making guest shell behavior explicitly manifest-gated and narrow.

Verification gate:

  • make fmt-check
  • cargo test-config
  • relevant host tests for session metadata
  • focused QEMU process/session proof
  • one existing login or shell proof touched by the session path

Status 2026-04-28 17:01 UTC: the kernel now gives each process an immutable SessionContext, ProcessSpawner inherits the caller context by default, and trusted broker/session paths can mint launchers fixed to a validated child context. make run-session-context proves a copied UserSession cap cannot relabel the child invocation context and that a broker profile mismatch fails closed. It also proves an expired guest session cannot refresh a broker shell bundle. Endpoint-delivered caller-session metadata, payload spoofing, and field-granular disclosure remain Gate 2 work.

Gate 2: Endpoint Caller Session Metadata And Disclosure

Visible proof: an endpoint server receives only an opaque service-scoped caller session reference by default, rejects payload attempts to spoof user, session, role, or participant identity, and receives bounded subject details only when the caller explicitly requests disclosure and a broker/service disclosure scope permits the requested fields.

Implementation scope:

  • Extend endpoint delivery metadata with an opaque service-scoped caller session reference and minimal freshness/liveness information.
  • Add an explicit disclosure mechanism for bounded subject fields, such as a per-call disclosure flag or a SessionDisclosure cap, and require a matching broker/service disclosure scope before fields are delivered.
  • Decide the first freshness enforcement point needed to close the open session expiry review finding.
  • Keep endpoint receiver metadata internal and non-authority-bearing.
  • Add hostile endpoint tests proving request bytes cannot override caller session context or force subject disclosure.
  • Add transfer-scope tests proving a same_session cap cannot cross into another session, while a cross_session_shareable cap invokes under the receiver’s session context after transfer.
  • Default transfer scope is fail-closed for cross-session movement: user/session-local caps use same_session or service_regrant_only, while cross_session_shareable must be explicitly chosen by the service or broker.
  • Add expiry/revocation cases before shared-service migration: broker refuses fresh bundles for stale sessions, stale normal endpoint invocations fail or report the documented freshness failure, and service-scoped session refs cannot be replayed as authority.

Verification gate:

  • make fmt-check
  • cargo test-lib
  • cargo test-config
  • focused endpoint/session QEMU proof
  • make run-spawn

Status 2026-04-28 17:43 UTC: commit 687511a implements the first Gate 2 slice. Endpoint delivery includes only a service-scoped opaque caller-session reference, epoch, and live/stale flags by default; it does not expose principal, profile, account, role, tenant, auth-factor, external-claim, display-name, or source-network fields. Normal endpoint calls from stale process sessions fail closed before transfer preparation or enqueue. make run-session-context proves a live child endpoint call carries nonzero opaque metadata despite spoofed user, session, and role payload labels, then proves the same child cannot invoke the endpoint after its session expires.

Status 2026-04-28 18:38 UTC: commit f0cb74b implements Gate 2 transfer-scope enforcement. Cap holds now distinguish same_session, cross_session_shareable, and service_regrant_only transfer policy. Endpoint IPC, endpoint returns, and spawn grants reject cross-session movement unless the scope permits it; fixed-session broker/launcher paths can regrant service_regrant_only caps. make run-session-context proves same-session spawn denial, raw IPC denial for a service-regrant-only UserSession, and receiver-session invocation after an allowed endpoint-cap transfer. Remaining Gate 2 work at that checkpoint was the explicit field-granular disclosure mechanism.

Status 2026-04-28 19:33 UTC: commit 0f92d77 completes the Gate 2 explicit disclosure mechanism. CALL SQEs carry a field-granular disclosure request mask, capability holds carry service/broker disclosure scope, and endpoint delivery exposes only the requested-and-allowed subject fields. The focused QEMU proof covers all three privacy cases: request without scope exposes no fields, scope without request exposes no fields, and request plus matching scope exposes only allowed fields while narrowing broader requests. Gate 3 is the chat session-keyed migration.

Gate 3: Chat Session-Keyed Migration

Visible proof: make run-chat shows chat membership keyed by an opaque service-scoped caller session reference and broker-granted chat capability, with no user-facing badge or receiver selector. Payload identity spoofing, unauthorized subject disclosure, and unauthorized cross-session participant id reuse fail closed.

Implementation scope:

  • Replace legacy chat receiver-selected member identity with session-keyed records.
  • Treat ChatRoot possession plus caller session context as sufficient for join, subject to broker/profile policy.
  • Keep global principal/account metadata private by default. If chat needs display name or guest/operator class, obtain it through explicit disclosure with a matching disclosure scope. If it only needs narrower behavior, use a broker-granted chat facet that encodes policy without revealing subject fields.
  • Add a narrower moderator facet if moderator behavior is needed; do not use payload roles or generic rights bits.
  • If chat supports multiple participant records per session, make returned participant ids server data scoped to the caller session, not transferable authority.
  • Decide chat cap transfer scope explicitly. Plain ChatRoot may be same-session or broker-shareable; participant-like state must not raw-transfer across sessions unless chat defines a share/regrant method. If chat accepts a share, future calls use the receiver session as the invocation subject.
  • Update shell examples and chat docs.

Verification gate:

  • make fmt-check
  • cargo test-config
  • cargo test-lib
  • make run-chat
  • hostile chat spoofing QEMU coverage

Status 2026-04-28 20:06 UTC: commit dc7ece4 implements the Gate 3 chat session-keyed migration. chat-server now serves with endpoint caller metadata, derives an opaque live caller-session key, and uses that key for member records, channel membership, sends, leaves, and polls. Calls without a live session key fail closed. system-chat.cue no longer assigns static chat badges to the shell or bot, and make run-chat proves normal chat runs through operator-session chat-client processes while the attempted delegated endpoint relabel remains rejected. The handle join field is request data only, not membership authority. After review, chat-visible sender labels are also service-assigned member-N values, so request handles do not drive displayed sender identity.

Gate 4: Shared-Service And Legacy Cleanup

Visible proof: normal shared-service demos no longer expose caller-selected service-visible identity, and service-object identity planning is retired from the active path.

Implementation scope:

  • Applied the session-keyed model to shared service state and terminal/stdio bridges that previously depended on legacy receiver metadata as identity. Aurelian ordinary player state is keyed by live endpoint caller-session metadata, and the focused adventure manifest grants NPC/chat authority through service or manifest capabilities without caller-chosen selectors.
    • Stdio bridges bind parent-side servicing to opaque live endpoint caller-session metadata and reject a bridge that later changes caller session, without asking the child to disclose global subject fields.
  • Remove normal shell and manifest syntax that lets a caller select a badge or receiver selector.
  • Keep low-level receiver metadata only as internal endpoint transport state or hostile-test fixture.
  • Update docs/capability-model.md, docs/architecture/ipc-endpoints.md, docs/security/trust-boundaries.md, demos, and status pages.

Verification gate:

  • make fmt-check
  • cargo test-lib
  • cargo test-config
  • make run-smoke
  • make run-spawn
  • make run-chat
  • make run-adventure
  • make docs

Status 2026-04-28 20:48 UTC: the guest-bundle cleanup slice narrows one Gate 4 identity/policy leak without touching adventure content. SessionManager.guest now requires an explicit manifest guest seed, AuthorityBroker.shellBundle returns no default guest service endpoints, and guest launchers use a resource-profile launcherProfile instead of the full manifest binary list. The default guest profile has an empty launcher; the session-context proof uses a dedicated one-binary guest profile for session-context-child; and the default smoke proof covers manifest-without-guest-seed denial.

Status 2026-04-28 21:36 UTC: the session-expiry review finding is closed for current shell/broker authority. Endpoint CALLs already required live caller sessions. Retained broker-issued non-endpoint bundle caps now expire at their bound session boundary: RestrictedLauncher rejects spawn/list calls after the minted session expires, and broker-issued SystemInfo caps are session-bound wrappers. Raw process-spawner, capability-manager, and process-handle control caps opt into live caller-session dispatch for any path that still exposes them. make run-local-users proves an expired operator shell cannot keep launcher authority through an already-issued bundle, and make run-session-context proves the narrow guest proof launcher also fails closed after expiry.

Status 2026-05-01 08:47 UTC: default password-authenticated local operator sessions no longer use fixed wall-clock expiry. The expiry enforcement proof still exists through manifests that set a non-default operator lifetime, and guest/anonymous/focused proof sessions remain short-lived.

Status 2026-04-28 22:02 UTC: the normal shell parser now rejects explicit client @... badge N grants and preserves delegated client endpoint identity when badge syntax is omitted. Default MOTD and adventure docs use omitted-badge launches, while hostile selector fixtures remain in low-level smoke coverage.

Status 2026-04-29 05:59 UTC: the focused chat manifest now routes the same kernel singleton chat_endpoint through init to the resident chat server that the broker facets into operator shell bundles. The focused chat shell no longer receives the resident chat-server export directly from system-chat.cue; the normal shell path uses the broker-issued operator bundle chat endpoint, while the resident bot keeps its manifest service grant.

Status 2026-04-29 06:17 UTC (the socket-backed SocketTerminalSession and TcpSocket.intoTerminalSession were later retired with the kernel socket owner, 2026-06-10; the UART-backed gate remains): terminal output is now behind the same live caller session dispatch gate as terminal input. Both UART-backed TerminalSession and socket-backed SocketTerminalSession require a live caller session for write, writeLine, and readLine, so stale shell sessions cannot keep a terminal bridge useful through write-only calls. TcpSocket.intoTerminalSession continues to return a move-only terminal cap, but the result hold is explicitly cross-session shareable because the Telnet gateway converts an accepted socket in its service session and then grants the terminal to the broker-minted shell session.

Status 2026-04-29 09:00 UTC: shell-serviced stdio bridge waits now bind to opaque live endpoint caller-session metadata during the active child wait and reject mismatched callers without asking the child to disclose global subject fields. Normal StdIO.close exits cleanly, rejected calls drain transferred caps before returning, and make run-session-context covers a transfer-bearing cross-session rejection. demos/service-common no longer exposes a badge-serving helper or badge field on EndpointCaller; new shared endpoint loop code uses EndpointUserData, with the old badge-named user-data alias kept only as a source-compatible alias after checked-in adventure code moved onto caller-session metadata.

Status 2026-04-29 09:44 UTC: the non-adventure endpoint caller-session reference is widened to 128 bits while keeping scoped_ref as the low 64-bit compatibility half and adding scoped_ref_hi as the high half. Endpoint delivery fills both halves from independently domain-separated, nonzero hashes, keeps epoch separate, and non-adventure service/session-context/stdout bridge guards now require and compare both halves. This remains proof-grade opaque reference derivation; a true keyed secret, scope-key rotation, and rotation lifetime policy are still deferred.

Status 2026-04-29 10:20 UTC: endpoint caller-session references now use an entropy-backed boot secret and HMAC-SHA256 over a non-reused endpoint service-scope id plus the kernel session id. scoped_ref remains the low ABI field, but it is no longer value-compatible with the old unkeyed low-half hash; scoped_ref_hi is the high ABI field of the same keyed opaque reference. epoch stays a separate field and is also domain-separated under the boot key so stale/freshness audit correlation rotates with boot-key and endpoint-scope changes. References rotate on reboot and endpoint object replacement. Stable service-audit identity across service upgrades remains future work.

Status 2026-04-29 11:00 UTC: the session-context QEMU proof now calls two distinct endpoint service scopes from one child process/session before expiry and asserts their opaque caller-session reference tuples differ while both remain live. This covers endpoint-object replacement/scope changes at the demo-proof level; stable service-audit identity across upgrades remains future work.

Status 2026-04-29 20:33 UTC: the session-bound proposal and shared-service backlog distinguished landed Aurelian ordinary player-state migration from the then-remaining adventure NPC/service-authority cleanup. The server keys ordinary player records by live endpoint caller-session metadata.

Status 2026-04-29 21:40 UTC: Gate 4 implementation and verification are closed for mainline. make fmt-check, cargo test-lib, cargo test-config, make run-smoke, make run-spawn, make run-chat, make run-adventure, focused make docs, and git diff --check passed after commit faeff80 hardened the docs PDF render path for automated builds. The focused adventure manifest check rejects legacy badge: selectors, and make run-adventure covers selector-free Adventure/chat service grants plus the resident scenario test. The follow-up paper/status alignment records this as landed C1 evidence in docs/paper/evidence-gaps.md, docs/paper/plan.md, and papers/schema-as-abi/main.typ.

Follow-Up: Session Lifecycle, Logout, And Renewal

The completed milestone closed the stale-session authority hole for current shell and endpoint paths, but it did not make short fixed wall-clock expiry a complete interactive session UX. Follow-up work belongs with identity, runtime-network-shell, and local-user management rather than reopening this completed milestone.

Target:

  • Keep one immutable SessionContext per process.
  • Add a trusted mutable session liveness cell keyed by session id/epoch with states live, logged_out, revoked, expired, and recovery_only.
  • Move liveness checks from timestamp-only immutable metadata toward session-manager state that can be logged out, administratively revoked, expired, or renewed without relabeling a running process.
  • Implement UserSession.logout and make owner-shell exit / gateway disconnect close the sessions they own.
  • Add a narrow SessionManager.renew or broker refresh path that is allowed only for explicit renewal/recovery methods after expiry.
  • Make renewal mint fresh grant leases or wrapper caps when policy needs a new decision. Renewal must not make stale ordinary grants fresh by accident.
  • Preserve explicit revocation as stronger than renewal, except for separately audited recovery policy.
  • Treat password-authenticated local operator shells as logout/connection/ process-tree/admin-revoke driven by default, with idle lock, renewal prompt, or configured hard maximum as policy choices. Guest, anonymous, remote, federated, and elevated grants can remain short-lived.

Status 2026-05-02 08:43 UTC: the remote-session lifecycle slices add the kernel live/logged_out liveness cell for SessionManager-minted sessions, wire UserSession.logout through the remote CapSet gateway, and reject already-admitted endpoint returns after caller logout/session death. Broker and restricted-launcher reconstruction now resolves the existing kernel liveness cell by minted session id and fails closed when it is missing or logged out. Endpoint RETURN rechecks the caller session after target CQ-space checks and before copying result bytes or installing result caps; stale callers receive an invoke-failed completion when possible, the in-flight call is canceled rather than restored, and prepared result-cap move sources roll back. This closes explicit remote logout, connection-close propagation, and already-admitted endpoint result delivery after session death for the current kernel endpoint path. A 2026-05-11 follow-up also makes clean local owner-shell exit call the held UserSession.logout() path before process exit, with the shell smoke asserting the scheduler-observable hook. The full lifecycle target still needs renewal, administrator revocation, live remote proxy object cleanup, and complete audit reason separation.

Verification gates before this is closed:

  • host tests for liveness cell state transitions and renewal denial after revoke;
  • QEMU proof that exit/terminal close on an owner shell logs out the session and prevents future broker bundle refresh. The logout propagation half is complete for clean shell exit; broker refresh refusal remains covered by the existing logged-out liveness checks and should be re-proven when renewal or replacement shell UX lands;
  • QEMU proof that pre-expiry renewal keeps the shell usable while old ordinary grant epochs do not silently refresh;
  • QEMU proof that post-expiry normal calls fail while explicit renew/logout paths remain available;
  • QEMU proof that result-cap delivery after session expiry does not install fresh caps into a stale caller, including a move-source rollback case;
  • audit output distinguishing expiry, explicit logout, renewal, administrator revoke, process-exit cleanup, and stale-use denial.

Deferred Work

  • Remote capability transport and network transparency.
  • Durable account store and external identity binding persistence.
  • Full quota service and scheduling-context donation policy.
  • Mutable session liveness cells, explicit logout/close propagation, and renewal/recovery paths for usable long-running shells.
  • Explicit cross-session sharing UX and audit workflow.
  • Stable service-audit identity for endpoint caller-session references across intentional service replacement or upgrade.
  • Delegated-subject / act-on-behalf-of context. See docs/proposals/delegated-subject-context-proposal.md.

Service Object Identity Migration

ARCHIVED — superseded by Session-Bound Invocation Context and the active backlog Session-Bound Invocation Context (2026-04-28). This file is retained as historical context only; do not select work from it. The Big Chunk 2 subject/proof-root and shared-service migration here are NOT on the active mainline path.

Status: superseded on 2026-04-28 14:35 UTC by Session-Bound Invocation Context and the active selected backlog Session-Bound Invocation Context. The Big Chunk 1 synthetic routing/lifecycle proof remains useful historical coverage, but Big Chunk 2 subject/proof root opening and shared-service service-object migration should not proceed on the active mainline path.

Historical plan for replacing caller-selected service-visible identity with kernel-routed service object capabilities and userspace-verified subject capabilities.

This backlog intentionally uses large implementation chunks. Each chunk should land as a coherent reviewed branch with one focused end-to-end QEMU proof plus the affected host tests, rather than splitting the transition into dozens of small branches that each require full verification.

Design Target

The final model has two separate authorities:

  • Subject/proof capabilities: issued by trusted userspace services such as SessionManager, service-principal issuers, workload-identity issuers, anonymous/guest issuers, or AuthorityBroker.
  • Service object capabilities: minted by a trusted service root/factory after it validates subject/proof authority and policy context.

The kernel enforces generic capability mechanics only:

  • live generation-tagged cap-table entries;
  • endpoint/object routing;
  • receiver immutability across copy, move, IPC transfer, and spawn;
  • trusted mint authority;
  • revocation/lifetime checks;
  • generic queue, byte, cap-count, and scheduling bounds.

Userspace services enforce policy:

  • trusted issuer selection;
  • subject facts, roles, sessions, guests, service accounts, and workloads;
  • external subject admission and local/pseudonymous principal mapping;
  • audit context;
  • quota bucket selection;
  • application object records and facets;
  • whether a subject must stay live for a given object.

Request fields remain data. They must never select service authority.

External Subject Alignment

This migration must preserve the identity model already described in docs/proposals/user-identity-and-policy-proposal.md, docs/proposals/oidc-and-oauth2-proposal.md, and docs/backlog/local-users-management.md.

External subjects enter capOS only through an admission pipeline:

  1. An external verifier validates the provider assertion, such as an OIDC ID token, passkey assertion, certificate chain, cloud workload token, or remote gateway-authenticated claim.
  2. Admission normalizes the external key as provider kind, issuer, tenant, and subject, then either maps it through ExternalIdentityBinding to an existing local principal or admits it as an explicitly configured pseudonymous/guest/service principal.
  3. SessionManager, a service-principal issuer, or a workload-identity issuer mints the local subject/proof capability. Imported provider groups, roles, tenant IDs, acr, amr, device posture, and token age are ABAC inputs for this mint decision, not downstream object authority.
  4. A service root validates the local subject/proof capability and policy context before minting a service object capability.

Consequences:

  • Service objects store verified local subject facts and audit context, not raw external tokens or provider-specific claim bags.
  • A provider claim can influence the object minted at open time only through trusted admission, broker, or verifier capability paths.
  • A stale, disabled, or unbound external subject must fail before a service object is minted.
  • Remote gateways translate connection authentication into local subject/proof caps; ordinary application services should not authorize directly from network connection identity.
  • The same object-capability migration should work for local password/passkey sessions, OIDC users, cloud workload identities, service accounts, anonymous/guest sessions, and future remote cap transports.

Network Transparency Alignment

The first implementation is local to one kernel and one endpoint object graph, but it must not block future network-transparent capability transport.

Design constraints for this migration:

  • Do not serialize kernel receiver selectors, cap-table handles, endpoint object ids, generation values, or server cookies as portable object names.
  • Treat service object capabilities as live references. A future remote bridge should export/import them through connection-local tables, not through global URLs or raw selector strings.
  • Preserve Cap’n Proto-style disconnect behavior: if the local endpoint, server, or remote connection dies, imported references become broken and calls fail explicitly rather than silently rebinding to a new server.
  • Keep persistent restore separate from live object routing. If a service needs a durable object reference, restore should go through a capability-bearing persistence/naming service that can authorize and mint a fresh live object.
  • Keep subject admission separate from transport identity. A remote bridge may authenticate a TLS/OIDC/certificate/session channel, but application services should still receive local subject/proof caps and service object caps.
  • Keep object equality out of the first implementation. If future remote transport needs equality, expose it through a deliberate service or transport protocol rather than assuming global kernel object identity is comparable across hosts.

The local receiver-cookie model should therefore be an implementation detail behind local ServiceRef capabilities. The portable concept is the authority to call a typed object reference, not the routing selector used by one kernel.

Non-Negotiable Invariants

  • Only trusted mint paths create a new service object identity.
  • Ordinary copy, move, IPC transfer, and spawn preserve the same service object.
  • A child that receives an object capability acts through that same object, unless a service method explicitly mints a new delegated object.
  • Endpoint routing is derived from the invoked capability, not from request bytes, shell text, manifest user input, process id, user name, role name, or numeric labels.
  • The kernel never interprets users, accounts, roles, tenants, service accounts, rooms, NPCs, moderators, file owners, or workload names.
  • Server cookies used for object dispatch must be generation-safe and must not be raw pointers in the first implementation.
  • Move transfer remains transactional in capOS: a failed delivery or canceled receive rolls back reserved source authority rather than silently dropping it before adoption.
  • Application rights should be represented by typed interfaces or narrower object capabilities, not by generic permission bitmasks.

Big Chunk 1: Core Service-Object Routing And Lifecycle

Visible proof: a synthetic service in QEMU mints two distinct object capabilities, routes calls by kernel-delivered receiver cookie, transfers one object through IPC and process spawn, and proves copied/moved/spawned handles still reach the same object record. The proof also injects forged identity and selector-like bytes into request payloads and shows they do not affect routing.

Implementation scope:

  • Make service-object terminology explicit in cap-table and endpoint code while preserving compatibility with current hold metadata.
  • Introduce or formalize endpoint-scoped receiver records with generation-safe server cookies.
  • Add a trusted mint interface/path owned by endpoint owner or explicit mint authority.
  • Deliver receiver cookie plus interface/method/payload/cap grants to the server.
  • Preserve receiver identity across copy, move, IPC transfer, and spawn.
  • Add lifecycle behavior for receiver close/revoke and stale-generation reuse sufficient for the synthetic proof. Broader release/exit cleanup remains in Big Chunk 4.
  • Add a synthetic service-object demo, manifest, shell/host harness, and hostile checks.

Verification gate:

  • make fmt-check
  • cargo test-lib
  • cargo test-config
  • focused QEMU proof target for the synthetic service-object routing demo
  • make run-spawn or another focused spawn/transfer proof that exercises the modified grant path

Review notes:

  • 2026-04-28: workplan/service-object-routing-core added the first focused service-object routing proof, but does not close this whole chunk. The branch introduced CapGrantMode.serviceObject as the explicit spawn-grant spelling for endpoint-scoped service object facets, kept clientEndpoint as compatibility spelling, added receiver-cookie preservation host checks, and added make run-service-object-routing for a synthetic two-object QEMU proof with payload spoofing, copy and move service-object IPC transfer, and nested spawn delegation. The proof also rejects service-object minting through the legacy ProcessSpawner endpoint-result facet path, keeping that compatibility exception scoped to clientEndpoint. At that checkpoint, generation-safe server cookie representation beyond fixed demo constants and explicit receiver lifecycle/close/revoke coverage still remained.
  • 2026-04-28 14:10 UTC: commit a4655f0 completed Big Chunk 1. The focused demo now encodes service receiver cookies as receiver-index plus generation, stores service-side object records, proves close and revoke rejection for later calls, and queues a stale alpha call before reusing the alpha record slot so the stale generation is rejected instead of dispatching to the reused record.
  • Review must inspect capos-lib/src/cap_table.rs, kernel/src/cap/endpoint.rs, kernel/src/cap/ring.rs, kernel/src/cap/transfer.rs, capos-config/src/ring.rs, capos-rt/src/ring.rs, capos-rt/src/client.rs, and the new demo.
  • Do not migrate chat, adventure, or stdio in this chunk. The synthetic proof should isolate the kernel/runtime semantics first.

Big Chunk 2: Subject/Proof Authority And Service Root Opening

Visible proof: a root service accepts a trusted local subject/proof capability, validates it through a verifier or broker capability, mints a service object, and rejects fake same-shape subject objects, expired/stale proofs, wrong audience, and payload identity spoofing. A spawned child that receives only the object cap can use the object but cannot open sibling objects.

Implementation scope:

  • Add minimal schema/runtime surface for a local subject/proof verifier. Keep it local and bounded; do not require full remote cryptographic identity yet.
  • Model the verifier result in the same shape used by external admission: local or pseudonymous principal id, principal kind, auth strength, policy profile, resource profile, audit context, and optional claim-derived ABAC attributes.
  • Bind subject/proof data to audience, purpose, and freshness enough for the proof.
  • Add service-root open semantics over the core service-object mint path.
  • Store verified subject, audit context, quota placeholder, policy mode, and optional liveness link in service-owned object records.
  • Add hostile checks for fake subject providers and request-field identity spoofing.
  • Add explicit delegation behavior: raw object-cap transfer preserves the same object; explicit service delegation mints a new object only through a service method.

Verification gate:

  • make fmt-check
  • cargo test-config
  • relevant host tests for subject/proof encode/decode and validation
  • focused QEMU proof target for subject/proof service-root open
  • one existing session proof such as make run-login or make run-ssh-public-key-auth if touched by the subject path

Review notes:

  • Avoid broad cryptographic protocol work in this chunk. The target is local issuer-verifiable subject/proof authority, not production remote federation.
  • Keep application role policy out of the kernel and out of generic rights bits.
  • Do not bypass ExternalIdentityBinding or admission policy when adding external-subject tests. If a fixture models OIDC, passkey, cloud, or certificate input, it must first resolve to a local subject/proof cap before any service object opens.

Big Chunk 3: Shared-Service Migration

Visible proof: existing shared-service demos run without caller-selected service-visible identity. Chat, stdio/terminal child bridges, and adventure receive service object capabilities directly or open them through root/factory interfaces. Existing shell workflows still work, but children cannot choose or rewrite the object identity they receive.

Implementation scope:

  • Convert Chat into root/object interfaces such as ChatRoot and ChatParticipant, with subject binding at the root boundary.
  • Convert stdio or terminal child bridges that depend on endpoint-client identity into service object caps or a narrowed terminal/session object.
  • Convert adventure player/NPC authority to service objects, including room speech over migrated chat object caps.
  • Update shell launch examples and spawn grant parsing so ordinary grants name existing capabilities only.
  • Preserve compatibility only where a focused legacy smoke still needs it, and mark it as transitional.
  • Add hostile smokes proving request-field identity spoofing, child relabeling, and unauthorized sibling minting fail.

Verification gate:

  • make fmt-check
  • cargo test-config
  • cargo test-lib
  • make run-chat
  • make run-adventure
  • focused stdio/terminal or shell proof touched by the migration
  • one hostile service-object delegation QEMU proof

Review notes:

  • This is deliberately a large branch. Avoid stopping after only chat unless the branch becomes too risky to review coherently.
  • If adventure or stdio exposes an implementation blocker, record it as a task under docs/tasks/ with concrete remediation before merging partial migration.

Big Chunk 4: Legacy Compatibility Retirement And Naming Cleanup

Visible proof: no normal shell, manifest, shared-service demo, or docs path exposes caller-selected service-visible identity. Internal names match the implemented model where the field is only an endpoint-scoped receiver selector.

Implementation scope:

  • Rename internal fields, docs, and diagnostics from legacy identity language to receiver-selector terminology where behavior has migrated.
  • Remove compatibility grant syntax and manifest fields that can no longer be used by supported smokes.
  • Remove the default MOTD adventure launch commands that still expose explicit legacy receiver selectors, or replace them with service-object-safe commands after the shared-service migration.
  • Tighten validation so service object authority cannot be constructed from user input.
  • Add release/exit cleanup coverage for service object caps with queued calls, in-flight returns, server-owned object records, and receiver revocation.
  • Update docs/capability-model.md, docs/architecture/ipc-endpoints.md, docs/security/trust-boundaries.md, docs/proposals/service-object-capabilities-proposal.md, docs/proposals/user-identity-and-policy-proposal.md, docs/status.md, and relevant backlog files, including notes about future network-transparent import/export and persistent restore boundaries.

Verification gate:

  • make fmt-check
  • cargo test-lib
  • cargo test-config
  • cargo test-ring-loom if ring metadata changes
  • make run-smoke
  • make run-spawn
  • make run-chat
  • make run-adventure
  • make docs
  • generated-code check if schema or generated bindings changed

Review notes:

  • This branch should close the compatibility migration or explicitly preserve only low-level hostile-test fixtures.
  • Do not leave user-facing syntax or docs that imply clients may choose service object identity.

Deferred Work

  • Remote capability transport and network-transparent object references.
  • Production cryptographic subject proof protocols.
  • Persistent restore of service objects across server restart.
  • Full quota service and scheduling-context donation policy.
  • Cross-host federation and external identity mapping.

Stage 6 Capability Semantics Backlog

Detailed decompositions for Stage 6 follow-up work. docs/tasks/README.md links here but should not inline these subtasks.

Notification Objects

Implement a lightweight signal/wait primitive for interrupts and event delivery without full endpoint message overhead.

  • Define schema/ABI and wait semantics.
  • Add kernel object plus ring operations or methods.
  • Add QEMU smoke for signal, wait, timeout, and revoke/drop cases.

Promise Pipelining

Implement promised-answer targeting for CALL SQEs after transfer/result-cap insertion is stable.

  • Define promised-answer IDs, dependency encoding, and failure rules. Existing design decision: pipeline_dep is the process-local promised-answer ID allocated by the runtime, and pipeline_field is a zero-based sideband CapTransferResult record ordinal in that answer’s completion. It is not a Cap’n Proto schema field or payload path. Unsupported mappings fail closed, with concrete transport error codes left to the implementation slice before the kernel accepts CAP_SQE_PIPELINE.
  • Resolve dependency chains in the kernel without userspace round-trips.
  • Add runtime placeholders and an IPC pipeline smoke. The smoke must prove pipeline_dep is the promised-answer ID, pipeline_field resolves the selected sideband result-cap ordinal, and mismatched result payload bytes do not affect kernel dependency resolution.

CapabilityManager

Add management-only introspection and grant helpers after transfer/release semantics are stable.

  • Define list/grant schema and authority boundaries.
  • Implement read-only cap table introspection.
  • Add grant smoke and hostile checks for non-manager callers.

Session-Bound Invocation Context

Replace caller-selected endpoint identity with session-bound invocation context as described in docs/proposals/session-bound-invocation-context-proposal.md. The selected 2026-04-28 migration plan lives in docs/backlog/session-bound-invocation-context.md.

Current status: Gate 0 delegated-client relabeling containment, the transitional representation substrate, the synthetic service-object routing/lifecycle proof, Gate 1 process-session invariant, Gate 2 privacy-preserving endpoint caller-session metadata, and Gate 3 chat session-keyed migration have landed. Existing code still has a badge-named u64 field in several transport structs, but the active design treats that field as legacy receiver metadata, not as service capability. Commit a4655f0 at 2026-04-28 14:10 UTC completed the historical service-object routing proof with generation-checked receiver cookies, service-side object records, close/revoke rejection, stale-cookie rejection after record reuse, receiver-cookie routing despite spoofed request bytes, copy/move IPC transfer, and nested spawn delegation.

Gate 4 in docs/backlog/session-bound-invocation-context.md is implemented and verified on mainline: shared-service legacy cleanup has moved normal chat, adventure, and terminal/stdio paths off caller-selected receiver metadata. Do not continue the superseded subject/proof root-opening path from docs/backlog/service-object-identity-migration.md unless the selected milestone changes again.

Paper prerequisite. Gate 2 endpoint caller-session metadata, Gate 3 chat session-keyed migration, and Gate 4 shared-service cleanup have landed. The paper/status closeout for whitepaper claim C1 (“schema-typed methods replace parallel rights”) remains peer-owned: docs/paper/evidence-gaps.md, docs/paper/plan.md, and the matching #todo block in papers/schema-as-abi/main.typ still need to reflect the landed evidence.

Gate 0: delegated-client relabeling containment

This is the first Telnet Shell Demo blocker. It must land before shell launch can be exposed through any network-backed terminal.

  • Add hostile coverage proving an ordinary shell or delegated endpoint client cannot re-label a client endpoint by choosing a different identity in a spawn grant. Cover explicit badge N, the legacy badge-zero encoding that old omitted syntax used to produce, and current omitted shell syntax preserving the delegated source identity. Worker B checkpoint: normal shell help and smoke-help assertions no longer advertise badge N. Worker C checkpoint: init spawn hardening now mints a nonzero delegated client facet into a child init process and asserts that explicit-badge and badge-0 relabel spawn attempts fail.
  • Change ProcessSpawner so ClientEndpoint grants from delegated client facets preserve the source identity and reject attempts to set a different value. Endpoint owners and trusted parent endpoint result caps remain the only transitional paths that may mint a new client identity.
  • Remove arbitrary badge N from normal capos-shell help and smoke-help launch examples; keep legacy manifest/debug syntax only where the kernel enforcement still rejects delegated-client relabeling. The default MOTD adventure launch commands now omit explicit legacy selectors; Gate 4 in docs/backlog/session-bound-invocation-context.md still owns retiring remaining manifest-level selector compatibility after session-bound chat and adventure migration.
  • Document the containment in docs/architecture/ipc-endpoints.md and trust-boundary docs before exposing shell launch through Telnet.

Historical Gate 1: service object representation

  • Define the transitional kernel/runtime representation for existing endpoint-backed service facets: target endpoint, interface id, and legacy receiver metadata. 2026-04-25 18:31 UTC checkpoint: the first representation slice reuses CapHold { object_id, interface_id, badge } as endpoint object, service interface id, and endpoint-scoped receiver selector for existing endpoint-backed service objects. Dispatch and spawn now preserve the held metadata for ordinary delegation; explicit trusted minting remains open.
  • Complete the transitional representation replacement with explicit generation-safe receiver records and lifecycle coverage for the synthetic proof. Big Chunk 1 now covers trusted service-object minting, receiver-cookie dispatch, receiver-preserving copy/move IPC transfer and spawn, request-byte spoofing checks, generation-safe server cookies, and close/revoke/stale-generation rejection. 2026-04-28 14:10 UTC checkpoint: commit a4655f0 added generation-checked receiver cookies, service-side object records, close/revoke rejection, and stale-cookie rejection after record reuse.
  • Add the minimum trusted mint path needed for the synthetic service-object proof: endpoint owner or explicit mint authority creates the initial service object cap; ordinary clients only copy or move it. 2026-04-28 checkpoint: CapGrantMode.serviceObject lets endpoint owners mint copy-transferable endpoint-scoped service object facets for child processes while delegated service object caps cannot relabel the held interface or receiver cookie. The legacy ProcessSpawner endpoint-result facet exception remains scoped to clientEndpoint and is rejected for serviceObject.
  • Scope receiver selectors to the target endpoint and keep them out of shell syntax, manifest user fields, and service policy labels.
  • Preserve the current held receiver metadata across copy and move transfer. Ordinary transfer must not mint a sibling object.
  • Prove receiver identity preservation across copy, move, IPC transfer, and spawn in the synthetic service-object QEMU proof. 2026-04-28 checkpoint: make run-service-object-routing exercises copy-transfer and move-transfer of service object caps through IPC, nested spawn delegation, and hostile payloads that try to name the other receiver.
  • Enforce that client-held service object caps cannot use endpoint receive/return authority unless a separate server-facing interface grants that authority.
  • Deliver endpoint metadata so servers can dispatch current object-shaped calls without treating it as caller-selected identity. 2026-04-25 18:45 UTC checkpoint: trusted manifest/init minting now uses explicit CapabilityAs spawn grants to request a service interface from endpoint exports, validation rejects the same override for non-endpoint exports, and system-spawn.cue proves a non-Endpoint service interface plus selector reaches the server receive metadata.
  • Rename or wrap server delivery surfaces around receiver-selector/server- cookie terminology once the behavior is receiver-selector-only.

Gate 2: process session invariant

  • Add process-owned immutable session context with explicit system/service session support.
  • Make child spawn inherit the parent’s session by default and require trusted broker/session-manager authority for different child sessions.
  • Add host and QEMU coverage proving ordinary processes cannot inject or use a second independent session subject.

Gate 3: endpoint caller session metadata

  • Deliver opaque service-scoped caller-session references and freshness results to endpoint servers.
  • Add an explicit subject-disclosure path so global principal/profile details are not revealed to services by default.
  • Add hostile coverage proving request bytes cannot spoof session identity or force disclosure.

Gate 4: shared-service demo migration

  • Convert chat identity from legacy receiver selectors to broker-granted chat roots/facets plus service-scoped caller-session references.
  • Finish adventure NPC/service-authority cleanup and any remaining stdio/terminal child bridge paths that depend on caller-selected endpoint identity. Aurelian ordinary player state is already keyed by live endpoint caller-session metadata.
  • Retire normal user-facing badge/receiver-selector syntax after chat, adventure, stdio, and endpoint smoke paths no longer depend on it.

Scheduling Context And Resource Donation

Convert the roadmap’s priority/budget donation and session-quota ideas into a measured design before adding new scheduler policy.

  • Record current direct-switch IPC timing and priority-inversion risks.
  • Define scheduling-context donation metadata.
  • Define resource donation parameters for session-creating caps.

Init ELF Embedding

Done 2026-05-25 23:26 UTC. The init ELF ships inside the kernel binary via include_bytes!, not as a manifest entry or separate Limine module. kernel/build.rs reads the prebuilt init/ artifact (CAPOS_INIT_ELF, with a conventional-path fallback) and emits a kernel::boot::INIT_ELF: &[u8] static; kernel bootstrap parses it through the existing capos_lib::elf loader. Init stays a standalone crate with its own linker script and code model. Embedding is byte packaging, not linker merging.

Landed as a hybrid keyed on the reserved selector rather than an always-embedded init: initConfig.init.binary is a generic “which binary is PID 1” selector, and most boots run a non-init binary as PID 1 (run-smoke’s shell, ~70 focused test-as-PID-1 manifests). So embedding applies only when init.binary == capos_config::RESERVED_INIT_BINARY_NAME ("init"): then PID 1 loads from INIT_ELF with no binaries resolution, and manifest validation (capos-config/mkmanifest) rejects any binaries entry named "init". Any other selector still resolves PID 1 from SystemManifest.binaries exactly as before. The real-init manifests (system.cue via the shared _baseBinaries plus the focused init.binary == "init" manifests) drop their init binaries entry; run-smoke and the test-as-PID-1 manifests are unchanged.

Because the embedded image is the canonical init, child spawns that reference the init binary by name (e.g. system-spawn.cue’s spawn-hardening fixtures) keep working: run_init injects the embedded bytes into the ProcessSpawner binary set under the reserved name when init is embedded (the BootPackage cap serves only the serialized manifest bytes), so the spawnable set matches the pre-embedding state without init appearing in the serialized manifest.

Proof: make run-init-embedding (minimal system-init-embedding.cue: PID 1 from INIT_ELF, no reserved binaries entry) and make run-smoke (PID 1 = shell, unchanged). cargo test-mkmanifest / cargo test-config cover the reserved-name rejection and the init-ref skip.

Reference: docs/proposals/service-architecture-proposal.md section Init Binary Embedding.

Remote Session CapSet Client Backlog

Detailed decomposition for the remote host app path described in Remote Session CapSet Clients. docs/tasks/README.md should point here when selecting implementation slices; it should not inline the details.

Visible Outcome

make run-remote-session-capset-interop boots capOS in QEMU, starts a loopback-scoped remote session gateway, runs a regular host-side Rust client on the host, authenticates or exercises an explicitly configured guest/anonymous denial path, obtains a RemoteSession, lists a broker-issued RemoteCapSet, gets typed capabilities by name/interface ID, calls at least two granted capabilities, proves missing/wrong-interface denials, logs out or disconnects, and observes stale proxy calls fail closed.

The first harness can be a small CLI because it is easy to script. The product shape should also support a native desktop GUI, a Tauri app whose Rust backend holds the remote CapSet, or a webapp whose trusted server/gateway holds the remote CapSet and exposes only UI frames, command descriptors, or bounded tool requests to browser JavaScript. The UI path can be bidirectional: the host UI may grant a narrow UI-surface capability back to capOS-side services or agents so they can propose task-specific panes, command palettes, visualizations, theme hints, and layout changes without receiving arbitrary host UI authority.

The ordinary operator run story is: start capOS with make run, note the printed remote CapSet: tcp 127.0.0.1 <port> -> guest :2327 line, then start one of the host clients against that endpoint. make run injects the host USER as the default operator account name on the capOS side; the CLI may take --user (or CAPOS_REMOTE_SESSION_USER) as an explicit operator override, but the web bridge keeps the login username field empty by default to avoid leaking host identity hints into the page before authentication. The current repo-local commands are:

make run
cargo run --manifest-path tools/remote-session-client/Cargo.toml \
  --target x86_64-unknown-linux-gnu \
  --bin remote-session-client -- --host 127.0.0.1 --port <printed-port>
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui

The CLI also accepts --launch-adventure for the default-manifest proof that starts the Adventure service graph through serviceLaunch and requires a running status. --adventure-status follows a successful Adventure launch with bounded Adventure.status, Adventure.look, and Adventure.inventory calls through the session-bound worker; --adventure-go <direction> adds the first mutable typed DTO call by invoking bounded Adventure.go(direction) and checking the returned text/room response. The same CLI path now accepts bounded --adventure-take <item>, --adventure-use <item>, and --adventure-drop <item> controls for simple item interactions. The focused positive proof is make run-remote-session-adventure-interop; the existing make run-remote-session-capset-interop fixture remains a launch-denial proof shared with the browser UI smoke path.

The CLI and trusted local web bridge are development tools in this repo. The repo-local Tauri path reuses the same Rust backend boundary by loading the loopback remote-session-ui surface in a desktop webview:

CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri

By default that target first runs a policy preflight over the reviewed check/dev scaffold, then checks Tauri CLI and Linux build prerequisites, reports dependency/scaffold status, and runs a deterministic wrapper cargo check when the host has those prerequisites. Set CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev. Missing host Tauri packages fail with explicit diagnostics and point operators back to make remote-session-ui; the Tauri wrapper is not a different authority model. CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh runs only the scaffold guardrail and does not need Tauri system packages or a desktop session. package and automation modes are intentionally blocked until distributable packaging and desktop automation receive reviewed designs.

The first visible proof keeps QEMU host forwarding and a development transport. The current implementation uses length-prefixed schema-framed Cap’n Proto DTOs for remote login, session summary, CapSet list/get, calls, denials, and logout. Standard capnp-rpc framing and live object proxies remain the transport direction, but the first proxy slice is now explicitly dual-stack: host-backend-only capnp-rpc proxy objects over the existing DTO gateway first, then guest-wire replacement after the capOS userspace runtime decision.

Implementation Status

Implemented and active slices:

  • The capnp-rpc transport DTO surface is pinned in schema/capos.capnp ahead of the transport rewrite: RemoteAuthStart, RemoteAuthStep, RemoteServiceGrantRequirement, RemoteServiceExport, RemoteServiceProfile, plus the RemoteSessionGateway, RemoteAuthFlow, RemoteSession, RemoteCapSet, RemoteServiceCatalog, and RemoteServiceRunner interfaces. Round-trip coverage lives in capos-config/tests/remote_capnp_rpc_dto_roundtrip.rs.

  • Runtime placement decision: capnp-rpc v0.25 is std-only and needs a futures executor, while demos/remote-session-capset-gateway/src/main.rs is a #![no_std] #![no_main] gateway with a synchronous accept/recv/handle/send loop. Therefore the first proxy implementation is host-backend-only. The trusted Linux Rust backend may host a local capnp-rpc facade/proxy layer for chat or Adventure and translate those calls into the existing RemoteGatewayRequest / RemoteGatewayResponse DTO transport. The gateway, schema, generated bindings, kernel services, browser API, and browser view models stay unchanged for that slice. This is a temporary dual-stack period: the backend proves proxy semantics and exception mapping over the DTO wire, but it must not be documented as live standard capnp-rpc support inside the capOS guest. The gateway wire replacement remains gated on a reviewed capOS userspace async runtime or a reviewed sync-friendly Cap’n Proto RPC adapter. The completed task file docs/tasks/done/2026/remote-session-host-backend-capnp-rpc-facade.md records the implementation metadata and validation for the host-backend slice.

  • Host-backend capnp-rpc facade for Chat landed 2026-05-13 08:29 UTC. tools/remote-session-client/src/rpc_facade.rs creates a local capnp-rpc Chat client/server object in the trusted Linux backend and translates join, leave, send, who, and poll calls into the existing synchronous RemoteGatewayRequest / RemoteGatewayResponse DTO operations. The CLI client and trusted web bridge now route chat calls through the same backend-only facade. Browser JavaScript still receives only view models, typed results, typed denial envelopes, and redacted transcript rows; it does not receive raw capOS caps, local cap ids, endpoint owner handles, result-cap slots, process handles, or proxy table positions. Denials remain DTO/domain results at the browser boundary, and transport disconnects keep the existing reconnect-required mapping. This proves backend proxy semantics over the DTO transport, not live standard capnp-rpc support inside capOS or on the guest gateway wire.

  • The capOS SDK transitional RemoteTransport now uses the same trusted host DTO backend as a host-side std transport for shared typed clients. The first proof maps a forwarded system_info cap obtained through CapSetGet to a synthetic host-side cap id and drives SystemInfoClient::motd_wait through the current systemMotd DTO. This is still backend proxying over the length-prefixed DTO gateway, not live guest-wire capnp-rpc.

  • make run starts the remote-session CapSet gateway in the default manifest and forwards guest port 2327 to a host-local loopback port. The helper prefers 127.0.0.1:2327 but selects a free fallback when another QEMU run or developer process already occupies the port, unless the port is explicitly configured.

  • make run-remote-session-capset-interop boots a focused manifest, runs a Linux Rust host client, authenticates as the configured operator by default, lists the broker-shaped remote CapSet, calls session, system_info, and the first endpoint-backed chat service through a per-session worker proxy, proves wrong-interface/unknown/stale denials, and records a redacted transcript.

  • make run-remote-session-adventure-interop uses a focused manifest with the Adventure server, companion NPC binaries, and remote-session-adventure-worker embedded. The operator client launches the Adventure graph, calls adventureStatus, adventureLook, adventureInventory, and the first mutable adventureGo(direction) DTO plus bounded adventureTake(item), adventureUse(item), and adventureDrop(item) controls, proves stale failure after logout, and preserves the same transcript authority-leak checks.

  • RemoteAuthMethod advertises password and anonymous as enabled methods plus disabled public-key, OIDC, and passkey/WebAuthn entries so the protocol and client are not password-shaped.

  • The capOS gateway uses manifest-scoped TcpListenAuthority on guest port 2327, plus SessionManager and AuthorityBroker. It does not receive raw NetworkManager, TcpListener, or TcpSocket authority, and the manifest does not grant service endpoint caps directly to the gateway. The gateway asks the broker for a narrower remote-client bundle, exposes broker-held service endpoints such as adventure and chat as remote CapSet descriptors, and starts the first chat endpoint proxy through a session-bound worker when the client calls chatSend. Adventure status, look, inventory, bounded go(direction), and bounded take/drop/use item calls now have matching service-specific worker/client slices after the Adventure graph is launched. Other mutable Adventure methods and Paperclips direct methods still wait for their service-specific worker/client slices. Login source metadata is derived by the gateway from the accepted socket and a gateway-generated connection event id rather than from client-supplied fields.

  • The host Rust client crate is UI-neutral and can back a CLI, native GUI, Tauri backend, or trusted web gateway.

  • A first trusted local web bridge now exists as remote-session-ui. It serves a loopback-only browser UI whose Rust backend holds the TCP connection and remote session state. make run-remote-session-capset-ui boots the focused gateway-only fixture, drives every visible button in the browser UI, and captures a screenshot plus a redacted transcript. The current web UI uses a dedicated full-window sign-in view with compact endpoint/auth controls and no full persistent technical header. Login includes a visible username field that is empty by default – the bridge does not pre-fill from CAPOS_REMOTE_SESSION_USER, host USER, or any other host-side identity hint, because pre-filling would leak operator/account hints to anything observing the page before authentication. The browser sends only username/password for password login; operator and other resource-profile names are not user-typed system details. For the current legacy DTO protocol, the trusted Rust backend maps an omitted password profile to the default operator profile before calling the gateway; gateway-side profile policy/picker support remains future work for manifests with multiple user-meaningful choices. Authenticated users land in a Services-first SPA workspace with Services, CapSet, Diagnostics, Transcript, and Session views rather than seeing every technical panel at once. The UI smoke tracks visible buttons across login and workspace states and fails when any visible button is not exercised.

  • The current UI slice makes Services the task-oriented SPA action hub for the default-manifest service surface. It should use the catalog and launcher view models to show runnable profiles, required grants, launch status, denials, and generic/simple service panels without moving capOS authority into browser JavaScript.

  • A read-only DTO service catalog now advertises currently available remote DTO services (session, system_info) plus backend-held endpoint services such as chat and adventure when the broker returns them for the authenticated profile. A companion launcher catalog describes service-runner profiles, required grants, and exported service descriptors. Adventure is the active default-manifest launch profile; Paperclips remains a future profile until its authoritative server path is available to the default remote session. The catalogs are browser-safe view models only: no raw ProcessSpawner, process handle, endpoint owner, local cap id, or result-cap slot is exposed.

  • The launch DTO/probe slice is complete. It exposes the remote-safe serviceLaunch request/status path for cataloged profiles. The request carries only a profile id plus explicit grant names; the status reports support state, accepted grant names, a message, and exported or planned service descriptors. The completed probe contract does not call spawn, create/own endpoint receivers, return process handles, or attach new service caps to the remote CapSet.

  • The current Adventure serviceLaunch slice implements the actual restricted backend launch for the default make run manifest. The trusted backend/gateway starts adventure-server plus simple NPC companion processes through an approved service-runner profile and attaches or retains backend-held descriptors/caps for the Adventure/chat-facing services. Browser JavaScript still receives only view models, launch status, service descriptors, denial diagnostics, and typed results. Real direct Chat.send now runs through the first per-session worker/proxy proof; Adventure status, look, inventory, bounded go(direction), and bounded take/drop/use item actions use the same pattern after launch, while richer Adventure controls remain later client layers over the same backend-held capability boundary.

  • The launch-denial proof is implemented for the currently exposed remote gateway paths. Focused CLI and UI QEMU harnesses drive operator missing-grant, wrong-interface, and disallowed-binary serviceLaunch denials; the CLI QEMU harness also drives stale-session and anonymous/no-runner serviceLaunch denials. Smoke checks require explicit error codes/messages, backend teardown, no Adventure server or companion process spawn in the denial-only fixture, and no raw process-handle, endpoint-owner, local-cap, result-cap, capability-manager, process-spawner, terminal-authority, or network-authority markers in browser-visible envelopes, UI reports, or redacted transcripts. The separate run-remote-session-adventure-interop fixture embeds the Adventure binaries, requires the Adventure process graph to spawn, and verifies direct Adventure.status, Adventure.look, Adventure.inventory, mutable Adventure.go(direction), and bounded item take/use/drop responses through the worker. Guest admission shipped on 2026-05-08 03:59 UTC as RemoteAuthMode::Guest plus the RemoteGatewayRequest.guestLogin @24 union arm; the gateway routes it through start_guest_session and the shared validate_guest_admission lib-level helper, which refuses any attempt to acquire a non-guest profile (e.g. operator, anonymous) via the guest method and any session whose minted principal is not Guest. The QEMU interop harness now exercises a guest happy-path proof and a guest-profile-mismatch denial; the RemoteErrorCode::DisabledAuthMethod path is covered through the bridge host-test layer (a manifest with no guest seed makes the kernel SessionManager.guest() return failure, which the gateway maps to that code).

  • Rust-level backend/account-store denial coverage now proves inactive accounts (disabled, locked, and recovery-only), unknown principals, and missing or retired resource profiles cannot produce remote-client bundle plans. Focused SessionManager account-selection coverage records that unknown, inactive, non-operator, or no-console-password account paths do not become password-login candidates suitable for later broker use. The live CLI QEMU gateway proof now drives failed password proof, unknown account, wrong password requested profile, and anonymous profile mismatch cases; each denied client completes as auth-denied with no session start, CapSet list/get, session info, or service-launch activity. Denied re-login clears prior per-connection gateway state plus cached host-client and web-bridge session view state instead of leaving stale authority usable after denial.

  • Kernel-backed remote logout is implemented for the DTO gateway. Each SessionManager-minted UserSession registers a kernel-private liveness cell keyed by the minted session id. Reconstructed broker and launcher SessionContext values resolve that existing cell and fail closed if it is absent or logged out; they do not create fresh live state from SessionInfo bytes. Explicit remote logout calls UserSession.logout, and connection teardown logs out the owned live remote session before dropping the backend session cap. UserSession.info, session-bound SystemInfo, endpoint call admission, and normal service-cap dispatch go stale after logout; UserSession.auditContext remains available for audit attribution. Endpoint returns now recheck the caller session at the return commit point: if the caller logged out, expired, or otherwise went stale after admission, the kernel rolls back prepared result-cap move sources, cancels the in-flight call instead of restoring it, posts an invoke-failed caller completion when the caller CQ can accept it, and rejects the server RETURN without copying result bytes, application-exception payloads, result-cap records, or returned caps into the stale caller.

  • Gateway idle-disconnect bug fixed (operator-reported regression on the trusted web bridge). Symptom: after some time of using make remote-session-ui against make run, the next routine action – often a periodic or user-driven sessionInfo refresh – failed with gatewayDisconnected carrying the message “remote gateway closed the connection during sessionInfo; retry login to reconnect”, forcing the operator to log in again. Root cause was gateway-side: the per-frame TCP recv on the accepted remote-session socket used a 5-second timeout (WAIT_NS = 5_000_000_000) inside recv_exact / recv_frame. Routine inter-request idleness on the bridge – which is reactive, not driven by a background poller – exceeded the 5 s budget, the gateway treated the timeout as a fatal recv failure, exited the per-connection loop, ran close_remote_session_state (issuing UserSession.logout and the “remote session stale” / “connection teardown” audit lines) and dropped the TCP connection, then accepted the next host TCP attempt fresh. The bridge’s next request hit the closed socket and surfaced the disconnect through gateway_io_error. Fix: use RECV_FRAME_WAIT_NS = CAP_ENTER_WAIT_FOREVER for the per-frame recv loop. The kernel-side TCP recv waiter still resolves on data arrival, on clean peer FIN as a 0-byte completion (treated as graceful peer teardown), and on transport-level errors (treated as fatal recv failure); only the spurious 5-second idle timeout is removed. Regression test: recv_frame_wait_is_forever_to_survive_idle_remote_clients in demos/remote-session-capset-gateway/src/lib.rs pins the policy constant. A const _: () = assert!(...) in the gateway main keeps the lib constant and the runtime CAP_ENTER_WAIT_FOREVER sentinel in lockstep so the value cannot drift back to a finite timeout. The short-lived smoke harnesses (make run-remote-session-capset-interop, make run-remote-session-capset-ui) finish well within the previous 5 s budget and so did not catch this – the bug only fires under realistic interactive operator pacing. Future work: when SSH Shell Gateway lands, audit the equivalent recv-loop policy on that path before borrowing the shape from this gateway.

  • Partial-frame DoS proof closed 2026-05-07 08:37 UTC. The forever-wait fix above survives quiet remote peers but, taken alone, also lets a peer that sends a frame header and then stalls (or dribbles a few bytes per minute) keep the gateway accept loop pinned on a single connection. The gateway recv now uses a two-phase wait policy: byte 1 of an idle frame waits forever (RECV_FRAME_WAIT_NS = CAP_ENTER_WAIT_FOREVER) with up to TCP_RETRY_ATTEMPTS = 1024 EAGAIN retries, while bytes 2..N of an already-started frame use the bounded WAIT_NS = 5_000_000_000 (5 s) wait with no EAGAIN retry, and the per-frame recv-call count is capped at MAX_FRAME_COMPLETION_RECVS = 64, bounding a slow-dribble peer at roughly 5 minutes per frame before the gateway closes the connection. Proven by run_partial_frame_probe in tools/qemu-remote-session-capset-harness.sh, which opens a TCP connection, sends a 4-byte header declaring an 8192-byte payload followed by only 4096 payload bytes, and observes the gateway closing the connection within 20 seconds; the QEMU smoke (tools/qemu-remote-session-capset-smoke.sh) asserts the proof line remote-session partial-frame proof: started payload closed after bounded wait.

Default Run And Game Server Story

The default operator manifest is system.cue, layered on cue/defaults/defaults.cue. Today it boot-launches standalone init; init starts chat-server, remote-session-capset-gateway, remote-session-web-ui, and the foreground shell. The default binary catalog embeds Adventure server, Adventure NPC, Adventure client, and the terminal Paperclips binary. Adventure is not boot-started automatically, but the current remote-session slice makes the default-manifest serviceLaunch path start adventure-server plus simple NPC companions through a restricted backend service-runner profile and attach or retain backend-held Adventure/chat-facing service descriptors/caps. Paperclips launch remains future. The default remote-session gateway receives only console, scoped TCP listen authority for guest port 2327, SessionManager, AuthorityBroker, and narrowly approved backend launch authority; it does not expose raw ProcessSpawner, raw network-manager/socket authority, endpoint owner caps, process handles, local cap ids, or result-cap slots. The remote-session-web-ui service receives scoped TCP listen authority for guest port 8080, SessionManager, AuthorityBroker, console, and the read-only system manual cap. make run forwards guest port 8080 to a loopback host port and prints remote self-served UI: tcp 127.0.0.1 <port> -> guest :8080 so the operator can open the self-served UI in a browser directly from the default operator run.

Current game-server proofs live in focused manifests:

  • make run-adventure uses system-adventure.cue, which starts chat-server, adventure-server, Adventure NPC companion processes, an adventure-scenario-test, and the shell. The Adventure server exports the adventure endpoint, consumes a client facet of chat, owns room/player state, and keys player access by the live caller-session reference.
  • make run-paperclips uses system-paperclips.cue, which starts paperclips-server and paperclips-proof-server services exporting PaperclipsGame endpoints, then launches the terminal paperclips client with explicit StdIO, game endpoint, timer, and optional proof_accelerator grants. The server owns generated content, game state, timer cadence, command descriptors, status snapshots, project entries, unlock checks, and game-rule mutation.

The remote UI direction is therefore not “open a terminal and type the MOTD commands.” The completed DTO/probe slice can describe and probe runnable game-server profiles without side effects. The current Adventure implementation gate is the real restricted service-runner/catalog surface for the default manifest: it starts the approved Adventure server graph and attaches or retains the capabilities those processes export or receive to the backend-held remote CapSet. The service-panel UI can expose this as launch state, descriptors, denials, and generic/simple surfaces. Chat now has the first worker-backed method proof; Adventure status, look, inventory, bounded go(direction), and bounded take/drop/use item calls have a service-specific per-session worker/client context after launch. Paperclips stays future until the server-owned Paperclips profile is available to the default remote session.

Host UI UX Direction

For the high-level synthesis of UI scope, invariants, and architecture, read docs/proposals/remote-session-capset-client-proposal.md -> “UI Scope And Architecture”. This section keeps the operator-story guidance for day-to-day UX work.

The host UI should optimize for the ordinary operator stories instead of mirroring protocol objects one-for-one:

  • Connect and sign in: start with a dedicated OS-like authentication view. The username field is visible and empty by default – the web bridge does not pre-fill from CAPOS_REMOTE_SESSION_USER, host USER, or any other host-side identity hint, because pre-filling leaks operator/account hints to anything observing the page before authentication. The CLI may take --user as an explicit operator override; the web UI does not. Endpoint/auth method controls remain available but secondary; retryable login/transport errors stay in the login view without losing the configured endpoint. Resource-profile names such as operator are not requested from the user during password login; they are filled only by the trusted Rust backend for the current legacy DTO. A gateway-side policy choice or post-auth profile picker should appear only when multiple manifest-published profiles are meaningful to the user.
  • Auth method advertising: the gateway forwards the auth methods the system supports, narrowed only by explicit manifest policy. Disabled methods stay listed and clearly marked (so the protocol is not password-shaped); the gateway does not silently hide methods the system supports.
  • Understand session health: after login, keep the active profile, principal, expiry, recent result, and logout in a Session view so common service work does not start on a protocol summary.
  • Use granted services: make Services the action hub for runnable profiles and remote-proxyable service descriptors. It should show availability, required grants, denial reasons, launch status, and generic command/status forms. When a descriptor is not directly callable yet, the panel should say so instead of implying method success. Service-specific rich clients (real Chat panel, Adventure rich client, Paperclips client, future agent-shell services) layer on top of the same backend-held caps.
  • Terminal panels are allowed when granted: the CapSet UI is not defined as a terminal emulator and works without one, but when the broker grants a TerminalSession cap (for native shell, POSIX shell, or any StdIO-based service expecting a terminal on the other side), the UI may host a terminal panel for that cap. Terminal bytes flow through a backend-held TerminalSession; the browser renders frames it receives, never opens a raw shell or holds a ProcessSpawner.
  • Agent-shell-exposed capabilities are first-class: the CapSet UI does not contain the LLM loop, model client, or tool-execution runner, but agent-shell-exposed services (e.g. “send message to running agent”, “approve queued action”, “audio stream to/from agent”) are services the broker can bundle, exposed through the same per-session worker / typed view-model pattern as Chat or Adventure. Whether some of those agent surfaces should themselves be layered on Chat rather than distinct caps is the cross-cutting refinement task tracked in docs/tasks/.
  • Inspect capabilities: keep CapSet as an explicit inspection view for users who need names, interface IDs, policies, and descriptor selection.
  • Diagnose calls: isolate low-level probes, stale-session proofs, MOTD, and raw result JSON in Diagnostics so common service use is not buried under transport details. The session-summary diff control belongs in Session/Diagnostics, not in the main Services flow.
  • Audit and export: keep transcript review/export in its own view, with redaction status visible and raw authority material absent.

Modernization should build on that navigation shape: no full persistent technical header on the login view, a compact authenticated app shell, clear loading and denial states, empty states with next actions, searchable service/capability lists, command forms generated from typed descriptors, side panels for details, keyboard-friendly controls, responsive layouts, and service-specific rich clients layered over the same backend-held capabilities. Adventure and Paperclips should eventually have rich client views, but the minimum viable UI must still expose their available server capabilities through simple generic forms first.

Service-Runner And Catalog Path

Staged path:

  1. The first reader-facing service catalog is implemented in the DTO gateway and UI. It lists available DTO calls plus service-runner profiles and exported capability descriptors for the current session.
  2. The remote-safe launch DTO/probe contract for those profiles is complete. The request names a catalog profile and explicit grants; the probe/status result reports support state, accepted grant names, a message, and planned exported descriptors. This slice is intentionally side-effect-free: it does not start a process, allocate endpoint owners, return process handles, or attach caps.
  3. The current Adventure slice implements a restricted service-runner surface behind the broker for the default make run manifest. It may use local spawn authority internally, but the remote session receives only catalog descriptors, launch requests, launch status, and returned remote capability descriptors. Raw ProcessSpawner, process owner handles, endpoint owner caps, local cap IDs, result-cap slots, and process handles stay inside capOS or the trusted backend.
  4. The CLI and remote-session-ui backend can call the runner and attach or retain the returned backend-held descriptors/caps. Browser JavaScript receives view models, launch forms, progress, denials, command/status descriptors, and call results for methods that are actually callable through the current DTO path; it does not receive raw capOS capability objects.
  5. Start with simple generic panels. Adventure now exposes launch plus status/look/inventory, bounded mutable go(direction), and simple bounded take/drop/use item controls over the backend-held Adventure endpoint and chat-facing descriptors. The first direct chat call and these Adventure controls run through session-bound worker proxies; broader Adventure verbs and Paperclips calls still need service-specific worker/client layers before richer clients sit on top of the same backend-held CapSet. Paperclips can expose PaperclipsGame.commands, status, projects, and command once the server profile is available to the default remote session.
  6. Keep hardening the repo-local Tauri wrapper. The current make remote-session-tauri command policy-checks, dependency-checks, or launches a scaffolded desktop wrapper over the same Rust/backend authority boundary as the web bridge and uses the printed make run remote CapSet port. The policy check fails closed if bundling, window URLs, default capabilities, app-specific invoke handlers, Tauri commands, or tauri-plugin-* usage drift from the reviewed check/dev scaffold. Distributable packaging and desktop automation remain future polish.

Remaining major gaps:

  • Continue expanding the first host UI beyond the current session, system_info, and worker-backed chat proof while still reusing the Rust backend boundary and DTO gateway. A later Tauri package can wrap the same backend when the goal is a distributable desktop app.
  • The first richer service client is a session-summary diff. The pure Rust helper lives in tools/remote-session-client/src/session_diff.rs and compares two snapshots of the remote session view (CapSet plus SessionInfoSummary) into CapSetDiff / SessionSummaryFieldDiff records keyed on (name, interface_id) and visible session fields. The trusted web bridge stores the raw snapshots backend-side and exposes /api/call/session-diff-refresh, which returns a redacted SessionSummaryDiffVm. The browser renders the diff in a dedicated “Last refresh diff” pane on the Session view, with the new session-diff-refresh button exercised twice by the focused UI smoke (first call captures a baseline with hasBaseline=false; the second call reports the diff against the previous snapshot with hasBaseline=true). Backend host tests cover the baseline + no-change path and an added-cap + expiry-change path.
  • Make the remote UI capable of discovering and presenting the full remote-proxyable functionality granted to the authenticated session in the default make run manifest. The first pass may use generic/simple panels for demo services such as chat, Adventure, and Paperclips, but users should not have to switch tools merely because a capability is part of their default remote session bundle. Rich game-specific clients are a later UI layer on top of the same backend-held CapSet, not a reason to narrow the first UI to only session and system_info.
  • Extend the implemented Adventure service-runner slice beyond the first mutable control. The current host backend can start the allowed default-manifest Adventure server graph through the restricted launch path, discover the resulting descriptor in the backend-held remote CapSet, and call Adventure.status, Adventure.look, Adventure.inventory, bounded Adventure.go(direction), and bounded Adventure.take/Adventure.drop/ Adventure.use through a per-session worker. Next work is broader Adventure command coverage and richer game-specific clients on top of that same worker-held boundary.
  • Keep Paperclips launch future until the authoritative Paperclips server profile is available to the default remote session. The UI may show Paperclips as planned/not remote-proxyable rather than claiming launch support.
  • Replace the DTO transport with standard capnp-rpc framing and live typed remote proxy objects.
  • Expand auth adapters beyond password and anonymous.
  • Use the generalized per-session worker lifecycle manager for future endpoint-backed services. Chat send and Adventure status/look/ inventory/go(direction)/take/drop/use now share worker spawn validation, logout/close teardown, graceful shutdown, forced termination fallback, and release flushing; broader Adventure controls, Paperclips worker/client protocol, and live-proxy lifecycle hardening remain future work.
  • Gateway response writes now fail closed per connection: a send-side host disconnect or invalid send byte count breaks the connection loop, then drops backend-held session state and terminates any session-started Adventure processes instead of aborting the gateway process. Direct Chat.send is no longer called from the gateway process; it runs through the first session-bound worker proxy. Adventure status, look, inventory, bounded go(direction), and bounded take/drop/use item methods now receive the same treatment; broader Adventure methods remain later.
  • Add resource limits, TLS/mTLS, renewal, revocation, and UI-composition surfaces.

Design Constraints

  • Do not serialize local capOS cap IDs, cap-table slots, endpoint receiver selectors, endpoint generations, result-cap indexes, server cookies, or global session identifiers as portable authority.
  • Do not treat password auth as the only remote path. The schema and docs must leave room for public key, OIDC, passkey/WebAuthn, mTLS, guest/anonymous, and service/workload admission.
  • Keep the session-bound invocation invariant. Remote post-auth calls run under the remote session’s capOS worker context or an equivalent reviewed context.
  • Keep default remote bundles narrower than operator shell bundles.
  • Keep browser JavaScript and model providers away from raw capOS caps. Browser and agent paths use gateway-side tool/cap proxies.
  • Keep the first CapSet UI distinct from WebShell. It can inspect and call currently implemented remote session capabilities without launching a shell, terminal emulator, shell-runner policy engine, or model agent.
  • Treat raw ProcessSpawner and browser-held capOS capabilities as explicit non-goals for the remote UI path. A service-runner may hold launch authority inside capOS, but browser and webview code see only catalog entries, launch forms, service descriptors, view models, and typed results.
  • Service launch from the remote UI must go through a restricted, session-bound launcher or broker service-runner profile. The browser must not receive raw process handles, local cap ids, endpoint owner handles, or a raw ProcessSpawner; it receives only view models, launch plans, service descriptors, and typed call forms/results.
  • Keep UI composition declarative and bounded. A capOS service may propose layout/theme/view updates only through an explicit UI capability; it cannot inject arbitrary JavaScript/CSS, spoof trusted chrome, or persist UI state without a settings/profile cap.
  • Keep listener and transport authority scoped; no raw NetworkManager or broad ProcessSpawner in the long-term gateway.
  • Preserve the error split: transport/CQE errors, capability infrastructure exceptions, and domain result unions remain distinct.

The planning update that introduced this backlog aligned these documents:

  • remote-session-capset-client-proposal.md: owning design.
  • shell-proposal.md: remote clients are peer clients of broker-issued bundles, not shell transports.
  • boot-to-shell-proposal.md: web/remote login feeds the same session manager and broker, and must support non-password admission.
  • ssh-shell-proposal.md: SSH remains a terminal transport, while public-key auth records can also feed non-shell remote clients through a domain-separated protocol.
  • user-identity-and-policy-proposal.md: broker bundles need a remote-client profile shape in addition to shell bundles.
  • browser-capability-proposal.md, llm-and-agent-proposal.md, and interactive-command-surface-proposal.md: UI composition, browser/agent front ends, and typed command surfaces remain capability-mediated rather than raw browser or shell authority.
  • roadmap.md and docs/tasks/README.md: the old chat-only interop item is reframed as remote session CapSet interop without changing the selected threading milestone.

Grounding Files

Relevant design and research grounding:

  • docs/proposals/session-bound-invocation-context-proposal.md
  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/proposals/boot-to-shell-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/ssh-shell-proposal.md
  • docs/proposals/certificates-and-tls-proposal.md
  • docs/proposals/oidc-and-oauth2-proposal.md
  • docs/proposals/capos-service-proposal.md
  • docs/proposals/interactive-command-surface-proposal.md
  • docs/proposals/browser-capability-proposal.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/research/cloudflare-capnproto-workers.md
  • docs/research/spritely-captp-ocapn.md

Ordered Gates

Gate 0: Rename The Target

  • Rename the planning target from chat interop to remote session CapSet interop while preserving the existing chat proof as a historical transport slice.
  • Add docs that say the remote client is a regular host app and does not use capos-rt, the capOS ring page, or the local CapSet page.
  • Keep the existing make run-capnp-chat-interop target until a successor proof exists; do not remove useful evidence.

Gate 1: Host Rust Cap’n Proto RPC Client

  • Add a host-built Rust client crate or tool using generated schema bindings. The first slice uses length-prefixed schema-framed Cap’n Proto DTOs; standard capnp-rpc remains open.
  • Keep the client library UI-neutral so it can back a CLI harness, a native GUI, or a Tauri backend without changing the capOS protocol.
  • Connect through QEMU host forwarding to the capOS gateway.
  • Verify schema version/interface ID mismatches fail with explicit diagnostics.
  • Add a host-side transcript that records successful connect, bootstrap, session info, CapSet list, calls, denials, and logout.

Gate 1A: First Host UI Client

  • Build a thin Tauri or trusted-local-web UI over tools/remote-session-client, without changing the capOS gateway protocol. Prefer Tauri when the goal is a distributable desktop app whose Rust backend can hold the remote session; prefer a local web bridge when browser iteration speed matters more than app packaging.
  • Document and support the repo-local operator paths: make run for capOS/QEMU, cargo run --manifest-path tools/remote-session-client/Cargo.toml --target x86_64-unknown-linux-gnu --bin remote-session-client -- --host 127.0.0.1 --port <printed-port> for the CLI, and CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui for the trusted local web bridge. The Makefile target wraps the same remote-session-ui Rust backend and defaults to http://127.0.0.1:3337/. The Tauri wrapper layers over the same backend, not a separate authority model.
  • Add a bounded repo-local Tauri wrapper command: CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri. It checks Tauri CLI and Linux build prerequisites, including xdo and openssl pkg-config modules, reports dependency/scaffold status, and either runs a deterministic wrapper check or launches cargo tauri dev when requested. Missing prerequisites fail with explicit diagnostics and point operators back to make remote-session-ui.
  • Add the actual repo-local Tauri wrapper over the existing backend. The wrapper shares the same tools/remote-session-client backend boundary by loading the loopback remote-session-ui surface; webview code receives view models and user events, not replayable capOS handles. Distributable package bundling remains disabled until the sidecar/backend lifecycle is reviewed.
  • Add a policy-only Tauri wrapper preflight: CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh. The guardrail proves the current wrapper remains check/dev only: bundle.active=false, the Tauri devUrl and single main window URL stay pinned to http://127.0.0.1:3337, default permissions stay exactly ["core:default"], and app-specific invoke_handler, generate_handler, #[tauri::command], and tauri-plugin-* drift is rejected. This does not prove distributable packaging or desktop automation.
  • Keep capOS authority in the backend. Browser/webview JavaScript receives session summaries, auth-method descriptors, CapSet entries, capability call forms, transcript rows, and denial diagnostics, but no replayable capOS handles.
  • Implement the first UI views for endpoint configuration, auth-method inventory, password/anonymous login, session summary, CapSet list/get, sessionInfo, systemMotd, denied-chat probe, logout, stale-call proof, and redacted transcript export. The first web bridge now uses a dedicated full-window sign-in view and authenticated SPA navigation so the common workflow is not a single technical page.
  • Implement selectable remote UI themes based on the committed concept assets in tools/remote-session-client/ui/assets/: a space login theme using bg-space.2k.webp and design-mockup-space-login.webp, a mountain login theme using bg-mountain.2k.webp and design-mockup-mountain-login.webp, a light login theme using design-mockup-light-login.webp, and a hacker terminal theme using design-mockup-operator-console.webp. The hacker theme should use a black/deep-teal background, phosphor-green monospace typography, thin terminal-grid borders, subdued binary side texture, bracketed primary action text, and a footer status line such as “Secure connection established” with a lock indicator, without keeping a persistent global header above the login or workspace views. Treat the mockups as visual references, not runtime screenshots. The implementation should expose a bounded theme selector in the trusted local web UI, persist the selected theme locally, keep browser JavaScript limited to UI state and backend view models, preserve the existing authenticated SPA workflow, and prove contrast, focus, small-screen layout, and screenshot coverage for every theme. The trusted web UI now serves only the committed theme assets by fixed name, stores theme choice in browser-local UI state, drives the selector in both login and workspace modes, and captures desktop plus mobile screenshots for both login and workspace views of each theme in the focused UI smoke. The login view is styled as a focused OS-style sign-in surface without a persistent header; endpoint configuration, auth method inventory, anonymous login, and theme choice remain accessible as compact secondary controls.
  • Ensure the UI discovers every granted remote CapSet entry in the default make run operator session and offers at least a generic/simple surface for each remote-proxyable service exposed by that bundle. Call forms are only for methods the current DTO/proxy path can actually invoke. The first endpoint-backed chat call is now callable through the session-bound worker proxy, and Adventure status, look, inventory, bounded go(direction), and bounded take/drop/use item actions are callable after the Adventure service graph is launched. Game surfaces can start with a simple chat send/probe form, a generic Adventure panel when the service is callable remotely, and Paperclips status/command panels when its server capabilities are exposed. Rich game clients remain a later layer over those same capability bindings. The gateway now lists broker-held endpoint descriptors from service_endpoints, so operator sessions include session, system_info, adventure, and chat; the focused QEMU proof asserts those CapSet entries and the web UI exposes them through CapSet and Services surfaces.
  • Add a task-oriented “Services” view for default-manifest operator sessions: list broker/launcher-advertised runnable services, show which grants are required, start allowed game server processes through a remote-safe restricted launcher/service-runner API, and attach or retain the returned exported descriptors/caps in the backend-held remote CapSet. The first Adventure flow should be able to start adventure-server plus required NPC/server companion processes with their manifest-shaped grants, then show the resulting Adventure/chat descriptors through generic/simple panels. Chat method success now runs under the authenticated session through the first per-session worker proxy; direct Adventure status, look, and inventory now have matching service-specific worker/client paths, and bounded mutable go(direction) plus take/drop/use item paths use the same worker. Broader Adventure method success remains later work. The Paperclips flow may stay simple until the authoritative Paperclips server backlog lands, but the UI direction is server-owned game state and remotely callable game capabilities, not terminal text scraping. The web bridge refreshes CapSet, service catalog, and launcher catalog view models after a successful serviceLaunch so the SPA reflects post-launch descriptors immediately; the focused UI fixture still treats missing Adventure binaries as an explicit planned/denied state.
  • Add browser/UI automation for the chosen client: start a gateway-only QEMU fixture, such as run-remote-session-capset-interop-vm with explicit hostfwd/pid/log handling or a new focused UI fixture target, then drive login, CapSet inspection, capability calls, denials, logout, and transcript redaction, and capture screenshots or traces for review. Do not drive the UI against make run-remote-session-capset-interop because that wrapper starts the scripted CLI client and shuts QEMU down.
  • Keep WebShell-specific work out of this gate. No terminal emulator, shell process delegation, shell-runner policy, agent tool execution, or UI-composition cap is required for the first CapSet UI.

Gate 1B: Self-Served capOS Web UI

Gate 1A is host-served bridge work: make remote-session-ui serves the browser UI from the trusted host Rust backend while capOS exposes the remote CapSet gateway over QEMU host forwarding. Gate 1B adds the first self-served capOS web UI proof: a capOS-side service serves the browser UI entry point and same-origin backend path itself.

Task records:

  • remote-session-self-served-web-ui-design selected the capOS-side hosting boundary, listener authority, asset source, session/admission path, asset integrity/update story, and browser-safe view model boundary.
  • remote-session-self-served-web-ui implemented the first self-served proof with a focused immutable UI shell and browser automation against the capOS-served origin.
  • remote-session-self-served-web-ui-default-run integrated the self-served path into ordinary make run. The default manifest now auto-starts remote-session-web-ui and make run prints remote self-served UI: tcp 127.0.0.1 <port> -> guest :8080. Completed 2026-05-14 09:07 UTC.
  • remote-session-self-served-full-ui-bundle replaces the immutable proof shell with the reviewed fixed-name boot-resource UI bundle. The capOS service now serves /, /app.js, /styles.css, /feature-flags.js, /themes/retro.css, the icon/background/logo assets, /ui-config.js, and /bundle/manifest.json from the capOS-owned origin with explicit content types, no directory traversal, and a build-time digest pinned in demos/remote-session-web-ui/ui-bundle.digest. The focused proof verifies every served asset byte-for-byte against the manifest and then drives the operator workspace views, logout, stale failure, transcript redaction, and system-manual view models.
  • cloud-prod-remote-session-web-ui-l4-local-proof consumed the landed Phase C userspace L4 and DHCP/IPv4 config proofs. It proves remote-session-web-ui through the non-qemu cloudboot socket path locally with the full fixed-name UI bundle, password login, backend-held SystemInfo, logout/stale failure, manual viewer, and browser-boundary checks. Completed 2026-06-09 01:49 UTC (ff769a5c) as local QEMU/cloudboot evidence only; it does not claim private GCE reachability, public ingress, TLS, or production browser readiness.
  • cloud-prod-network-stack-web-ui-slow-client-bounds hardened the userspace network-stack server that backs the L4 Web UI listener (Review C medium: a single-writer accept loop and fatal recv/accept/send budgets let one idle or held-open unauthenticated client crash the network stack or block every other connection). The server now keeps a bounded multi-socket listen backlog, hands out only data-ready connections (idle held-open ones are left for the reaper), reaps idle/half-closed backlog connections after a short idle window, and treats every budget expiry as non-fatal (abandon the offending connection and re-arm instead of exiting). make run-cloud-prod-remote-session-web-ui-l4 adds a slow-client bound proof in two phases: several idle held-open clients that send no request bytes (kept out of the serving path by the reaper) and one partial-request (Slowloris) client that sends incomplete headers then stalls (served, then abandoned when the recv budget expires). In both phases a concurrent /healthz keeps completing and the server survives, and the kernel log shows the backlog config, idle reaping, and the recv-budget abandon. Serving is still serial, so a data-ready partial-request client adds a bounded head-of-line delay (one recv budget) to the next connection rather than blocking it indefinitely; that bound is the accepted limit for this research demo. This is the server-side prerequisite for remote-session-web-ui-connection-bounds, which layers per-connection deadlines in the remote-session-web-ui RPC client on top.
  • remote-session-web-ui-connection-bounds completed the client side of that boundary (Review C medium). The remote-session-web-ui service replaced its retry-count spin budgets with per-connection wall-clock deadlines on the monotonic clock: a request-read deadline (6 s, anchored at accept, covering request line, headers, and body together) and a response-send deadline (30 s, anchored conservatively at request dispatch, before routing), neither of which resets on byte progress, so total accept-loop occupancy per connection is bounded regardless of client pacing. Deadline expiry abandons only the offending connection fail-closed with an explicit console evidence line. This closes the case the server-side per-call budgets cannot see: a drip-feed client that delivers one header byte at a time keeps every server recv budget fresh while never completing the request. make run-cloud-prod-remote-session-web-ui-l4 adds a third slow-client phase driving exactly that drip-feed client and asserts the web-ui abandons it at the read deadline and /healthz still completes afterwards, alongside the existing held-open-vs-concurrent-/healthz and Slowloris phases. Connection admission limits (the bounded listen backlog and idle reaping, which cap all pre-login connections) remain server-owned in the network-stack listener layer.
  • remote-session-web-ui-session-hardening closed Review C high (predictable capos_remote_session tokens and missing browser-session enforcement). The remote-session-web-ui service now mints an unpredictable, opaque server-side session id (one-way SHA-256 over the kernel-CSPRNG backend session id, base64url, never the accept counter) and a domain-separated per-session double-submit CSRF token; rotates both on login, re-login, and logout (clearing the browser cookies and failing closed on a replayed rotated-out id); enforces idle and absolute lifetime bounds before request dispatch; validates Host (DNS-rebinding) and Origin and requires the X-CSRF-Token double-submit cookie/header on state-changing requests; and marks the session cookie Secure when X-Forwarded-Proto: https reports HTTPS ingress (the plaintext loopback proof stays explicitly non-Secure). This aligns the in-capOS server with the committed operator-bundle and host-bridge CSRF contract (tools/remote-session-client/{ui/app.js,src/web_security.rs}). make run-cloud-prod-remote-session-web-ui-l4 extends the self-served proof with stale-token, CSRF (missing/mismatch), Origin (missing/cross-site), Host, and idle/absolute expiry denial paths plus a login/re-login rotation check, all failing closed before any backend-held capability call. Local QEMU/cloudboot evidence only; it does not claim private GCE reachability, public ingress, or TLS.
  • The public-ingress browser hardening set is done on the same make run-cloud-prod-remote-session-web-ui-l4 gate (all local QEMU/cloudboot evidence, no public exposure): in-guest login peer-gate and failure-backoff hardening, the single public-origin policy (one manifest-granted public_origin.<host> marker fixes the only accepted public origin on the trusted forwarded-scheme HTTPS path), the IAP-aware SameSite cookie policy (Strict by default, Lax only under the manifest IAP marker with a cross-site GET provenance gate), the JSON content-type guard (typed 415 on every state-changing /api/* POST before backend dispatch), the security response headers and strict CSP (uniform header set plus a no-unsafe-inline CSP proved violation-free in a real Chromium), the GFE-range-pinned forwarded-scheme trust (X-Forwarded-Proto authoritative only from 130.211.0.0/22 / 35.191.0.0/16, implementing the firewall-bounded forwarded-scheme trust rule below), and the public /healthz health-check contract (bounded anonymous JSON body, no session state, Host-allowlist exempt for by-IP provider health checkers).
  • Two browser-boundary local proofs remain dispatchable task records under docs/tasks/, not landed: the public-deployment loopback gate (reject loopback Host/Origin/Referer acceptance and loopback-shaped source hints under the configured public-origin load-balancer posture while preserving the local QEMU loopback proof) and the consolidated browser-visible forbidden-marker matrix proof across success, denial, health, manual, and error response classes, including hostile browser-supplied authority fields. Both extend make run-cloud-prod-remote-session-web-ui-l4 locally and do not authorize private GCE reachability or public exposure.
  • cloud-gce-legacy-virtio-webui-serving-local-proof closed the legacy-virtio serving gap locally (2026-06-11): a persistent kernel-brokered legacy virtio 0.9 runtime backs the typed Nic cap, and make run-cloud-gce-legacy-virtio-webui-serving proves a host HTTP peer fetching the byte-verified UI bundle under disable-modern=on. Local serving evidence for the GCE NIC shape only, not live GCE reachability.
  • The no-spend provider-harness gates are done as recording-stub fixture evidence — provider CLIs resolve only to the stubs, with no real provider invocation or mutation on any path: the private-proof harness --preflight-only mode, the private and public proof-evidence validators, the public ingress resource plan gate, the journal-driven teardown engine, and the provider-command allowlist gate. They bound the future private/public runs’ evidence, resource graph, teardown, and provider-command surfaces; they are not reachability, exposure, or spend authorization. A matching public-harness no-spend preflight task is dispatchable future work, not landed.
  • cloud-gce-private-self-hosted-webui-proof follows the local Web UI L4 proof and proves private GCE reachability over the live NIC without public IP or public firewall exposure. It remains on hold on missing firewall IAM against GCE default-deny ingress and on per-run billable authorization; the legacy-virtio serving gap is closed locally.
  • cloud-gce-public-webui-ingress-tls-policy-design selected the public ingress, TLS/certificate, firewall, browser-session, and teardown policy before exposure work starts (see “Selected public ingress and TLS policy” below).
  • cloud-gce-public-self-hosted-webui-ingress-tls is blocked on the private proof and on explicit public-exposure approval. With the policy design closed, it is the first public operator-access step, builds against the selected provider-terminated-HTTPS policy, and does not permit raw public HTTP as the closeout proof. The local plan/teardown/evidence/allowlist gates above bound this future run without authorizing it.

IPv6 is a separate network-stack capability lane, not a Gate 1B blocker for the first public Web UI proof. The IPv4 path above still owns the first useful GCE Web UI closeout; the IPv6 scope decision cloud-prod-ipv6-architecture-status-grounding is done and the lane is tracked in Hardware, Boot, and Storage. The broader network usability lane is Network Usability and Post-smoltcp: DNS resolver, POSIX getaddrinfo, ping/ping6, packet tracing, socket readiness, and transport policy are follow-on usability work. They do not block Gate 1B or the first IPv4 public Web UI proof unless a later ingress policy explicitly promotes one; the local DHCP/IPv4 configuration gate is done and now feeds the Web UI L4 and private GCE proof gates.

Selected public ingress and TLS policy:

  • The first public exposure of remote-session-web-ui on GCE terminates HTTPS at a GCP external Application Load Balancer (Google front end, provider-managed certificate). capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS origin, and capOS never holds the TLS private key.
  • This is the bootstrap shape chosen because capOS does not yet have TLS termination and private-key custody. The Phase-1 certificate verifier has landed, but TlsServerConfig, key custody, and the userspace L4 TcpSocket relocation have not landed. The ACME/Let’s Encrypt path is now decomposed in Certificates / TLS as a capability-native successor: minimal PrivateKey / KeyVault / KeySource custody, TLS client/server support, RFC 8555 account/order, scoped http-01, CertificateStore.watch renewal, and then a separate public GCE direct-termination proof with explicit public-ingress and CA authorization. That successor does not replace the provider-managed first public proof.
  • Raw public HTTP is rejected as closeout evidence; any port-80 listener is a 301 redirect to HTTPS at the load balancer and never reaches capOS.
  • Browser session rules add a single public HTTPS origin, firewall-bounded trust of the load balancer’s forwarded-scheme header, Secure/HttpOnly/SameSite session cookies, HSTS, anti-CSRF tokens with an origin check, bounded session/idle lifetime, and server-side logout — over the unchanged Gate 1B view-model boundary.
  • Firewall ingress to the UI backend port is restricted to Google load-balancer/health-check ranges (130.211.0.0/22, 35.191.0.0/16) and, if IAP fronts the door, the IAP range (35.235.240.0/20); never 0.0.0.0/0.
  • The full firewall, certificate-custody, evidence, and teardown policy lives in the “Public Web UI Ingress Policy” section of Cloud Deployment, and the TLS-termination/key-custody decision in the “Bootstrap TLS for the First Public GCE Web UI” section of Certificates and TLS.

Selected design:

  • Add a capOS userspace service named remote-session-web-ui for the first proof. It is a sibling of remote-session-capset-gateway, not a replacement for the gateway and not the host remote-session-ui bridge running inside capOS. The service owns the web listener, static assets, authenticated web sessions, remote-session backend state, per-session worker proxies, and browser-facing view-model projection.
  • Static assets live as a checked-in, fixed-name UI bundle embedded in the capOS boot package and served by remote-session-web-ui. The service serves only fixed files, /bundle/manifest.json, and same-origin JSON API routes; it does not expose a general filesystem, asset directory traversal, host path, or development hot-reload surface. The full-bundle proof is remote-session-self-served-full-ui-bundle.
  • The first listener is HTTP/1.1 on a manifest-scoped TcpListenAuthority for a dedicated UI port, for example guest port 8080 under QEMU host forwarding. The service serves static GET assets and same-origin JSON API routes. WebSocket, server-sent events, and streaming terminal/media paths are later extensions that require separate per-route authority and resource bounds; the first self-served proof does not need them.
  • Manifest grants authorize the listener and backend work: scoped TcpListenAuthority for the UI port, SessionManager, AuthorityBroker, a named immutable UI asset bundle, and only the same narrow remote-client service-runner/backend-launch authority already allowed for the remote session path. The service does not receive raw NetworkManager, raw TcpListener factories, broad storage roots, raw ProcessSpawner, shell launcher authority, endpoint owner caps, or arbitrary endpoint creation authority.
  • remote-session-web-ui is the trusted backend and holds the remote session CapSet/proxy state server-side. Browser JavaScript receives only browser-safe view models, launch forms, user-event commands, typed results, denial diagnostics, and redacted transcript rows. It never receives raw capOS caps, raw ProcessSpawner, process handles, endpoint owner authority, local cap IDs, result-cap slots, session-global identifiers, remote CapSet handles, host usernames, host environment variables, host paths, or QEMU-forwarding identity hints.
  • Authentication remains gateway/session-manager shaped. The browser sends credentials or guest/anonymous intent to the capOS-served JSON endpoint; the service derives connection/source metadata from its accepted socket and its own event id, asks SessionManager for a UserSession, asks AuthorityBroker for the remote-client bundle, and projects only the disclosed session and service fields into browser-safe view models. The browser cannot choose a principal, profile, worker session context, or backend cap holder by replaying a request field.
  • Cloudboot-local authority inventory for the completed cloud-prod-remote-session-web-ui-l4-local-proof: the non-qemu proof manifest grants remote-session-web-ui only console, a scoped UI TcpListenAuthority for guest port 8080 served by the Phase C userspace network-stack path, SessionManager, AuthorityBroker, the read-only manual cap, the timer cap used by the HTTP/backend loop, and the fixed-name boot-resource UI bundle. It does not satisfy the UI listener from a kernel tcp_listen_authority source in the non-qemu cloudboot path, and does not grant raw NetworkManager, TcpListener/TcpSocket factories, broad storage roots, raw ProcessSpawner, shell launcher authority, endpoint-owner caps, arbitrary endpoint creation authority, host filesystem paths, or provider/cloud mutation authority. Backend launch/service-runner authority remains available only through the same broker-approved remote-client bundle policy described above.
  • The local cloudboot proof should assert the same browser boundary as the self-served QEMU proof while proving the different listener substrate: browser-visible envelopes, DOM state, diagnostics, transcripts, and JSON responses must not contain raw capOS caps, raw process authority, endpoint-owner authority, local cap ids, result-cap slots, NetworkManager, TcpListenAuthority, TcpListener, TcpSocket, host usernames, host environment variables, host paths, QEMU-forwarding identity hints, provider resource identifiers, public IPs, firewall rules, or TLS key material. Login/source metadata must come from the accepted socket plus a service-generated event id; browser requests cannot supply the trusted principal, profile, source address, worker-session context, or backend cap holder.
  • Expected local cloudboot proof markers are the existing service-side lines that show the narrow service capset, scoped listener, fixed-name bundle, backend-held login/session, backend-held SystemInfo call, browser-safe workspace view models, redacted transcript, backend-held manual view-model projection, and stale-call failure, followed by exactly one cloudboot-evidence: remote-session-web-ui-l4 <token> marker after all forbidden-authority and browser-visible marker checks pass. That marker is local QEMU/cloudboot evidence only; it does not prove private GCE reachability, public ingress, HTTPS/TLS custody, firewall policy, or browser production readiness.
  • Proof marker triage:
Missing or failed marker classLikely failed invariantOwning laneBlocks local Web UI L4 proof?
Narrow service capset, scoped UI listener, or trusted listener/source metadata is absent, or the listener is satisfied by the non-cloudboot qemu kernel socket pathremote-session-web-ui is not bound to the manifest-scoped TcpListenAuthority served by the Phase C userspace network-stack pathListener substrateYes. The local L4 proof cannot close without the non-qemu cloudboot listener source.
Fixed-name bundle, byte-for-byte asset, content-type, /ui-config.js, or /bundle/manifest.json marker is absent or mismatchedThe capOS-served origin is not serving the reviewed immutable boot-resource UI bundleFixed-bundle servingYes. A health-only service marker is not a self-served Web UI proof.
Backend-held login/session, SystemInfo, manual view-model, or workspace view-model marker is absent, or a browser request supplies trusted principal/source/backend holder fieldsThe service is not deriving authority from server-side session state and broker-approved backend capsAuthenticated backend callYes. The proof must exercise at least one backend-held cap path after login.
Logout/stale-call failure marker is absent, stale requests keep dispatching, or result-cap/session table identifiers leak into client-visible stateBackend session teardown does not fail closed before later public or provider promotionStale/logout failureYes. The first local L4 proof needs the stale-call denial; later session-hardening work may add stricter lifetime controls.
Browser-visible envelopes, DOM, diagnostics, transcripts, or JSON contain raw caps, cap ids, process/socket/network authority, host identity, provider resource ids, public IPs, firewall rules, or TLS materialThe browser-safe view-model boundary leaked trusted authority or out-of-scope provider/exposure stateBrowser-visible forbidden marker leakYes for local-service leaks. Provider, public-ingress, and TLS material also route to their later proof lanes before promotion.
All service-side markers pass but the final cloudboot-evidence: remote-session-web-ui-l4 <token> marker is missing, duplicated, or emitted before forbidden-authority checks finishThe harness has not produced a single closeout marker tied to the completed local cloudboot proofEvidence-class boundaryYes. The local proof is incomplete without exactly one final local L4 marker.
Private GCE probe, public HTTPS, DNS, certificate, firewall, load-balancer, or operator-exposure markers are absentThe run did not attempt a later evidence class, or correctly kept provider/public exposure out of the local proofEvidence-class boundaryNo. Those belong to cloud-gce-private-self-hosted-webui-proof or the on-hold public ingress/TLS task, not the local L4 closeout.
  • The first implementation gate was remote-session-self-served-web-ui: boot a focused manifest, load the UI from the capOS-owned HTTP endpoint, log in, exercise at least one granted capability call through the service-held backend state, prove logout/stale failure remains closed, and run browser automation against that capOS-served origin. That pre-Phase-C target used the qemu-only kernel tcp_listen_authority socket owner and is no longer current selected- milestone evidence after the kernel L4 owner was retired. The replacement gate is make run-cloud-prod-remote-session-web-ui-l4, owned by cloud-prod-remote-session-web-ui-l4-local-proof.
  • Validation targets: make run-cloud-prod-remote-session-web-ui-l4 clearly distinguishes the self-served origin from the host development bridge and asserts forbidden browser-visible markers are absent. The current make remote-session-ui bridge remains a development tool, and make run-remote-session-capset-ui keeps its existing host-bridge smoke coverage while the self-served path evolves. Ordinary make run remains a remote CapSet forwarding path, not a self-served UI proof, unless the default-run integration task closes with reviewed manifest, forwarding, and operator-instruction changes.
  • Rollback path: remove the self-served focused manifest/target and stop granting remote-session-web-ui its UI TcpListenAuthority and asset bundle, while leaving the host-served make remote-session-ui path and the remote-session CapSet gateway unchanged. Because the static assets are boot-package resources and the listener is manifest-granted, rollback is a manifest/build-target selection change rather than a downgrade of the gateway authority model.

Acceptance for the implementation gate:

  • The browser retrieves UI assets or the UI backend entry point from a capOS-owned service path, not from the host remote-session-ui development bridge.
  • Browser JavaScript receives browser-safe view models and user-event commands only; raw caps, raw ProcessSpawner, endpoint owner authority, result-cap slots, and host-local identity hints stay out of browser-visible state.
  • The proof uses browser automation against the self-served path and exercises login plus at least one granted capability call.

Gate 2: Gateway Bootstrap And Auth Method Inventory

  • Add RemoteSessionGateway.authMethods and a policy-shaped method list.
  • Support explicit denial for disabled methods so the harness can prove password-only assumptions are not baked into the protocol.
  • Record gateway-derived source metadata, method kind, requested profile, and protocol binding in audit-shaped output.
  • Keep first-remote-client setup disabled unless a manifest explicitly grants a local setup authority path.

Gate 3: First Auth Adapter

  • Choose one bounded first adapter for the proof. Acceptable first choices are public-key fixture auth, password via existing SessionManager.login under explicit policy, or guest/anonymous admission under a narrow profile. Do not design the schema as password-only.
  • Map the accepted proof into SessionManager and mint a real UserSession.
  • Add Rust-level backend/account-store proof coverage that disabled, locked, and recovery-only accounts, unknown principals, and missing or retired resource profiles cannot yield remote-client bundle plans, and that SessionManager password-account selection rejects unknown or inactive account records before a UserSession can be minted.
  • Prove failed proof, wrong requested profile, and unknown principal in the live host/QEMU remote-gateway path before the broker returns a CapSet. The proof also covers anonymous profile mismatch and asserts denied re-login clears previous per-connection/session view state.

Gate 4: Broker Remote Bundle

  • Add an AuthorityBroker path for remote-client bundles, or a temporary clearly named wrapper around the existing shell bundle that does not imply terminal authority.
  • Bundle at least session and systemInfo; add one demo service cap such as chat or paperclips for behavior proof.
  • Add a remote-client bundle shape that preserves the useful default-operator service surface without becoming an operator shell bundle. It should include a restricted launcher/service-runner descriptor for allowed service binaries, broker-held or remote-proxyable service endpoints such as chat and adventure, and enough metadata for the UI to construct launch plans for server processes. It must not grant a raw shell launcher, terminal authority, raw ProcessSpawner, raw network factories, or endpoint owner authority to browser code.
  • Ensure anonymous/guest/default remote bundles do not receive operator shell launcher or broad service endpoints unless policy explicitly grants them.
  • Add wrong-name and wrong-interface tests for RemoteCapSet.get.

Gate 4A: Remote Service Catalog, Launch DTO, Adventure Launch, And Game Server Caps

  • Define a remote service catalog DTO or capnp-rpc object. It should list policy-approved service profiles, runnable binaries, companion processes, required grant names/interfaces/transfer modes, exported capability descriptors, attach/start/stop policy, and whether each grant is backend-held, service-owned, or a client facet. The current DTO catalog describes available DTO services plus Adventure/Paperclips launch profiles. Adventure start/attach is the current restricted-runner slice; Paperclips attach/start/stop policy remains future runner work.
  • Define the restricted service-runner launch request/status/probe DTO shape: submit a catalog profile plus explicit named grants, then return side-effect-free support state, accepted grant names, a message, and planned remote descriptors for exported or broker-held capabilities. This slice intentionally does not start processes, create endpoint owners, attach returned caps, or expose raw ProcessSpawner, process owner handles, endpoint owner caps, local cap IDs, result-cap slots, or browser-held capOS caps.
  • Implement the actual restricted service-runner behind the serviceLaunch contract for Adventure in the default make run manifest. The service runner may use local spawn authority internally, but the remote/browser-facing contract must still expose only launch request/status DTOs and remote capability descriptors, never raw spawn authority or local handles.
  • Implement the first game-server flow for Adventure. The backend should use the remote session’s restricted launcher/service-runner to start adventure-server and simple NPC companion processes with the remote-safe endpoint grant shape: the Adventure endpoint owner and Chat client facet are passed to child processes, while the gateway’s system Console cap is not regranted across the operator-session boundary. The backend then attaches or retains backend-held Adventure and chat-facing service descriptors/caps. Chat now uses a per-session worker endpoint proxy for Chat.send; Adventure status, look, inventory, go(direction), and bounded take/drop/use item calls use the same pattern after launch. Broader Adventure endpoint calls and rich client controls remain later.
  • Implement the Paperclips direction as soon as the server-owned Paperclips server profile is available in the remote catalog: start or attach to the authoritative Paperclips server, read structured status/project/command descriptors, and submit commands through server-owned capabilities. Until then, the UI may show Paperclips as “terminal-only/not remote-proxyable yet” rather than scraping terminal text.
  • Prove launch denials are explicit: disallowed binaries, missing required grants, wrong-interface grants, stale sessions, and anonymous/guest profiles without service-runner authority all fail before any process is started or any returned cap is exposed. The live remote-gateway proof covers stale sessions and anonymous/no-runner sessions in the CLI QEMU path; guest admission now has a dedicated RemoteGatewayRequest.guestLogin arm, and guest sessions go through the broker/account-store remote-client bundle policy with the same no-runner constraint.
  • Prove process handles and endpoint owner caps stay backend-local or are withheld entirely from the browser. Browser-visible state is limited to launch status, service descriptors, command/status view models, denial diagnostics, and redacted transcript rows. CLI and UI smoke checks reject raw authority markers in transcripts, reports, and API envelopes.
  • Add a focused guest remote-gateway login proof once the wire protocol and gateway expose a concrete guest auth adapter, then repeat the same no-runner serviceLaunch denial assertions for guest sessions. Landed 2026-05-08 03:59 UTC. The QEMU interop harness ships a guest admission happy proof (manifest seeds a guest principal, gateway accepts the requestedProfile = "guest" request) and an guest launch-denial proof (successfully admitted guest sessions repeat the service-launch denial matrix; in the Adventure interop manifest this proves the guest bundle still lacks service-runner authority even when the operator path can launch) plus an auth denial guest profile mismatch proof (gateway refuses requestedProfile = "operator" through the guest method with the redacted "guest login denied" message). The bridge host-tests additionally pin the RemoteErrorCode::DisabledAuthMethod denial that fires when the manifest has no guest seed.

Gate 5: Per-Session Worker And Proxy Lifetime

  • Host the first post-auth endpoint-backed remote cap, Chat.send, in a per-session worker/proxy context instead of calling it from the gateway process.
  • Associate the first chat proxied calls with the live remote session context; the focused QEMU proof shows the spawned chat worker running with the operator session context.
  • Drop/release the chat worker holds when logout is called, the connection closes, or the worker exits; teardown now asks the worker to shut down through its control endpoint and falls back to termination only if that path fails.
  • Generalize the worker/proxy lifecycle infrastructure for the currently supported endpoint-backed calls. Chat send and Adventure status/look/inventory now share worker spawn validation, exactly-one parent control endpoint validation, graceful shutdown, forced termination fallback, logout/close teardown, and release flushing.
  • Add the first richer Adventure worker/client protocol slice on top of the shared lifecycle manager: read-only Adventure.look and Adventure.inventory now share the same per-session Adventure worker as Adventure.status.
  • Add the first service-specific mutable Adventure worker/client protocol slice: bounded Adventure.go(direction) now runs through the same per-session Adventure worker and returns bounded movement text plus room state.
  • Add the first item-oriented Adventure worker/client protocol slice: bounded Adventure.take(item), Adventure.drop(item), and Adventure.use(item) run through the same per-session Adventure worker, validate transcript-safe item tokens, and return bounded text or room state to the CLI and web bridge.
  • Add service-specific worker/client protocol slices for broader mutable Adventure calls and future Paperclips service calls on top of the shared lifecycle manager.
  • Treat send-side disconnects while replying as connection close, then release gateway-held state through the existing per-connection teardown path instead of failing the whole gateway process.
  • Prove stale proxy calls after logout/disconnect fail closed.

Host-client/backend coverage now includes pre-session bootstrap reset and zero-byte read-timeout retry, repeated DTO calls, repeated post-logout stale-call probes, authenticated gateway close during a call, and oversized gateway response frames. The scripted CLI retries authMethods connection resets before login so QEMU host-forwarding races do not look like real session loss. The trusted web backend also retries a pre-session authMethods bootstrap disconnect or no-byte read timeout before any auth inventory or session state exists, clears backend-held session state for disconnect/oversized response failures, and returns user-facing gatewayDisconnected / reconnectRequired guidance without exposing raw frame errors to browser JavaScript. Kernel deferred TCP recv waiters now fail closed with an error CQE on terminal runtime/transport errors instead of dropping the pending call without completion; WouldBlock still requeues, and socket close still returns zero-byte EOF. The gateway now uses a connection frame-read wait instead of the short service-call wait, so an idle TCP remote session remains open past the former five-second read window and tears down only when the peer closes or the transport actually fails.

Gate 6: Capability Calls Beyond Chat

  • Call at least two granted capabilities through generated host bindings. The current proof covers UserSession.info/session, SystemInfo.motd/system_info, the first worker-backed Chat.send, and the worker-backed Adventure methods, Adventure.status, Adventure.look, Adventure.inventory, mutable Adventure.go(direction), and bounded item controls Adventure.take/Adventure.drop/Adventure.use. Broader Adventure controls and PaperclipsGame.status wait for later service-specific proxy/client gates.
  • Prove a service-specific domain denial remains a schema result rather than a transport failure. The focused chat proof asks the per-session worker to call Chat.send without first joining the proof channel and requires chatSent(false) in the CLI/UI API smokes, not RemoteError or a gateway disconnect.
  • Prove target service sees session-bound caller metadata rather than a caller-selected identity field. The remote-client chat facet now grants only the existing bounded disclosure fields to the per-session worker, the worker explicitly requests those fields on Chat.join/Chat.send, and chat-server logs a target-service proof only after it sees a live opaque caller-session reference with operator principal class, password auth strength, and operator profile class. Browser/client-visible DTOs still do not expose raw scoped refs, local cap handles, or process handles.

Gate 7: Transport Security And Non-Password Auth Expansion

  • Add capOS-terminated TLS server config once certificate/TLS primitives exist. Until then, the first public Web UI ingress terminates HTTPS at the provider load balancer (see “Selected public ingress and TLS policy” under Gate 1B); this checklist item is the capability-native successor, not the first public proof.
  • Add mTLS client identity admission when certificate policy and account bindings exist.
  • Add public-key auth with protocol-domain-separated challenge bytes.
  • Add OIDC device-code and browser-assisted PKCE flows when OAuth/OIDC token capabilities exist.
  • Add passkey/WebAuthn through the web gateway path when authenticator primitives exist.
  • Add service/workload credential admission for non-human automation.

Gate 8: Renewal, Revocation, And Resource Bounds

  • Wire kernel-backed UserSession.logout and gateway/connection close propagation for the current DTO remote-session gateway.
  • Reject already-admitted endpoint returns after caller logout/session death before result bytes, exception payloads, or result caps are installed in the stale caller.
  • Extend logout/revocation cleanup to live remote proxy objects once standard RPC framing lands.
  • Add renewal only through a narrow session-manager/broker path that does not revive stale ordinary grants by accident.
  • Add resource limits for connections, remote refs, in-flight calls, queued promises, result sizes, and per-session CPU/memory/network accounting. Initial four classes landed 2026-05-03 16:21 UTC: transcript ring (6d855c01), backend cap-holders + catalog mirrors (5ec0e456), outstanding worker calls per session (0f82528c), and gateway concurrent logins per principal (99955d59). Bound choices and the exhaustion-as-typed-denial contract are documented in the proposal’s “Resource and revocation bounds” section. Per-session CPU/memory/network accounting and remote-ref limits remain future work tied to the capnp-rpc rewrite.
  • Add explicit CapException/RPC exception tests for the currently representable Gate 8 failure classes: transport breakage, worker/proxy failure, stale sessions after logout, and oversized messages. Host coverage now checks that the backend-only capnp-rpc Chat facade maps DTO transport breakage to capnp::ErrorKind::Disconnected, maps DTO denials and unexpected worker/proxy responses to Failed CapException-like errors, and does not expose raw proxy positions, local cap ids, result-cap labels, session ids, or socket hints in exception text. The trusted web bridge coverage now drives worker-targeted Chat.send disconnect, oversized worker response, and post-logout stale-session paths; each fails closed as gatewayDisconnected or staleSession, decrements outstanding worker-call accounting, clears or preserves backend state according to the existing lifetime contract, and keeps redacted transcript export free of raw socket errors, frame sizes, local cap ids, proxy positions, raw session id hex, passwords, and host endpoint hints. Revoked-lease coverage remains blocked rather than faked: the current DTO surface has lease timestamps in RemoteCapEntry, but no explicit revoke/lease-expired request path or RemoteErrorCode variant that can distinguish a revoked lease from the existing staleSession / methodDenied denials. Add the revoked-lease proof when the standard RPC object lifetime path or a reviewed DTO denial code makes it observable.

Gate 9: Bidirectional UI Composition

  • Keep this separate from Gate 1A. Gate 1A is a host-rendered UI over the existing client; Gate 9 lets capOS-side services propose bounded UI surfaces back to that host UI through explicit capabilities.
  • Add a proposal-level RemoteUiHost / RemoteUiSurface schema slice or equivalent typed DTOs for declarative UI patches and typed user events.
  • Keep the first UI proof behind a separate granted UI-surface cap, not implicit in RemoteSession or RemoteCapSet.
  • Prove a capOS service can open/update one bounded surface and receive one typed user event from the host UI.
  • Prove the same service cannot spoof login/permission chrome, inject raw JavaScript/CSS, persist layout or theme state, or exceed update/size quotas without explicit authority.
  • Add a host-app reset/close path that releases the UI surface and leaves underlying service caps intact.

Verification Targets

Initial documentation/planning check:

make docs
git diff --check

First implementation check:

cargo test --manifest-path tools/remote-session-client/Cargo.toml --target x86_64-unknown-linux-gnu
make run-remote-session-capset-interop
make run-remote-session-adventure-interop
make run-capnp-chat-interop

Security review checklist:

  • Remote client cannot obtain authority by guessing a cap name.
  • Remote client cannot replay a session or grant identifier on another connection.
  • Remote client cannot ask for a local cap slot, endpoint selector, or receiver metadata.
  • Logout/close/revocation tears down all session-bound proxies.
  • Guest/anonymous profiles receive only explicitly policy-granted caps.
  • Browser/agent paths never receive raw capOS capability objects client-side.
  • GUI/Tauri/web front ends keep capOS caps in the Rust/backend/gateway side of the trust boundary; UI code receives typed view models, command descriptors, or tool requests.
  • UI composition is capability-gated, declarative, quota-bound, and reversible by the user.

capOS SDK Crate And Dual Transport Backlog

Detailed decomposition for the front-door capos SDK crate: one published crate whose typed capability clients run unchanged against two transports – the in-process capability ring (an application running inside capOS) and a remote connection (a host-side RPC client). This extends the Crate publication roadmap track with a concrete architecture and publication ordering, and consumes the remote transport planned in Remote Session CapSet Client. docs/tasks/README.md should point here when selecting slices; it should not inline the details.

Visible Outcome

  • A Rust author writes against typed capability clients (Console, Timer, EntropySource, VirtualMemory, and future caps) once. cargo add capos builds an in-system no_std application that reaches the kernel through the ring. cargo add capos --no-default-features --features remote builds a host program that reaches a capOS instance through the remote session transport, using the same client API.
  • The crates.io names capos (front-door facade) and the capos-* family (capos-abi, capos-lib, capos-config, capos-rt) are published with real content, stable versioning, rendered docs, and license/metadata.

Why This Is High Priority

Two reasons, one architectural and one external:

  • Architecture. The Cap’n Proto-first design already treats each process as a capnp-rpc vat and the per-process ring as that vat’s connection to the kernel (design principle 5). “In-system app” and “remote client” differ only in the transport under the typed clients, so a single SDK with a transport seam is the natural shape rather than two parallel client stacks.
  • Namespace contention. The capos/capos-* crate names overlap with an unrelated capability-OS effort already publishing under the same prefix on crates.io. crates.io is a flat, first-come namespace, so publishing the real reusable crates early (and reserving the bare capos facade) both prevents name contention and establishes a dated public-use record. Publish crates with real content; do not register empty placeholder names, which the crates.io policy can reclaim.

Transport Seam

The seam is a Transport trait (working name) that the typed capability clients depend on instead of the concrete ring. It must express the existing ring opcode semantics faithfully rather than collapse them:

  • call(cap, interface_id, method_id, request_bytes) -> CallHandle – maps to the ring CALL SQE.
  • completion retrieval (poll / wait) – maps to consuming a CQE after cap_enter.
  • release(cap) – maps to the local RELEASE SQE.
  • server-side recv / return_(call, response_bytes, result_caps) – maps to RECV / RETURN for endpoint-owning servers. The first SDK slice may scope to the client side (CALL/RELEASE + completions) and add the server side when an endpoint-owning userspace service consumes the SDK.

Two implementations:

  • RingTransport (no_std, default ring feature): wraps the existing capos-rt single-owner ring client and cap_enter. This is current behavior moved behind the trait, not new behavior.
  • RemoteTransport (std, remote feature): connects over the remote session transport, authenticates through SessionManager/AuthorityBroker, holds a forwarded RemoteCapSet, and dispatches typed calls over the same length-prefixed DTO gateway used by tools/remote-session-client/.

Crate Layout

craterolestd/no_std
caposfront-door facade: prelude, re-exports runtime + typed clients, transport selection by featureno_std core; std only behind remote
capos-rtring runtime (_start, syscalls, ring client); provides RingTransportno_std
capos-abiABI/policy constantsno_std
capos-libhost-testable pure logic (ELF, CapTable, ring/SQE validation)no_std + alloc
capos-configmanifest/CUE loader, ring structsno_std + alloc

Feature flags on capos: ring (default, in-system, no_std), remote (host client, pulls std + the remote transport deps). The shared core – typed capability clients plus capnp encode/decode (capnp 0.25 is no_std + alloc) – must stay no_std; std is confined to the remote transport. Open decision: whether RemoteTransport lives in capos behind remote or in a separate capos-remote crate the facade re-exports.

Honesty / Caveats

  • The remote transport is transitional. The remote session path is currently length-prefixed schema-framed DTOs, not standard capnp-rpc with live object proxies; the rewrite is gated on a reviewed capOS userspace async runtime or a sync-friendly Cap’n Proto RPC adapter (see Remote Session CapSet Client, Gate 1 / Task 1). So the first remote form proxies through the trusted host backend boundary; it is not arbitrary remote capability invocation with promise pipelining. Do not document it as live guest-wire capnp-rpc.
  • capnp-rpc 0.25 is std-only and needs a futures executor, which is why the remote transport is std-only and host-side by necessity. This matches the in-system/no_std versus host/std split rather than fighting it.
  • The trust models differ but the client API does not: in-system the kernel hands an unforgeable bootstrap CapSet; remote, the client only ever sees caps explicitly forwarded into its authenticated session. Keep the API identical and the authority boundary explicit.

Publication Decision

Decision recorded 2026-05-22 23:41 UTC: the SDK track publishes real crates only, never empty placeholder packages. This follows the Cargo publishing model where crate names are first-come, first-served, published versions are permanent, and publishers should fill out license, repository, readme, description, keyword, and category metadata before upload. It also follows the crates.io usage policy against packages that exist only to reserve a name for a prolonged period without genuine functionality or development activity. The exact planned crates.io names (capos, capos-abi, capos-capnp-build, capos-config, capos-lib, and capos-rt) were not present in the crates.io API or sparse index when checked on 2026-05-22 23:39 UTC, and were re-confirmed unclaimed on 2026-06-02 16:10 UTC; the adjacent capos-bitstruct crate exists under the unrelated cap-os/rust-tools repository and is the visible namespace contention signal. libcapos and libcapos-posix are not crates.io crates (they ship as release artifacts, see item 7), so their names are deliberately left unclaimed on crates.io – an accepted residual risk. Re-check the registry immediately before any real publish.

The publish set and order are:

  1. capos-abi 0.1.0 – shared no_std ABI and policy constants.
  2. capos-capnp-build 0.1.0 – build-time schema generation helper used by capos-config; it is a real package because capos-config cannot publish with an unpublished path-only build dependency.
  3. capos-config 0.1.0 – manifest/config/ring structs and generated schema module built from packaged schema source; depends on capos-abi and capos-capnp-build.
  4. capos-lib 0.1.0 – reusable no_std + alloc pure logic; depends on capos-abi and capos-config.
  5. capos-rt 0.1.0 – in-system no_std runtime and ring transport, published with the bare facade slice after the transport seam lands.
  6. capos 0.1.0 – front-door facade with ring as the default no_std feature and remote reserved as a std feature until the remote transport slice closes.
  7. libcapos / libcapos-posix – C-substrate distribution. Decision 2026-06-02 16:10 UTC: these ship as release artifacts only (prebuilt libcapos.a / libcapos_posix.a plus the capos/*.h headers attached to a GitHub release/tarball), not crates.io crates, because their consumers are C programs that link the archive, not Rust crates run through cargo add, and they build only for the custom x86_64-unknown-capos target. The artifact build is bundled into the same explicit operator publish wave as the Rust crates above (built via make libcapos / make libcapos-posix, verified through the existing C smokes). Their crates.io names are intentionally not reserved (accepted residual risk).

Do not reserve capos-remote yet. If slice 4a proves the remote backend should live outside the facade crate, publish capos-remote with real host transport code in that slice.

The MSRV target for Rust crates was 1.85.0 (the Rust 2024 edition floor) until the first-public-release slice verified the package set against stable. The verified MSRV is stable Rust 1.88.0: capos-config uses let chains in if/while conditions (&& let Some(..) = ..), stabilized in 1.88.0, so the set does not build on 1.85.0. rust-version = "1.88.0" is set on all four published crates. The current repository still builds the OS with the date-pinned nightly-2026-04-20 toolchain. capos-rt must document the capOS target toolchain requirement separately if its package cannot be built on stable for the custom userspace target.

The license gate is satisfied: the repository carries LICENSE-APACHE, LICENSE-MIT, and LICENSING.md, and per LICENSING.md the published SDK crates use MIT OR Apache-2.0 (the kernel/system is Apache-2.0-only and is not in the publish set). The publish metadata uses that SPDX expression in the license field of capos-abi, capos-capnp-build, capos-config, capos-lib, capos-rt, and capos. Each carries repository/readme/description/keywords/categories metadata and docs.rs settings.

Generated Cap’n Proto bindings do not ship as a separate published crate in the first release set. capos-config ships the schema source needed to build its own generated module, and capos-capnp-build remains the single no-std patch helper for that generation path. The publish slice must make the capos-config package self-contained instead of relying on repository-sibling paths such as ../schema/capos.capnp. A separate generated-bindings crate is deferred until an external consumer needs schema bindings without the manifest/config crate.

Versioning Policy

The published crates – capos, capos-rt, capos-abi, capos-lib, capos-config, and capos-capnp-build – follow one policy:

  • Pre-1.0 SemVer. The set starts at 0.1.0. Per Cargo’s SemVer rules a 0.y.z release treats the minor field as the breaking-change field: a breaking API/ABI change bumps the minor (0.1 -> 0.2); a backward-compatible addition or fix bumps the patch (0.1.0 -> 0.1.1). Treat the whole 0.x series as unstable and do not promise API stability until a 1.0.0 is deliberately cut.
  • Schema/ABI contract maps to a breaking bump. These crates encode the capOS wire contract: capos-config carries the generated Cap’n Proto bindings and the ring CapSqe/CapCqe structs, capos-abi carries the policy/quota constants, and the typed clients in capos-rt/capos encode the SQE CALL/RELEASE wire format. A change to schema/capos.capnp, the generated bindings, the ring wire layout, or a typed-client method’s wire encoding is a breaking change and bumps the minor field (pre-1.0). The SDK is a schema consumer; the schema change lands under its owning plan, and this set’s version bump follows it.
  • Lockstep versioning across the set. Because the crates share the wire contract and depend on each other by exact path/version, publish them at the same version and bump them together rather than independently. A consumer pinning capos = "0.1" then gets a coherent capos-rt/capos-config/ capos-abi graph. Independent per-crate versioning is deferred until a crate has external consumers and a stability story of its own; revisit at 1.0.
  • MSRV is stable Rust 1.88.0. Verified by the slice-2 publish dry-run: capos-config uses let chains (&& let Some(..) = ..) stabilized in 1.88.0, so the set does not build on the Rust 2024 edition floor 1.85.0. rust-version = "1.88.0" is set on the four published workspace crates. The OS itself still builds only on the date-pinned nightly; the MSRV applies to the host-target build of the reusable crates, not the kernel.

Publish Dry-Run Gate

make sdk-publish-dry-run is the repeatable, one-command reproduction of the slice-2 publish verification. It runs, on the host target:

  1. A coordinated multi-package cargo publish --dry-run over capos-abi -> capos-capnp-build -> capos-config -> capos-lib in dependency order. Cargo packages each crate, unpacks the earlier ones into a temporary registry, and verify-builds the later ones against them, so it fails loudly if publish metadata, packaging (including the capos-config packaged schema/capos.capnp), or the dependency order regresses. Coordinated multi-package dry-run is nightly-only, so this step runs on the repo-pinned nightly.
  2. An MSRV-floor cargo +1.88.0 build of the same set, catching a regression that would need a newer toolchain than the recorded MSRV (the nightly dry-run alone would not).

The gate is prep-only: every dry-run upload aborts, and no real cargo publish runs. Like dependency-policy-check, it is a focused target, not part of make check, because the verify-builds are slow.

capos-rt and capos are part of the publish set but are not in this host gate: they build only on the custom userspace target/code model, so a host-target cargo publish --dry-run verify-build does not apply. The release gate verifies their capOS-target builds via make capos-sdk-check rather than a host dry-run. The initial publication used the local Cargo API-token path after the final crates.io name re-check; subsequent releases can use .github/workflows/publish-crates.yml after each crate’s trusted publisher is configured on crates.io for this repository, workflow, refs/heads/main, and the crates-io-publish GitHub environment.

Ordered Slices

The near-term high-priority slices (1-3, 5) do not depend on the capnp-rpc transport rewrite and have landed. Slice 4 is split: slice 4a (a transitional RemoteTransport over the existing host DTO backend) can ship now, while slice 4b (the live-proxy capnp-rpc upgrade) is gated on the remote-session async-runtime decision.

  1. Publish-set + reservation decision (docs-status, closed 2026-05-22 23:41 UTC). The decision above pins the publish order, exact crate names, MSRV target, feature-flag story, license gate, metadata requirements, and generated-binding packaging decision.
  2. First public release of existing layers (behavior, prep landed 2026-05-23; 0.1.0 published 2026-06-05). Publish metadata (description, MIT OR Apache-2.0 license, repository, keywords, categories, rust-version = 1.88.0, README, docs.rs config) added to capos-abi, capos-capnp-build, capos-config, and capos-lib. capos-config is now self-contained: it ships schema/capos.capnp (an in-repo symlink to the single repo-root schema, materialized into the package archive) and capos-capnp-build resolves it from the crate’s own manifest dir via generate_packaged_schema_bindings(). Verified with a coordinated cargo publish --dry-run of the set in dependency order plus a stable-1.88.0 build and a local docs render. The initial 0.1.0 versions were published from origin/main on 2026-06-05 through the local Cargo API-token path after the final crates.io name re-check. The capos-config docs.rs build is accommodated by a packaged generated-binding fallback used only when DOCS_RS is set, so the docs.rs sandbox no longer needs the external pinned capnp binary. The repository-URL rewrite is no longer a blocker: decision 2026-06-02 16:10 UTC keeps repository = "https://github.com/ei-grad/capos" for the first wave (publishing many crates from one repo is standard; the repository field can be updated later at repo-migration time without republishing). This claims the capos-* prefix with shipped code.
  3. Reserve bare capos + transport seam (behavior, closed 2026-05-23 23:07 UTC). capos-rt now defines the Transport trait (src/transport.rs): the client-side seam of submit_call / submit_call_with_copy_transfers / submit_call_borrowed_wait_forever (CALL), wait / try_complete (completion after cap_enter), and release_wait (local RELEASE). RingTransport is the existing single-owner RingClient viewed through the seam (current behavior, not new behavior); both RingClient and RuntimeRingClient implement Transport. The 189 client-side typed-client methods take &mut impl Transport; the result-cap-adopting methods stay on the concrete RuntimeRingClient because generalizing result-cap adoption across transports is later server-side/promise work. The new standalone capos/ facade crate re-exports the runtime, typed clients, the entry_point! macro, and a prelude behind the default ring feature; the later 4a slice made remote a host-only feature over the transitional DTO backend. QEMU proof: make run-spawn boots demos/timer-smoke, whose typed-client code now imports from capos instead of capos-rt, and asserts [timer-smoke] Timer now/sleep ok.. capos-rt and capos 0.1.0 were published with the slice-2 set after the final name re-check and make capos-sdk-check custom userspace target verification; the repository-URL rewrite is no longer a blocker – see slice 2.
  4. remote transport backend, split into 4a/4b.
    • 4a – transitional RemoteTransport (behavior, closed 2026-06-06). The capos facade’s remote feature now builds on the host target without the default ring feature, enables capos-rt/host-test for the shared typed clients, and provides RemoteTransport over the existing host DTO backend boundary used by tools/remote-session-client. RemoteTransport authenticates through the same DTO gateway, obtains forwarded caps through CapSetGet, assigns synthetic host-side cap ids, and proves SystemInfo.motd through SystemInfoClient::motd_wait over the current length-prefixed DTO wire without making the unpublished host tool crate part of the published capos package graph. Unsupported calls fail closed with ring-style transport completions. Negative-path hardening now covers wrong-interface and missing-cap denials, released local cap ids, remote denied calls, malformed and mismatched DTO responses, and disconnects during synchronous DTO calls. This is not blocked on the async-runtime decision and remains transitional host-backend proxying, not guest-wire capnp-rpc with promise pipelining.
    • 4b – live-proxy capnp-rpc upgrade (behavior, blocked). Replace the DTO wire under RemoteTransport with standard capnp-rpc framing and live object proxies. Gated on a reviewed capOS userspace async runtime or a sync-friendly Cap’n Proto RPC adapter, tracked by remote-session Gate 1 (docs/backlog/remote-session-capset-client.md). Do not block 4a on it.
  5. Versioning + publish CI (harness-hardening, closed 2026-05-24). The “Versioning Policy” section above pins pre-1.0 SemVer, the schema/ABI-to-breaking-bump mapping, lockstep versioning, and the 1.88.0 MSRV. make sdk-publish-dry-run reproduces the slice-2 publish verification in one command (coordinated multi-package cargo publish --dry-run over the four host-buildable crates in dependency order + an MSRV-floor build); see “Publish Dry-Run Gate”. The capos facade README.md documents the working ring default and the transitional remote feature. .github/workflows/publish-crates.yml runs the same release gates, obtains a short-lived crates.io token through trusted publishing only from refs/heads/main, skips versions already present on crates.io, and publishes the six crates in dependency order when its manual publish input is enabled with the current explicit user release instruction recorded in the dispatch input. Non-main publish=true dispatches and publish dispatches without that current instruction fail before any crates.io token is requested. The initial six-crate 0.1.0 release is complete; future releases use the workflow only after explicit user authorization for that release and once crates.io trusted publishers are configured for the six crates.

Conflict Surface

  • Owns: NEW capos/ facade crate, this backlog file, the roadmap “Crate publication” section, [package.metadata]/publish metadata on the published crates, and any new docs/proposals/capos-sdk-proposal.md.
  • Coordinates (do not run blindly in parallel):
    • capos-rt/ – the Transport-trait refactor of typed clients. Serial with other capos-rt client changes.
    • tools/remote-session-client/ and the Remote Session CapSet Client plan – the remote transport reuses that host transport. The capnp-rpc rewrite is owned there, not here.
  • Must NOT touch: schema/capos.capnp or tools/generated/ (the SDK is a schema consumer, not a producer) and kernel behavior. If a slice needs a schema change, it queues on the shared schema serial surface under the owning plan, not this one.

Grounding Files

  • docs/roadmap.md “Crate publication” track.
  • docs/backlog/remote-session-capset-client.md (remote transport, gating).
  • docs/proposals/remote-session-capset-client-proposal.md.
  • docs/proposals/userspace-binaries-proposal.md (the C substrate layer under the same SDK family).
  • docs/capability-model.md, README “Core Idea” (design principle 5: each process is a capnp-rpc vat; the ring is its connection).

Capability-Infrastructure Cluster Backlog

A planning audit found a cluster of maturing proposals whose Phase 1 slices are now extractable (their stated prerequisites have landed) plus the Stage 6 capability remainder. Most of these slices ADD interfaces to schema/capos.capnp and therefore share the schema serial surface: only one plan at a time may change the schema (docs/backlog/index.md “Concurrency Notes”), and the next plan must rebase on the generated-code refresh. This file decomposes the cluster and records the recommended ordering so the slices do not all become ready at once and collide on that surface.

docs/tasks/README.md points here for the cluster; it should not inline the details.

Ordering Contract

  • The non-schema slices (capos-service framework, tickless idle, default avatar) are dispatchable in parallel today and have their own ready task files; they do NOT queue here.
  • The schema-touching slices below queue on the shared schema serial surface. Promote ONE at a time from this backlog into a docs/tasks/ file, land it, refresh generated bindings, then promote the next. Do not file all of them as ready simultaneously.
  • The ResourceProfileRecord / ManifestResourceProfile schema, capos_config::ResourceProfile carrier, and non-schema spawn-limit enforcement have landed. Crypto key caps Phase 1 has also landed. The next queued schema-serial slice is crash-recovery-stale-cap-phase1.
  • Recommended schema promotion order from here: crash-recovery stale-cap → authority-broker → live-upgrade CapRetarget → Stage 6 remainder. Reorder by explicit user priority. Do not promote a schema slice in parallel with another schema-surface task.

Schema-Serial Phase-1 Slices

Each slice names a 1-line scope, the owning proposal, and the conflict domains its eventual task file should carry. All share interface:schema-capos-capnp + path:schema/capos.capnp + path:tools/generated/ (the serial surface) in addition to the listed domains.

monitoring-log-surface (landed)

  • Scope: LogSink/LogReader schema + a minimal userspace log service backed by Console, with logLevel enforcement and scoped LogSink caps granted to children at spawn. Source: docs/proposals/system-monitoring-proposal.md.
  • Domains: resource:system-monitoring, path:kernel/src/cap/, path:demos/, docs:system-monitoring.
  • Landed (2026-05-25): additive LogSink.write @38 / LogReader.read @39 plus LogRecord/LogFilter (reusing LogLevel), backed by a bounded drop-oldest kernel ring (kernel/src/cap/log.rs). The sink drops below- SystemConfig.logLevel records (boot-seeded) and forwards accepted records to serial; the reader returns cursor/filtered records with nextCursor/dropped. capos-rt LogSinkClient/LogReaderClient, producer/reader demos, system-monitoring-log.cue, and make run-monitoring-log-smoke prove the sink drop, read-back, and reader-side minLevel filter. The wider Severity (critical), correlation fields, token-bucket backpressure, and persistent retention remain later phases. Task: docs/tasks/done/2026-05-25/cap-infra-monitoring-log-surface.md.

crypto-key-caps-phase1 (landed)

  • Scope: SymmetricKey/PrivateKey/PublicKey schema interfaces + a software-backed userspace key service + a QEMU encrypt/sign smoke over the cap boundary. Unblocks TLS, OIDC, volume encryption, signed audit, SSH cert upgrade. Source: docs/proposals/cryptography-and-key-management-proposal.md.
  • Domains: resource:crypto-key-service, path:demos/, docs:cryptography-and-key-management.
  • Landed (2026-06-06): minimal RAM-only SymmetricKey, PrivateKey, and PublicKey ABI in schema/capos.capnp, regenerated bindings, capos-tls XChaCha20+HMAC-SHA256/P-256 cores, RAM KeyVault private-key custody, and the development-only KeySource bootstrap. Local proofs cover symmetric AEAD/MAC, private/public signing, KeyVault stale-handle custody, and development-source admission/rejection. Remaining work is production/runtime key service wiring, symmetric derivation/wrapping, persistence, hardware/cloud custody, ACME/TLS handshakes, and production public-ingress key sources. Task: docs/tasks/done/2026-06-06/cap-infra-crypto-key-caps-phase1-reconcile-local-proof.md.

time-wallclock-phase1 (landed)

  • Scope: WallClock read cap + ClockProvenance label + manifest-seeded boot time; WASI clock_time_get(REALTIME) and audit timestamp delegate to it. Source: docs/proposals/time-and-clock-proposal.md.
  • Domains: resource:time-clock-authority, path:kernel/src/cap/, docs:time-and-clock.
  • Landed (2026-05-24, fixed-boot-base variant): WallClock.wallTime read cap + ClockProvenance enum (untrusted @0 fail-closed zero value), KernelCapSource::wallClock @36, kernel/src/cap/wall_clock.rs, the capos-rt WallClockClient, and a shell date command granted wall_clock in system-shell.cue and asserted by make run-shell. Manifest seedUtcSeconds, a stateful WallClockState, WASI realtime-clock delegation, and init audit/TLS grants remain Phase 1.x / Phase 2 follow-ups. Task: docs/tasks/done/2026/time-wallclock-phase1.md.

crash-recovery-stale-cap-phase1

  • Scope: stale-cap DISCONNECTED/server-death CQE propagation to in-flight callers and endpoint holders on unplanned process death, plus a redacted CrashRecord appended to AuditLog. Source: docs/proposals/crash-recovery-supervision-proposal.md.
  • Domains: resource:crash-recovery, path:kernel/src/cap/, path:kernel/src/process.rs, docs:crash-recovery.

debug-session-phase1

  • Scope: DebugSession attach cap (owner-consent or broker maintenance grant, audited) + read-only cap-table snapshot that transfers no authority. Source: docs/proposals/debug-trace-authority-proposal.md.
  • Domains: resource:debug-trace-authority, path:kernel/src/cap/, docs:debug-trace.

authority-broker-phase1

  • Scope: endpoint-served AuthorityBroker + ShutdownControl schema + runtime client + a QEMU proof that an anonymous shell cannot invoke shutdown. Source: docs/proposals/userspace-authority-broker-proposal.md.
  • Domains: resource:authority-broker, path:init/, path:shell/, docs:userspace-authority-broker.
  • Status note: the interim kernel broker no longer owns hard-coded demo binary allowlists. kernelParams.authorityBrokerPolicy now carries the admitted session-context, remote-client spawn, and worker service grant policy with manifest validation. The endpoint-served userspace broker and shutdown-control interfaces remain the queued Phase 1 work.

live-upgrade-capretarget-phase1

  • Scope: ProcessControl + retargetCaps kernel op for stateless Case 1 upgrades, with a QEMU retarget-mid-call smoke. Foundation for DDF userspace-driver fault containment. Source: docs/proposals/live-upgrade-proposal.md.
  • Domains: resource:live-upgrade, path:kernel/src/cap/, docs:live-upgrade.

system-info-hostname (done)

  • Scope: add hostname to the SystemInfo cap + kernelParams.hostname + manifest field. Source: docs/proposals/system-info-proposal.md Phase 3.
  • Domains: resource:system-info, path:kernel/src/cap/, docs:system-info.
  • Landed: SystemInfo.hostname @1 served from kernelParams.hostname (default capos), printed by the shell hostname command, asserted in run-shell. Task: docs/tasks/done/cap-infra-system-info-hostname.md.

stage6-remainder

  • Scope: the remaining Stage 6 capability semantics – SharedBuffer SQE opcode + kernel mapping authority, typed notification objects with ring Recv integration, and CapabilityManager.list/grant. Decomposed in docs/backlog/stage-6-capability-semantics.md; queue each as its own slice on the schema surface. Source: roadmap Stage 6.
  • Domains: resource:stage6-capability-semantics, path:kernel/src/cap/, path:kernel/src/cap/ring.rs, docs:stage-6.

Non-Schema Slices

These are dispatchable now and are tracked as ready or done tasks, not queued on the schema serial surface:

  • Done: cap-infra-resource-profile-enforcement-local-proof – binds the existing ResourceProfileRecord / ManifestResourceProfile and capos_config::ResourceProfile carrier to remaining cap-slot and thread spawn-limit enforcement, with rollback proof (docs/tasks/done/2026-06-06/cap-infra-resource-profile-enforcement-local-proof.md).
  • Done: capos-service-lifecycle-slice1ServiceMain/lifecycle framework above capos-rt, one converted gateway proof (docs/tasks/done/2026/capos-service-lifecycle-slice1.md).
  • Done: default-user-avatar – deterministic native-shell avatar selection over the shipped flat catalog, printed in the shell session output without schema or broker changes (docs/tasks/done/2026/default-user-avatar.md).
  • Done: scheduler-tickless-idle-step6 – enable true-idle tickless windows while keeping cap-enter polling dependencies periodic (docs/tasks/done/2026/scheduler-tickless-idle-step6.md).

Still-Gated (not in this cluster)

Memory-authority, OOM/swap, certificates/TLS, OIDC, volume-encryption, go-runtime, chat-multimedia, llm/agent, browser, GPU, formal-MAC/MIC, cloud-metadata, HPC, scientific, hosted-agent-swarm remain gated on this cluster, DDF, networking, storage persistence, or SMP Phase C / Ring v2. See each proposal’s gating note and docs/backlog/research-design-gaps.md.

SMP Phase C Backlog

ARCHIVED — milestones complete; residual full-SMP-hardware work tracked in Scheduler Evolution “Phase F.5: Full-SMP Hardware Scalability”. Both visible milestones this backlog tracks landed: Multi-Process SMP Concurrency (the make run-smp-process-scale proof is complete) and In-Process Threading Scalability (closed at commit 136b72de, 2026-05-01 14:58 UTC). No SMP track is active in docs/tasks/README.md. This file is retained as historical context and as the proof-contract reference; do not select new work from it – the next visible SMP milestone is the planning slot in scheduler-evolution.md Phase F.5.

Detailed context for the selected SMP Phase C AP scheduler-owner proof and the remaining full-concurrent-SMP and in-process thread-scaling follow-on work.

Visible Goal

Move from a single scheduler owner to multiple CPUs that can run independent scheduler-owned kernel/user work concurrently, and prove that capability-owned processes can improve wall-clock performance on a deterministic CPU-bound workload under QEMU/KVM.

This backlog tracks two distinct visible milestones:

  1. Multi-Process SMP Concurrency: make run-smp-process-scale should boot a focused manifest, run a deterministic SMP scaling demo across independent worker processes, print verified workload output, and report comparable 1/2/4-process timing. The proof is complete only when repeated KVM-backed -smp 1 and -smp 2 runs show near-linear speedup for the selected workload, while the ordinary manifest, ring, thread, park, and process-exit smokes still pass under -smp 2.
  2. In-Process Threading Scalability: make run-thread-scale should run a deterministic workload across sibling threads inside one process, verify the result, and report comparable 1/2/4-thread timing. This milestone closed at commit 136b72de (2026-05-01 14:58 UTC) against the pre-collapse per-CPU placement model: caller-aware child publication and the existing timer fast-path slices produced repeated KVM-backed physical-core evidence above the configured 1-to-2 work and total speedup thresholds. The 4-worker row remained diagnostic rather than a linear-scaling claim. The 2026-05-02 per-CPU run-queue collapse retired that placement chain (caller-aware publication, per-CPU runnable queues, local-first stealing, the WakePolicy::QueueCpu(usize) variant). A post-collapse 3-run diagnostic on capos-bench 2026-05-02 10:42 UTC measured 1-to-2 work/total 1.890x/1.792x (slight improvement) and 1-to-4 work/total 1.504x/1.436x (clear regression on single-queue scheduler-lock contention). The formal capOS+Linux accepted-evidence pair landed against the same single-global-queue scheduler on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556: capOS work 1.883x / total 1.787x clear the configured 1-to-2 gates, while the 1-to-4 row (capOS 1.566x/1.538x vs Linux 3.963x/3.858x) is the diagnostic gating Phase D’s fair-share enqueue policy. Reintroducing per-CPU runnable queues with that policy must materially close the capOS-vs-Linux 1-to-4 gap before per-CPU queues land back in the scheduler. See docs/architecture/scheduling.md, docs/benchmarks.md, and docs/backlog/scheduler-evolution.md for the current state.

Full concurrent SMP scheduling remains the underlying kernel goal for the multi-process milestone. It means more than one CPU can own scheduler work simultaneously, including per-CPU runnable ownership, cross-CPU idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and reviewed lock/residency rules. The multi-process scaling demo is the first user-visible acceptance test for that kernel capability.

Completed Gates

  • Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and threading docs, and relevant docs/research/ files.
  • Migrate syscall entry/exit to the GS-base/swapgs per-CPU path, including non-sysretq scheduler/exit paths.
  • Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU coordination. The active backend is PIT-calibrated xAPIC MMIO with PIT/PIC fallback; x2APIC remains a later backend.
  • Add TLB shootdown before any user address space can run on more than one CPU over its lifetime.
  • Extend scheduler state from BSP-only ownership to per-CPU current-thread tracking with AP idle/runnable handoff. The first AP scheduler proof uses one AP as scheduler owner while the BSP stays in kernel idle, preserving the process-wide ring invariant.
  • Add QEMU proof that AP cpu=1 executes scheduler-owned work and the existing manifest/ring/thread/park smokes still pass under -smp 2.

In-Process Threading Closeout Rules

  • Resolve the scheduler hot-lock blocker before calling the selected milestone a scalability proof. The implementation at the time had per-CPU runnable queues and dispatch state, but they remained under one global Scheduler lock. A closeout branch should either split the hot dispatch path so ordinary timer preemption, local run-queue selection, and sibling CPU-bound thread requeue do not serialize on one global lock, or explicitly narrow the milestone to “functional in-process threading” and select a follow-on scheduler-lock scalability milestone. Completed 2026-05-01 14:58 UTC after repairing the benchmark shape against Linux baseline evidence and tightening caller-aware child publication: the repaired blocking-parent 16 MiB/64-round shape scales on Linux, and controlled physical-core capOS evidence passed the enforced 1-to-2 work and total gates. Four-worker capOS scaling remained a separate follow-up because total time still showed scheduler/exit/join overhead. (Update 2026-05-02: the per-CPU runnable queues and the caller-aware child publication described here were later collapsed into a single global runnable queue with the per-CPU run-queue-collapse cleanup slice; the recorded 1-to-2 capOS gates were against that pre-collapse placement model. The current single-global-queue scheduler now has its own formal accepted 1-to-2 pair on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556 (capOS work 1.883x / total 1.787x; Linux baseline 1.988x/1.987x); the 1-to-4 row remains the diagnostic gating Phase D’s fair-share enqueue policy. Per-CPU queues and caller-aware placement return when that policy ships and materially closes the capOS-vs-Linux 1-to-4 gap. See docs/architecture/scheduling.md, docs/benchmarks.md, and docs/backlog/scheduler-evolution.md for current state.)

  • Add a bounded timer continuation fast path as a conservative split-prep slice. Completed 2026-05-01 10:29 UTC: a user-mode LAPIC timer tick may keep running the current non-idle thread without entering sched::schedule() only when a previous locked slow path has published a clean hard-work summary, the CPU has no pending reschedule IPI, and the per-CPU one-skip budget has not been exhausted. The 2026-05-01 11:40 UTC follow-up keeps every dirty producer forcing at least one locked timer pass, then allows remaining run queues and handoff-current markers alone to be treated as fairness/protection state for one continued tick. Direct IPC, deferred cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set. The full scheduler path remains authoritative and still runs regularly for ring SQEs, cap-wait scans, cleanup, and accounting. This narrows timer-side scheduler-lock contention but does not by itself close the selected scalability milestone. Controlled capos-bench physical-core 0-3 before/after evidence for the initial strict-clean version stayed accepted=false: baseline target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/ reported work speedups 0.998x and 0.998x; after-change target/thread-scale/timer-fastpath-after-physical-20260501T104700/ reported work speedups 1.001x and 0.999x. Controlled capos-bench physical-core 0-3 evidence for the fairness-only follow-up also stayed accepted=false: baseline target/thread-scale/20260501T120224Z/ recorded work speedups 1.001x and 0.999x plus total speedups 0.913x and 0.587x; after-change target/thread-scale/20260501T120709Z/ recorded work speedups 1.001x and 1.000x plus total speedups 1.125x and 0.828x.

  • Add timer-fast-path attribution counters for guest-measure thread-scale runs. Completed 2026-05-01 10:58 UTC: aggregate and per-phase timer lines now report fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. These counters answer whether the bounded continuation path fires inside benchmark phases. They are benchmark-only instrumentation and do not close the current accepted=false speedup gate. Local one-run evidence in target/thread-scale/20260501T110157Z/ passed with the new fields present in every 1/2/4-thread measure.log; the timed work phase recorded fast_path_continues=0 for all three rows.

  • Add timer slow-summary reason attribution for guest-measure thread-scale runs. Completed 2026-05-01 11:28 UTC: aggregate and per-phase timer_slow_summary lines now report required/clean counts and the predicate reasons that keep TIMER_SLOW_PATH_REQUIRED set after a locked timer slow path. Reason fields cover nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. Local one-run evidence in target/thread-scale/20260501T112359Z/ passed; the work phase showed required=2/4/8, clean=0, run_queue_nonempty=2/4/8, handoff_current=2/4/8, and zero timer sleeps/timed waiters for the 1/2/4-thread rows. The behavior follow-up keeps the output shape but changes required to mean hard timer work, not run queues or handoff markers alone. This attribution does not close the selected accepted=false speedup gate.

  • Add explicit thread-placement evidence and conservative new-child publication spreading. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Guest-measure runs now emit aggregate and per-phase thread_placement lines for publish targets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPUs, first-selected CPUs, and migration events across CPU slots 0-3. Newly created non-single-owner threads avoid the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load under the scheduler lock; on equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning. Single-owner processes stay pinned to CPU0. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior. (Update 2026-05-02: the per-CPU run queues described here were later collapsed into a single global run queue, retiring the caller-aware placement and steal scans. See docs/architecture/scheduling.md and the per-CPU run-queue collapse entry in docs/backlog/scheduler-evolution.md for current state. Per-CPU queues return with the fair-share enqueue policy that Phase D will own.)

    The earlier avoid-caller rule passed the old spinning-parent 1-to-2 gate
    but was wrong for the repaired blocking-parent benchmark: a controlled
    run before the strict-load fix regressed to 1-to-2 work/total speedups
    `0.886x`/`0.928x` because the children were biased away from an otherwise
    available caller CPU. After the strict-load fix, controlled physical-core
    evidence passed the enforced 1-to-2 work/total gates with
    `1.828x`/`1.687x`. The same run recorded diagnostic 1-to-4 work/total
    speedups `3.029x`/`2.386x`; with scheduler switch diagnostics suppressed,
    those 1-to-4 diagnostics recorded `3.272x`/`2.303x`. Four-worker capOS
    scaling remains a follow-up, not a completed linear-scaling claim.
    
  • Preserve correctness gates while narrowing the lock: generation-checked ThreadRef ownership, no stale runnable queue entries after process or thread exit, direct-IPC preference without bypassing ownership checks, allocation-free timer/unblock runnable publication, and clean run-smp2-smokes evidence. Completed 2026-05-01 14:58 UTC: the caller-aware publication change preserves single-owner pinning and leaves timer/unblock/requeue/direct-IPC targeting unchanged; ordinary -smp 2 regression coverage passed.

  • Rerun controlled physical-core evidence after any scheduler hot-lock change. The milestone should stay open until host-summary work and total gates pass, or until the milestone scope is intentionally changed and recorded in docs/tasks/README.md, docs/roadmap.md, and this backlog. Completed 2026-05-01 14:58 UTC after benchmark repair: the matching Linux baseline validated the repaired blocking-parent 16 MiB/64-round shape on the selected physical CPU set with 1-to-2 work/total speedups 1.991x/1.990x and 1-to-4 work/total speedups 3.958x/3.834x. Controlled capOS evidence passed the enforced 1-to-2 work/total gates with 1.828x/1.687x.

  • Track post-closeout 4-worker scalability caveats separately from the recorded 1-to-2 milestone. The repaired benchmark proved the configured 1-to-2 work and total thresholds only against the pre-2026-05-02 per-CPU placement model. Linux now scales under the same repaired shape, so the remaining 4-worker capOS gap was not a benchmark-shape excuse. The strongest evidence at that time was: unsuppressed capOS 1-to-4 work/total speedups 3.029x/2.386x, scheduler-switch-log-suppressed diagnostics 3.272x/2.303x, and guest-measure runs that showed global Scheduler lock wait/hold cycles plus exit/join/block/schedule overhead while shared kernel locks were not visibly contended. Treat those numbers as historical; superseded by the formal capos-bench 2026-05-02 21:38 UTC pair against main commit 374f8556 (capOS work 1.883x / total 1.787x clears the configured 1-to-2 gates; 1-to-4 capOS 1.566x/1.538x vs Linux 3.963x/3.858x remains the diagnostic that gates Phase D’s fair-share enqueue policy). Future four-core scaling claims should add an explicit 1-to-4 gate, keep placement evidence enabled, separate work-window from total-time attribution, and continue splitting hot scheduler metadata/lock paths.

Multi-Process SMP Concurrency Gates

  • Split the current one-owner scheduler latch into per-CPU scheduler run queues or equivalent ownership that can keep more than one CPU executing scheduler-owned work at the same time. Completed in commit 20f6894 (2026-04-30 05:30 UTC) with per-CPU scheduler ownership, current and handoff tracking, per-CPU idle/fallback cleanup slots, and temporary BSP pinning for endpoint-, launcher-, spawner-, and thread-authority holders so process-wide ring paths remain single-owner during this milestone.
  • Add reschedule IPIs for idle-to-runnable handoff across scheduler owners. The current scheduler tree tracks pending reschedule IPIs per target CPU, wakes halted scheduler-owner loops for newly runnable work, and uses the same serialized fixed LAPIC IPI send path as TLB shootdown without claiming a general preemptive reschedule interrupt.
  • Prove concurrent scheduler-owned work on more than one CPU with independent worker processes first. This avoids process-wide capability ring races while still proving real multi-core execution. The focused proof harness is on mainline as of commit c2790c0 (2026-04-30 07:38 UTC), and the completed milestone is recorded at commit 3fb89923 (2026-04-30 09:45 UTC).
  • Add an SMP scaling demo binary and focused manifest. The first workload is segmented prime counting over generated ranges. It partitions work statically by worker index, avoids hot-path syscalls and serial output, produces aggregate prime-count/checksum verification, and prints one compact result line per accepted case.
  • Add a host harness for make run-smp-process-scale that runs the same workload under -smp 1, -smp 2, and optionally -smp 4, captures raw logs, and reports worker count, CPU count, ticks or cycles, output checksum, and speedup. A single noisy QEMU run is not enough evidence for a scaling claim; keep raw repeated-run artifacts for review. tools/qemu-smp-process-scale-harness.sh builds/uses capos-smp-process-scale.iso, stores serial logs under target/smp-process-scale/<timestamp>/, defaults to five repetitions, reports per-case medians, and enforces the 1.6x 1-to-2 median threshold only when KVM-backed evidence is available.
  • Treat near-linear 1-to-2 CPU speedup as the first publishable target. Use a threshold high enough to reject accidental concurrency illusions but low enough for QEMU/KVM variance, for example at least 1.6x median speedup over repeated runs. Record the exact threshold in the harness when this milestone is selected for implementation.

make run-smp-process-scale Proof Contract

This target is the acceptance test for Multi-Process SMP Concurrency. It must stay narrower than the later in-process threading milestone: one process ring per worker process, no sibling threads in the timed section, no shared ParkSpace words, no IPC throughput loop, and no completion-ring demux claim.

The first implementation should add:

  • a focused system-smp-process-scale.cue manifest;
  • a coordinator binary that receives the manifest-granted ProcessSpawner, spawns a fixed set of worker process cases, waits for each child, verifies aggregate results, and prints the compact result lines;
  • a worker binary or a small family of worker binaries that execute one static partition of the deterministic workload and report only their final result through a parent endpoint or other existing spawn-result path after the timed section finishes;
  • a tools/qemu-smp-process-scale-harness.sh host harness wired to make run-smp-process-scale.

The workload should be segmented prime counting over generated integer ranges. Each run case divides the same total range into workers contiguous segments. Worker i handles segment i without terminal output, IPC calls, heap-heavy allocation, or capability operations in the timed region. The coordinator collects one post-compute result per worker and verifies the aggregate prime count plus a stable checksum or hash against known constants before it accepts timing evidence.

The guest must print one line per accepted run case in this shape:

[smp-process-scale] cpus=<n> workers=<n> range=<lo>..<hi> primes=<count> checksum=<hex> elapsed=<ticks-or-cycles> verified=true

The exact time source can be monotonic ticks or a cycle counter, but it must be an in-guest measurement that brackets only the worker-process computation after spawn/setup and before serial reporting. If timer granularity makes the proof too noisy, increase the total range instead of measuring host wall time as the primary signal. Host wall time may be reported as secondary harness metadata.

The host harness policy is:

  • default to CAPOS_SMP_SCALE_RUNS=5 complete repetitions per CPU-count case;
  • run and report the advertised 1/2/4-worker timing cases. At minimum that means -smp 1/one worker, -smp 2/two workers, and a 4-worker timing case; the preferred 4-worker case is -smp 4 when the local QEMU/KVM host exposes four usable vCPUs, otherwise the harness must still report the 4-worker case under the largest available SMP count and mark why a 4-vCPU run was not collected;
  • require KVM for a speedup claim. If /dev/kvm or QEMU KVM acceleration is unavailable, the target may run a functional verification mode, but it must report that publishable speedup evidence was not collected;
  • keep raw serial and terminal logs under a stable target/ subdirectory such as target/smp-process-scale/<timestamp>/;
  • summarize the median verified elapsed value for each case and require at least 1.6x median speedup from the -smp 1/one-worker baseline to the -smp 2/two-worker case before accepting the near-linear 1-to-2 speedup claim;
  • rerun the ordinary manifest, ring, thread, park, and process-exit smokes under -smp 2 before marking the selected milestone complete.

As of commit 3fb89923 (2026-04-30 09:45 UTC), the focused manifest, process-scale demo, and host-side harness wiring produce passing default repeated KVM-backed speedup evidence. The accepted run in target/smp-process-scale/cycle-balanced-default/ recorded medians smp1=1693, smp2=1053, smp4=2314, or 1.608x, satisfying the required 1.6x threshold. The worker-reported elapsed value is a scaled user-mode cycle count, and the static worker ranges are contiguous but cost-balanced for the prime-counting loop. The ordinary -smp 2 smoke gate also passed: target/smp2-smokes/run-smoke.log covers the default manifest smoke, and target/smp2-smokes/run-spawn.log covers endpoint roundtrip, ring-reserved opcodes, timer/runtime children, thread lifecycle, park cleanup, generic child waits, and process exit. The Multi-Process SMP Concurrency milestone is complete. The harness fails closed when the focused manifest, ISO, expected compact proof lines, or speedup evidence are unavailable instead of fabricating timing evidence.

tools/linux-smp-process-scale-baseline.sh is the reference-OS comparison for this proof. It builds a tiny static Linux initramfs that runs the same forked, deterministic prime-counting workload under the same QEMU/KVM CPU and memory envelope, records raw logs under target/linux-smp-process-scale/, and uses the same default five-run median policy. The script defaults now match capOS’ balanced contiguous splits; rerun the Linux comparison before publishing a new OS-comparison table for the accepted capOS evidence.

The process-scale harnesses also expose an opt-in smp8-smt diagnostic through CAPOS_SMP_SCALE_INCLUDE_SMT=1 and LINUX_SMP_SCALE_INCLUDE_SMT=1. It uses the same range and aggregate verifier with eight contiguous ranges and is collected only when the host reports at least eight logical CPUs. This case is for SMT behavior on 4-core/8-thread hosts; it must not be treated as 8-core evidence or included in the accepted 1-to-2 speedup gate.

The proof must not depend on KVM paravirtual APIC, IPI, or TLB-flush features. The current architectural xAPIC MMIO LAPIC timer/IPI path remains the correctness surface; paravirtual APIC acceleration is future performance work.

Before the scheduler implementation branch claims this target, review the non-blocked findings that could invalidate the evidence:

  • panic-surface hardening for guarded unwraps, stale queues, blocking waits, process/thread exit, endpoint cancellation, and rollback restoration paths touched by scheduler ownership changes;
  • quota/exhaustion behavior for the child-process, process-handle, outstanding call, scratch, frame, and invalid-SQE paths used by the coordinator and workers;
  • release/revoke epoch behavior only for capabilities the demo actually grants.

Findings unrelated to this proof, such as DMA provenance, shared ParkSpace unmap/reuse, or same-process per-thread ring routing, should stay tracked in the migrated review-finding task records but must not be represented as blockers for independent worker-process SMP scaling.

SMP Review-Finding Reconciliation

This section classifies the review-finding task records for the selected multi-process SMP proof. It does not close those findings; it defines what the next scheduler and harness branches must satisfy before they can depend on the paths involved in the proof.

Blocking or proof-invalidating for this milestone:

  • Scheduler panic surfaces touched by ownership changes. A branch that changes scheduler ownership, per-CPU queues, idle-to-runnable handoff, or process/thread exit cleanup must audit and either harden or explicitly test the relevant docs/panic-surface-inventory.md scheduler rows: block_current_on_cap_enter, capos_block_current_syscall, stale run-queue process references, exit_current, current_ring_and_caps, scheduler start, and context-restore CR3 assumptions. The branch should add targeted host or QEMU coverage for each panic surface it claims to close.
  • Process/resource exhaustion on paths used by the coordinator. The proof depends on ProcessSpawner, ProcessHandle.wait, result-cap adoption, and likely a parent endpoint or equivalent post-compute result path. Those paths must keep controlled failures for cap-slot exhaustion, process-handle exhaustion, endpoint queue pressure, scratch/result-buffer pressure, outstanding call pressure, and frame-grant/frame-exhaustion pressure from loading worker ELF pages, stacks, and TLS. Existing endpoint pending-RECV and queued-CALL overload coverage can be reused, but new coordinator-specific resource pressure introduced by the demo needs matching coverage before the proof is used as milestone evidence.
  • Runtime invalid-SQE flood handling if the harness exercises malformed submissions. The process-scaling demo should not need malformed SQEs. If a future scheduler or harness branch adds invalid-submission stress to this target, it inherits whatever invalid-submission review-finding task records remain open at that time. Runtime flood handling and log/rate-limit suppression should be evaluated separately because active remediation may close one without closing the other. Otherwise invalid-submission remediation remains a separate track and should not block the pure scaling proof.

Guardrails that must be preserved but are not standalone blockers for the independent worker-process proof:

  • Explicit revoke/epoch tests. The demo should use only the capabilities needed to spawn workers and collect their final results. It must not claim peer revocation, stale session rejection, or object-epoch behavior unless it grants revocable/session-sensitive authority and adds flow-specific revoke or expiry tests.
  • ParkSpace unmap/reuse enforcement. Independent worker processes should avoid shared ParkSpace words in the timed workload. The ordinary park smoke still has to pass under -smp 2 before milestone completion.
  • Process-wide capability ring constraint. The proof remains valid only because each worker has its own process ring and the timed section avoids ring traffic. It must not be cited as evidence for same-process sibling thread scalability, per-thread completion routing, or Ring v2.
  • Raw evidence retention. Local repeated KVM logs are enough for this development milestone, but production/reproducibility claims remain governed by the provenance finding. Keep raw target/smp-process-scale/<timestamp>/ artifacts for review and avoid implying third-party reproducibility.

Out of scope for this milestone unless a branch expands the demo surface:

  • DMA owner state, generation-checked DMA/MMIO/IRQ handles, stale interrupt proofs, and DMA ResourceLedger/OOM implementation;
  • shared ParkSpace unmap/reuse beyond preserving existing park smokes;
  • same-process thread creation, join, TLS, per-thread rings, and Ring v2 completion routing.

In-Process Threading Scalability Gates

  • Define the per-thread capability-ring/completion-routing contract needed before same-process sibling threads can claim independent scaling. Completed 2026-04-30 10:19 UTC in docs/proposals/ring-v2-smp-proposal.md: the first Ring v2 slice uses kernel-chosen child-thread ring mappings, a shared RingEndpoint record for initial and child rings, and ThreadRef -> RingEndpoint as the routing model.
  • Move capability-ring waiting/completion routing to the per-thread ThreadRef model before claiming same-process sibling threads scale independently on different CPUs. Endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths must all route through the target thread’s RingEndpoint before same-process scaling can be claimed. Completed through the Ring v2/thread-scale substrate: spawned child threads receive independent ring endpoints, and local/controlled thread-scale evidence verifies child rings.
  • Ensure thread creation, FS/TLS setup, thread exit, join, park waits, and process exit remain generation-checked and safe when sibling threads can be resident on different CPUs. Completed through the reviewed thread-scale implementation and the closeout run-smp2-smokes pass.
  • Add an in-process thread scaling demo that uses the same class of deterministic CPU-bound workload as the multi-process proof, but splits work across sibling threads in one process. Prefer fixed-size parallel hashing/checksum chunks over prime counting for this milestone: equal-byte chunks have much more uniform work than trial division over increasing integer ranges, still keep the timed region syscall-free, and verify through one deterministic root hash. Print one compact result line per run. Completed with the demos/thread-scale proof and reusable demos/thread-scale-workload crate.
  • Add a host harness for make run-thread-scale that runs 1/2/4-thread cases under matching QEMU CPU counts, captures raw logs, and rejects results until the verified median speedup reaches the accepted threshold. Completed 2026-05-01 14:58 UTC after benchmark repair: the harness enforces KVM-backed 1-to-2 work and total thresholds when requested, carries parent_wait and work_rounds through CSV metadata, and the repaired blocking-parent 16 MiB/64-round run passed both enforced physical-core gates. 2026-04-30 12:34 UTC functional checkpoint: this branch adds the same-process demo and QEMU harness as diagnostic evidence only. The harness retains raw serial logs under target/thread-scale/<timestamp>/, parses exactly one verified [thread-scale] line per 1/2/4-thread case, and reports median elapsed values plus diagnostic speedups. Focused phase diagnostics now add guest cycle fields for spawn_ready, work, shutdown, and total to separate thread creation/ready time, the syscall-free workload window, and thread exit/join time. elapsed remains the workload value and is an alias of work, so harness speedup calculations continue to use only the timed workload. The retained artifacts are raw QEMU serial/terminal/stdout/stderr logs plus results.csv and summary.log. Host-side QEMU profiling is opt-in through CAPOS_THREAD_SCALE_PROFILE=1; it requires perf and stores perf.data, perf.script, perf.report.txt, and profile-command.txt plus qemu.status in each case-run artifact directory. These are host samples of the QEMU process and the preserved workload exit status, not guest symbol attribution by themselves, so the guest phase counters remain the default diagnostic. Guest-side kernel measurement is separately opt-in through CAPOS_THREAD_SCALE_GUEST_MEASURE=1; it rebuilds the thread-scale ISO with the benchmark-only kernel measure feature and retains release symbols for that benchmark build only. It writes the kernel measure: segment summaries from each case-run serial log to that case-run’s measure.log and records the per-case userspace symbol map path in results.csv under guest_symbol_map. It also writes a user-pc-symbols.log report beside each measure.log and records that path under user_pc_symbol_report; the report maps aggregate and per-phase user_pc_samples exact-RIP buckets to the nearest userspace symbol address not greater than the PC. Those segment counters cover scheduler choice, schedule save/requeue, timer and park wake paths, cap-wait scans, thread exit/join cleanup, and process exit/drop cleanup. First-slice shared-kernel contention counters now add aggregate and per-phase shared_kernel_lock lines for frame allocator alloc/free lock acquisitions, contention, and spin loops, plus the ring-dispatch cap-table and ring-scratch locks before cap::ring::process_ring. Follow-up counters also cover endpoint inner queue locks, endpoint cancellation scratch locks, and all direct per-process address-space lock sites. Heap attribution now routes the global allocator mutex through SharedKernelLock::Heap in measure builds; one-run guest-measure evidence recorded zero timed-work-phase heap acquisitions for the syscall-free benchmark and nonzero spawn/shutdown allocator activity. These remain benchmark-only measure attribution and do not close the broader shared-service contention finding. Fresh result rows now explicitly classify the benchmark hot section as syscall-free CPU work with ring and allocator activity limited to setup/shutdown, no endpoint or network activity, and result-only logging. The harness requires those benchmark-class fields for new QEMU parses, validates the expected values for this benchmark, carries them into results.csv, and keeps summary-only replay tolerant of legacy CSV files that predate the class columns. Local one-run evidence is retained in target/thread-scale/20260501T083254Z/. Network/polling attribution now adds aggregate and per-phase measure: network_poll lines for initialized virtio-net scheduler, runtime, and interface polling; the built-in TCP HTTP proof poll; virtqueue poll spins and completions; and pending network waiter scans. The guest-measure harness requires those lines. Local one-run evidence in target/thread-scale/20260501T093505Z/ passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. For this CPU-bound benchmark they are zero-evidence guardrails, not service-throughput proof and not milestone acceptance. The symbol map and resolved report are benchmark-only nearest-symbol attribution aids for interpreting raw user_pc_samples buckets, not line-level profiling, a complete guest profiler, or normal-build guest attribution. These diagnostics are for reviewers, not speedup acceptance. The guest result line deliberately prints accepted=false as diagnostic guest-side state. Host acceptance is a separate summary decision: CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 requires KVM-backed evidence and the configured 1-to-2 median work/elapsed speedup threshold, but it does not fail merely because parsed guest rows carry accepted=false. The total-case summary gate is separate and opt-in: CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 requires KVM-backed evidence and the configured CAPOS_THREAD_SCALE_TOTAL_SPEEDUP_THRESHOLD against the 1-to-2 median total speedup. It is also supported by summary-only replay and is not enforced by default. capos-bench diagnostic run target/thread-scale/capos-bench-thread-20260430T125613Z/ used n2-highcpu-8 KVM with QEMU pinned to physical CPUs 0-3 for five runs per case. Median elapsed cycles were thread1 56244112, thread2 84429072, and thread4 140666438; diagnostic speedups were thread1-to-thread2 0.666x and thread1-to-thread4 0.400x, with all rows still accepted=false. After phase diagnostics landed, capos-bench run target/thread-scale/capos-bench-phase-20260430T134301Z/ used the same pinned physical CPU set and recorded five-run medians: thread1 elapsed/work=56285136, spawn_ready=43054612, shutdown=57693626, total=157008630; thread2 84432724, 76247932, 142200058, 303096216; thread4 140768008, 205527230, 395434364, 741943554. The phase output shows shutdown/join cost increasing sharply with worker count, but all rows still remain accepted=false. After child-ring endpoints and the optional SMT8 diagnostic landed, capos-bench physical-core run target/thread-scale/20260430T151909Z/ recorded five-run medians pinned to logical CPUs 0-3: thread1 elapsed/work=56215128, spawn_ready=41692656, shutdown=57753172, total=155536564; thread2 84420848, 74791942, 142065130, 301170274; thread4 140697028, 143691606, 395397620, 679786606. Final SMT diagnostic run target/thread-scale/capos-bench-final-smt8-20260430T154058Z/ at commit 19f2fc66 used logical CPUs 0-7 and recorded medians: thread1 56272620, 54277322, 57824172, 168448508; thread2 84343990, 72757730, 142229724, 299693446; thread4 140992614, 144614212, 396264522, 681167764; thread8 253352976, 290422132, 1239856304, 1786188514. All rows remain accepted=false, and thread8 is informational SMT evidence only. Scheduler-unpin final diagnostic run target/thread-scale/scheduler-unpin-final2-20260430T160700Z/ removed the scheduler’s transient same-pid pinning and verified 1/2/4-thread cases without the child-ring map/unmap TLB shootdown panics seen during this slice. One-run medians were thread1 elapsed/work=56293734, spawn_ready=39202342, shutdown=34848540, total=130344694; thread2 57101752, 95921604, 69869786, 222894030; thread4 274828354, 275826356, 407818252, 958473044. Diagnostic speedups were thread1-to-thread2 0.986x and thread1-to-thread4 0.205x; all rows remain accepted=false. Follow-up local checks passed make run-smp2-smokes in target/smp2-smokes/20260430T160936Z/ and reran three thread-scale samples in target/thread-scale/scheduler-unpin-rerun-20260430T161104Z/. That rerun kept correctness intact but recorded thread4 902520658 cycles under local oversubscription, so it remains diagnostic only. After guest-side measurement landed, capos-bench runs at commit a5c4f789 recorded five-run medians with QEMU pinned to host logical CPUs 0-3, which map to distinct physical cores on that host: thread1 56341030, thread2 56166300, thread4 70122044 (1.003x, 0.803x). The SMT diagnostic pinned to logical CPUs 0-7 recorded medians thread1 56315082, thread2 56233080, thread4 62630052, thread8 125488946 (1.001x, 0.899x, 0.449x). The one-run guest-measure pass in target/thread-scale/20260430T182824Z/ recorded per-case measure.log files. Top measured guest-side cycle totals were ring_processing and method_body, with sched_choose_next and thread_exit_join_cleanup growing at higher thread counts. A follow-up local phase-aware guest-measure pass in target/thread-scale/20260430T184532Z/ verified that each case measure.log now includes final-summary measure: checkpoint and measure: phase attribution for spawn_ready, work, shutdown, and final_total; the harness rejects guest-measure runs missing any of those phase summaries. These runs remain diagnostic and accepted=false. After phase-aware guest measurement landed on main at commit da92ed42, capos-bench reran the diagnostic with QEMU pinned to host logical CPUs 0-3, which map to distinct physical cores on that host. Run target/thread-scale/capos-bench-phase-main-20260430T191146Z/ recorded five-run medians: thread1 elapsed/work=56242252, spawn_ready=38789562, shutdown=34859130, total=130093430; thread2 56233998, 91718518, 61923280, 205126974; thread4 62926552, 109723566, 119015960, 297970796. SMT diagnostic run target/thread-scale/capos-bench-phase-smt8-main-20260430T191408Z/ pinned QEMU to logical CPUs 0-7 and recorded medians: thread1 56198166, 41134070, 34781494, 132161420; thread2 56196302, 42453050, 63546086, 162449504; thread4 62361512, 87093620, 109458814, 258043804; thread8 125378372, 249877254, 528656458, 904149404. A one-run host-profile plus guest-measure sample in target/thread-scale/capos-bench-profile-phase-main-20260430T191703Z/ used temporary host perf access with QEMU pinned to logical CPUs 0-3, then restored kernel.perf_event_paranoid=4. The host reports still show QEMU/KVM execution, ioctl, QEMU mutexes, and MMIO/read helpers near the top; guest phase counters show no ring dispatches in the measured work phase, while shutdown/join and scheduler choice costs grow with worker count. These results remain diagnostic and accepted=false. Artifact content verification after collection checked summary.log and results.csv for the two five-run diagnostics and the one-run profile sample, plus the profile sample’s measure.log and perf.report.txt, against the recorded medians, pinning, accepted=false status, guest phase claims, and host-profile claims. Join-cleanup optimization follow-up on branch workplan/thread-scale-join-cleanup adds per-thread pending join-waiter accounting so exiting worker threads that never blocked in ThreadHandle.join skip the thread-handle waiter scan. Local evidence: target/thread-scale/join-cleanup-local-20260430T193657Z/ passed functional guest-measure verification, and target/thread-scale-join-cleanup-run-spawn.log passed make run-spawn; local timing remains diagnostic because the host was not a controlled benchmark environment. Controlled capos-bench reruns for this branch kept all rows accepted=false: physical-core run target/thread-scale/capos-bench-join-cleanup-20260430T194536Z/ recorded medians thread1 56173118, thread2 56166224, thread4 62070170 (1.000x, 0.905x), and SMT diagnostic target/thread-scale/capos-bench-join-cleanup-smt8-20260430T194734Z/ recorded medians thread1 56251116, thread2 56197306, thread4 62519276, thread8 122089762 (1.001x, 0.900x, 0.461x). Scheduler-choice cleanup follow-up on branch workplan/thread-scale-scheduler-choice removes a redundant blocked-thread scan from the idle fallback in choose_next_locked. Local functional evidence: target/thread-scale/scheduler-choice-local-20260430T200257Z/ passed guest-measure verification. Controlled capos-bench run target/thread-scale/capos-bench-scheduler-choice-20260430T201041Z/ recorded medians thread1 56171526, thread2 56301462, thread4 62433702 (0.998x, 0.900x), so the cleanup does not close the milestone. The immediate review-finding note that the scheduler still had a two-CPU owner mask is addressed by raising the temporary scheduler-owned CPU slot count and wake mask to four, so the 4-thread diagnostic can exercise four scheduler owners. This is only a blocker-removal step. The open attribution, serial/logging, scheduler-lock counter, workload-baseline, and per-CPU run-queue findings in the migrated review-finding task records remain required before accepting a speedup claim. Initial local build gates passed. The first make run-smp2-smokes attempt in target/smp2-smokes/four-scheduler-cpus-20260430T202129Z/ exposed an early boot failure after the enlarged static scheduler value crossed a fragile initialization path. The implementation now uses a capacity-reserved deferred process-drop queue instead of embedding one Process slot per scheduler CPU in the Scheduler static. Bounded run-spawn smoke evidence passed in target/smp2-smokes/four-scheduler-cpus-spawn-pending-vec-20260430T203055Z/. Full make run-smp2-smokes passed in target/smp2-smokes/four-scheduler-cpus-full-20260430T203214Z/. Local thread-scale guest-measure verification passed in target/thread-scale/four-scheduler-cpus-local-20260430T203356Z/ with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and cases through -smp 4; local timing remains noisy and is not controlled speedup evidence. Controlled capos-bench runs then verified the effect on the benchmark host. Physical-core run target/thread-scale/capos-bench-four-scheduler-cpus-20260430T203733Z/ used QEMU pinned to logical CPUs 0-3, recorded medians thread1 56144884, thread2 56190496, thread4 36386164 (0.999x, 1.543x), and kernel logs show AP scheduler owners on CPUs 1-3 starting benchmark threads. SMT diagnostic target/thread-scale/capos-bench-four-scheduler-cpus-smt8-20260430T203945Z/ used logical CPUs 0-7, recorded medians thread1 56181720, thread2 56191504, thread4 56213928, thread8 116270280 (1.000x, 0.999x, 0.483x). Both rows remain accepted=false; the physical 4-thread speedup is close to but below the 1.6x threshold, and the SMT8 row is informational because the scheduler owner mask remains four CPUs. Scheduler-attribution follow-up branch workplan/thread-scale-scheduler-attribution adds guest-side total and per-phase scheduler counters for direct-target, run-queue, and idle candidate classes; runnable/retry/drop outcomes; and reschedule IPI target/sent/skipped/failure counts. Local functional verification in target/thread-scale/scheduler-attribution-local-20260430T210322Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1; the shell wrapper reported failure only because it reused zsh’s read-only status parameter after the harness had already written a successful summary.log. The 4-thread work phase now records scheduler retry pressure (55 run-queue candidate checks, 7 idle candidate checks, 28 runnable outcomes, and 34 retry outcomes) while still recording zero ring dispatches. This materially improves attribution but does not close the broader scheduler-lock, serial, CR3/TLB, guest-symbol, or workload-baseline requirements in the migrated review-finding task records. Serial-attribution follow-up adds guest-side total and per-phase serial byte counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Bytes are counted after LF-to-CRLF expansion and after a UART byte is emitted, including emergency writes in measure kernels. Local functional verification in target/thread-scale/serial-attribution-local-20260430T212243Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1 and QEMU pinned to local CPUs 0-1; the stricter harness now requires aggregate and per-phase serial lines. The run recorded total serial bytes of 4161, 4788, and 6295; work-phase serial bytes stayed at 74 in each case, while shutdown serial bytes rose from 70 to 145 to 631. This closes the serial-byte counter blind spot, but it does not close scheduler-lock, CR3/TLB, guest-symbol, workload-baseline, or logging-suppression A/B requirements in the migrated review-finding task records. Scheduler-lock attribution follow-up adds guest-side total and per-phase global scheduler-lock counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It records acquisitions, contended acquisitions, try-lock failures as spin_loops, contended wait cycles, and hold cycles. Local functional verification in target/thread-scale/lock-attribution-local-20260430T214854Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1 and QEMU pinned to local CPUs 0-1; the stricter harness now requires aggregate and per-phase scheduler-lock lines. The local 4-thread final-total counters were 234 acquisitions, 104 contended acquisitions, 2,161,691 spin loops, 1,239,033,542 wait cycles, and 570,372,812 hold cycles; the 4-thread work phase still had 15 acquisitions, 5 contended acquisitions, 95,047 spin loops, 37,181,792 wait cycles, and 32,762,392 hold cycles. This closes the first scheduler-lock counter blind spot; hold cycles include measure acquisition-counter update overhead and exclude release-counter update and unlock overhead, so they are first-pass attribution rather than exact critical-section time. At that point, CR3/TLB, guest-symbol, workload-baseline, logging-suppression A/B, and controlled benchmark-host confirmation requirements in migrated review-finding task records remained open; timer tick count attribution was queued for the follow-up recorded below. Controlled capos-bench reruns after this landed on main at commit 6eff7ae4 used QEMU pinned to logical CPUs 0-3 for physical-core evidence and 0-7 for the informational SMT diagnostic. Physical-core run target/thread-scale/capos-bench-lock-main-physical-20260430T220944Z/ recorded medians thread1 56309194, thread2 56302666, thread4 28301916 (1.000x, 1.990x); SMT diagnostic target/thread-scale/capos-bench-lock-main-smt8-20260430T221246Z/ recorded medians thread1 56379514, thread2 56186566, thread4 28259776, thread8 131264324 (1.003x, 1.995x, 0.430x). A one-run guest-measure confirmation in target/thread-scale/capos-bench-lock-main-measure-20260430T221543Z/ verified scheduler, serial, and scheduler-lock lines on the benchmark host. Host perf profiling was not collected because perf_event_paranoid=4 blocked unprivileged perf on the restarted VM. Timer-attribution follow-up on branch workplan/thread-scale-timer-attribution adds guest-side total and per-phase timer counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1, distinguishing user-mode timer interrupts entering the scheduler path from kernel-mode timer interrupts that only advance time and EOI, with separate BSP tick-advance counts. The harness now requires aggregate and per-phase timer lines. Local functional verification in target/thread-scale/timer-attribution-local-20260430T223441Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and guest measurement enabled. Aggregate timer counters were 7/7/0/7, 25/17/8/9, and 132/101/31/23 (interrupts/user_scheduler/kernel_only/bsp_tick_advances); the 4-thread work phase recorded 7/7/0/1. The remaining attribution requirements at that point were CR3/TLB, guest-symbol or guest-PC sampling, workload-baseline, and logging-suppression A/B evidence. CR3/TLB-attribution follow-up on branch workplan/thread-scale-tlb-attribution adds guest-side total and per-phase TLB counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1, covering runtime CR3 writes, pending-flush checks, pending full TLB flushes, remote shootdown requests, target CPUs, shootdown IPIs, and deferred completion drains. The harness now requires aggregate and per-phase TLB lines. Local functional verification in target/thread-scale/tlb-attribution-local-20260430T225628Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and guest measurement enabled. Aggregate TLB counters were 3/28/0/0/0/0/0, 7/52/3/3/3/3/2, and 14/139/17/7/17/17/4 (cr3_writes/pending_flush_checks/pending_flush_all/shootdown_requests/shootdown_target_cpus/shootdown_ipis/deferred_completion_drains); the 4-thread work phase recorded 0/10/0/0/0/0/0. The remaining attribution requirements at that point were guest-symbol or guest-PC sampling, workload-baseline evidence, and logging-suppression A/B evidence. Logging-suppression A/B follow-up adds CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1 to make run-thread-scale. The knob suppresses scheduler transition diagnostics in the benchmark kernel while preserving proof, error, and measurement output. Local one-run A/B verification with CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1 produced artifacts in target/thread-scale/logging-ab-baseline-local-20260430T231800Z/ and target/thread-scale/logging-ab-suppressed-local-20260430T232600Z/. Targeted scheduler diagnostic line counts dropped from 7/12/18 to 0/0/0 for the 1/2/4-thread cases, and aggregate serial bytes dropped from 4161/4743/5889 to 3894/4280/5047. This closes only the logging A/B blind spot; guest-symbol or guest-PC sampling and workload/cacheline baseline evidence remained open. Linux pthread baseline follow-up adds make run-linux-thread-scale-baseline for the exact fixed-size thread-scale checksum workload. Controlled native capos-bench runs at commit 370ce145 with taskset pinned to physical-core logical CPUs 0-3 recorded padded-slot capOS-shaped work-window medians of 306776, 152293, and 1120024 ns for 1/2/4 workers (2.014x, 0.274x). Compact-slot medians were similar at 316388, 152291, and 1123534 ns (2.078x, 0.282x), so result-slot false sharing is not the visible differentiator for the current workload shape. The SMT diagnostic pinned to 0-7 recorded padded work medians 303877, 155565, 170019, and 243481 ns for 1/2/4/8 workers (1.953x, 1.787x, 1.248x). The exact baseline shows the one-megabyte workload and coordinator spin window are not a clean four-core linear-scaling reference. This closes the exact Linux pthread baseline and result-slot padding blind spots only; guest-symbol or guest-PC sampling and larger-workload/Amdahl- sensitivity evidence remain open. Benchmark repair follow-up completed 2026-05-01 14:58 UTC: the default host baselines now use blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64 instead of the old 1 MiB/spinning-parent shape. Controlled Linux evidence on the selected physical CPU set recorded 1-to-2 work/total speedups 1.991x/1.990x and 1-to-4 work/total speedups 3.958x/3.834x, proving the repaired benchmark shape can scale on the host before capOS results are interpreted as scheduler evidence. Guest-PC sampling follow-up adds a measure-only exact-RIP histogram for user-mode timer interrupts while a thread-scale case is active. The harness now requires aggregate and per-phase user_pc_samples lines for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run verification in target/thread-scale/guest-pc-sampling-local-20260501T001500Z/ used CAPOS_THREAD_SCALE_RUNS=1 with QEMU pinned to local CPUs 0-1 and passed all 1/2/4-thread cases. Aggregate PC sample counts were 6, 17, and 55 with zero overflow; the 4-thread phase counts were spawn-ready 13, work 9, shutdown 33, and final-total 55. This closes the guest-PC sampling blind spot only; the later symbol-map harness slice preserves a benchmark-only userspace map for interpreting those raw PC buckets, and larger-workload Amdahl-sensitivity evidence remained open until the follow-up below. Resolved PC attribution report follow-up completed 2026-05-01 06:13 UTC on branch workplan/thread-scale-pc-symbol-report: guest-measure case-runs now write user-pc-symbols.log beside measure.log and record it in results.csv under user_pc_symbol_report. Local verification in target/thread-scale/20260501T060822Z/ used CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1; the thread4 report resolves sampled PCs to worker_entry, run_case, and RingClient::wait nearest symbols and keeps PCs below the first symbol as explicit <unmapped> rows. Larger-workload/Amdahl follow-up adds CAPOS_THREAD_SCALE_TOTAL_BLOCKS and LINUX_THREAD_SCALE_TOTAL_BLOCKS so the same deterministic checksum workload can run beyond the default one-megabyte case. Controlled capos-bench runs at commit 32c066b8 used 1,048,576 blocks (64 MiB). With QEMU pinned to physical-core logical CPUs 0-3, capOS work medians were 112590712, 112511206, and 36369098 cycles for 1/2/4 workers (1.001x, 3.096x), while total medians were 189204910, 218898002, and 205640850 cycles (0.864x, 0.920x). The matching native Linux physical-core baseline recorded work medians 17766664, 8961256, and 7442107 ns (1.983x, 2.387x) and total medians 17883289, 9094596, and 10090354 ns (1.966x, 1.772x). SMT diagnostic rows pinned to 0-7 recorded capOS 1/2/4/8-worker work speedups of 1.002x, 2.870x, and 0.644x and Linux speedups of 1.993x, 2.458x, and 2.658x. Raw artifacts are under target/thread-scale/amdahl-1048576-physical-20260501T003700Z/, target/thread-scale/amdahl-1048576-smt8-20260501T004200Z/, target/linux-thread-scale/amdahl-1048576-physical-20260501T003400Z/, and target/linux-thread-scale/amdahl-1048576-smt8-20260501T004000Z/. This closes the larger-workload evidence blind spot, but the milestone remains open because 1-to-2 work scaling is flat and total-case scaling remains below 1x for 2/4 workers. The guest rows still carry diagnostic accepted=false; host-summary acceptance remains gated by KVM evidence and the configured 1-to-2 median work and opt-in total thresholds. Guest-measure runs now preserve the benchmark-only userspace symbol map needed to interpret raw PC buckets after collection. Post-threshold-policy capos-bench reruns at main commit f198b099 verified the host-summary total-speedup fields while keeping the milestone open. Physical-core pinning 0-3 recorded work speedups 1.002x and 1.002x plus total speedups 0.911x and 0.601x for 2/4 workers in target/thread-scale/total-threshold-main-physical-20260501T065028Z/. SMT diagnostic pinning 0-7 recorded 1/2/4/8 work speedups 1.001x, 0.998x, and 0.333x plus total speedups 0.913x, 0.621x, and 0.200x in target/thread-scale/total-threshold-main-smt8-20260501T065443Z/. Scheduler-lock site attribution follow-up completed 2026-05-01 09:52 UTC: guest-measure kernels keep the existing aggregate measure: scheduler_lock line and add aggregate plus per-phase measure: scheduler_lock_site counters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. The harness requires those lines for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run evidence in target/thread-scale/20260501T100202Z/ verified the new lines and still reported accepted=false with 1-to-2/1-to-4 work speedups 0.998x and 1.001x and total speedups 0.921x and 0.509x. This is bounded split-prep attribution for the known global scheduler-lock bottleneck, not speedup evidence; the later caller-aware placement closeout above is the controlled evidence that passed the work and total gates.
  • Record aggregate same-process worker placement for make run-thread-scale and fix creation-time local concentration. Completed 2026-05-01 12:37 UTC: guest-measure output recorded aggregate publish, selected-CPU, first-selected CPU, and migration buckets for CPU slots 0-3. Newly created non-single-owner threads were published to the least-loaded active scheduler CPU slot, while single-owner capability pinning, generation checks, direct-IPC preference, and allocation-free timer/unblock paths were preserved. This aggregate evidence proved the 4-worker first-selected distribution reached all four scheduler CPU slots, but it was not per-worker identity tracking and it was not speedup evidence. (Update 2026-05-02: the publish counters and the caller-aware placement chain were retired with the per-CPU run-queue collapse; make run-thread-scale and the kernel measure printer no longer emit the publish__cpu / publish_caller_* fields. Selected-CPU, first-selected CPU, and migration buckets remain. Per-CPU placement evidence returns with the fair-share enqueue policy that Phase D will own.)
  • If later attribution needs individual worker histories, add per-worker placement output for first scheduled CPU, latest scheduled CPU, migration count, and runnable-owner distribution without replacing the aggregate counters used by the thread-scale harness.
  • Treat same-process speedup as a separate claim from multi-process SMP concurrency. Passing make run-smp-process-scale must not imply this milestone is complete. Completed: same-process speedup was accepted only after make run-thread-scale controlled evidence on the thread-scale harness, separate from the earlier process-scale milestone.
  • Keep the ordinary -smp 2 regression gate repeatable while the thread-scaling implementation evolves. The make run-smp2-smokes target runs the default manifest smoke and the spawn manifest smoke with -smp 2, retaining raw per-target logs under the configured target directory. Closeout evidence passed.

Task Selection

Choose a task that isolates scheduler and CPU parallelism rather than a subsystem bottleneck. Both milestones should use workload shapes with these properties:

  • CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy hot path.
  • Naturally partitionable into independent chunks so workers do not share a lock, mutable buffer, or capability ring while the timed section runs.
  • Verifiable by a compact checksum, count, or known-answer oracle.
  • Long enough to dominate boot, process spawn, timer granularity, and serial logging overhead.
  • Runnable as independent worker processes for the multi-process milestone, and runnable as sibling threads through the per-thread completion-routing model used by the in-process milestone.

Avoid using IPC throughput, capability-ring dispatch, park wake storms, console logging, or allocator stress as the first SMP scaling claim. Those are valid later benchmarks, but they measure shared kernel bottlenecks as much as CPU scheduling. Same-process thread scaling remains a separate milestone because it needs accepted per-thread-ring timing evidence, not only functional sibling execution.

For the in-process milestone, the default workload should be a uniform fixed-size chunk workload such as BLAKE3-style tree hashing, CRC32C over disjoint buffers, or a small native deterministic block-hash loop. The first implementation does not need a cryptographic dependency; it does need fixed-size chunks, per-thread private output slots, and a root checksum that detects missing, duplicated, or reordered chunks. Prime counting remains valid historical evidence for multi-process concurrency, but it is a weaker same-process scaling workload because numeric range cost is not uniform.

Grounding Files

  • docs/proposals/smp-proposal.md
  • docs/proposals/ring-v2-smp-proposal.md
  • docs/architecture/scheduling.md
  • docs/architecture/threading.md
  • docs/research/completion-ring-threading.md
  • docs/research/out-of-kernel-scheduling.md
  • docs/research/sel4.md
  • docs/research/zircon.md
  • docs/research/x2apic-and-virtualization.md

Notes

Initial multi-CPU scheduling may keep the current process ring while the runtime serializes process-ring consumption. Full SMP where sibling threads from one process wait independently on different CPUs should not keep the process-wide CQ as the kernel ABI endpoint. The target transport model is per-thread capability rings: cap_enter(min_complete, timeout_ns) waits on the current thread’s CQ, kernel waiters route completions by generation-checked ThreadRef, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.

SharedParkSpace park-words still need MemoryObject mapping provenance or object pins before shared-key derivation lands.

2026-04-25 11:36 UTC: commit d88bca7 recorded the First AP Scheduler proof. AP cpu=1 can run scheduler-owned user contexts under -smp 2, and a one-way scheduler-owner latch prevents the BSP and AP from both entering scheduler-owned user work while the process-wide ring remains the active transport.

Scheduler Evolution Backlog

This backlog decomposes future scheduler architecture from Scheduler Evolution. It also retains the completed attribution and placement history that closed the In-Process Threading Scalability milestone; new selected-milestone work now continues from docs/tasks/README.md.

Design Grounding Checklist

Before implementation slices, read:

  • docs/architecture/scheduling.md
  • docs/backlog/smp-phase-c.md
  • docs/proposals/smp-proposal.md
  • docs/proposals/ring-v2-smp-proposal.md
  • docs/proposals/tickless-realtime-scheduling-proposal.md
  • docs/proposals/stateful-task-job-graphs-proposal.md
  • docs/proposals/scheduler-evolution-proposal.md
  • docs/proposals/system-performance-benchmarks-proposal.md
  • docs/proposals/hpc-parallel-patterns-proposal.md
  • docs/research/future-scheduler-architecture.md
  • docs/research/nohz-sqpoll-realtime.md
  • docs/research/out-of-kernel-scheduling.md
  • docs/research/completion-ring-threading.md
  • docs/research/hpc-parallel-patterns.md

For realtime or isolation slices, also read:

  • docs/research/multimedia-pipeline-latency.md
  • docs/research/robotics-realtime-control.md
  • docs/research/x2apic-and-virtualization.md

Phase A: Attribution and Guardrails

  • Finish first-pass thread-scale attribution guardrails. Scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, CR3/TLB, raw guest-PC sample, logging-suppression A/B, exact Linux pthread baseline, compact-versus-padded result-slot diagnostic, and larger-workload/Amdahl evidence now exist. The evidence does not identify the primary remaining non-scaling cause; it keeps per-CPU runnable ownership, accepted threshold-passing work/total evidence, and optional symbolic attribution as follow-on work.
  • Add bounded scheduler-lock site attribution before a structural lock split. As of 2026-05-01 09:52 UTC, measure builds keep the compatible aggregate scheduler_lock line and also emit aggregate plus per-phase scheduler_lock_site counters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. This is split-prep attribution only; it does not accept the in-process thread-scale milestone.
  • Add timer-fast-path attribution for the bounded continuation path. As of 2026-05-01 10:58 UTC, measure builds extend the aggregate and per-phase timer counter lines with fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. The thread-scale harness requires those fields only for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. This is attribution only; it does not change scheduler behavior and does not close the current accepted=false work or total gates. Local one-run evidence in target/thread-scale/20260501T110157Z/ passed with the new fields present in every 1/2/4-thread measure.log; the timed work phase recorded fast_path_continues=0 for all three rows.
  • Add timer slow-summary reason attribution for dirty fast-path summaries. As of 2026-05-01 11:28 UTC, measure builds emit aggregate and per-phase timer_slow_summary lines with required/clean counts plus reason fields for nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. The harness requires those lines only for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run evidence in target/thread-scale/20260501T112359Z/ passed with the new lines present in every 1/2/4-thread measure.log; the timed work phase reported dirty summaries attributable to run_queue_nonempty and handoff_current only, with required=2/4/8, clean=0, and timer sleeps/timed waiters at zero for the 1/2/4-thread rows. The subsequent fairness-only behavior slice keeps the same fields, but required now means direct IPC, deferred cleanup, timer sleeps, or timed waiter work still force the next locked timer pass.
  • Complete thread-scale shared-kernel-state contention attribution guardrails beyond the first measure-only lock-counter slice. As of 2026-05-01 08:07 UTC, CAPOS_THREAD_SCALE_GUEST_MEASURE=1 emits aggregate and per-phase shared_kernel_lock counters for frame allocator alloc/free locks, ring-dispatch cap-table and ring-scratch locks before cap::ring::process_ring, endpoint inner/cancellation scratch locks, direct per-process address-space locks, and heap allocator locks. As of 2026-05-01 08:29 UTC, fresh thread-scale rows also carry explicit benchmark-class fields and the harness requires, validates, and exports those fields to results.csv; local one-run evidence is retained in target/thread-scale/20260501T083254Z/. As of 2026-05-01 08:49 UTC, guest-measure runs also emit and require aggregate and per-phase network_poll counters for initialized virtio-net scheduler/runtime/interface polling, the built-in TCP HTTP proof poll, virtqueue poll spins and completions, and pending network waiter scans. Local one-run evidence in target/thread-scale/20260501T093505Z/ passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. Those counters are expected zero-evidence for the CPU-bound thread-scale benchmark. They do not prove service throughput; future service/network benchmarks still need their own hot-section attribution and acceptance evidence.
  • Add a benchmark-kernel mode that suppresses per-context-switch logging during measured cases so serial MMIO cannot masquerade as scheduler cost. Completed with CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1; benchmark proof/error output and measure lines remain enabled.
  • Decide which counters are permanent observability and which stay behind measure. Completed 2026-05-01 04:55 UTC in docs/architecture/scheduling.md: all existing kernel/src/measure.rs counters remain benchmark-only behind the measure feature. Permanent scheduler observability should be added later through a separate low-overhead operator snapshot surface after the Phase C runtime accounting ledger exists, starting with runtime, context-switch, preemption, voluntary-block, migration, queue-depth, reschedule-IPI, TLB-shootdown, and policy admission/denial counts. Phase/cycle attribution, scheduler-lock wait/hold cycles, serial byte attribution, timer/TLB benchmark totals, raw user-PC samples, and thread-scale phase checkpoints stay behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Grounding read: docs/architecture/scheduling.md, docs/proposals/scheduler-evolution-proposal.md, docs/research/future-scheduler-architecture.md, docs/research/out-of-kernel-scheduling.md, docs/research/nohz-sqpoll-realtime.md, and docs/research/completion-ring-threading.md.
  • Record controlled benchmark-VM evidence before and after each scheduler structure change. Latest follow-up after the first Phase C runtime-accounting slice reran the in-process thread-scale diagnostic at main commit a88e7906 with QEMU pinned to physical-core logical CPUs 0-3 and SMT logical CPUs 0-7. All rows remained accepted=false: physical 1/2/4 work speedups were 1.000x and 0.999x, and SMT 1/2/4/8 work speedups were 1.000x, 1.001x, and 0.333x. Follow-up after the total-speedup host-summary gate landed reran current main commit f198b099 on the benchmark VM with QEMU pinned to 0-3 and 0-7. The harness now reports total-speedup diagnostics explicitly: physical 1/2/4 work speedups were 1.002x and 1.002x, total speedups were 0.911x and 0.601x; SMT diagnostic 1/2/4/8 work speedups were 1.001x, 0.998x, and 0.333x, total speedups were 0.913x, 0.621x, and 0.200x. Both host-summary gates remain unsatisfied.

Phase B: Per-CPU Runnable Ownership

  • Land the first bounded per-CPU runnable queue slice. Commit 1a8bf909 replaces the single global scheduler VecDeque with four per-scheduler-CPU FIFO queues under the existing global scheduler lock, centralizes enqueue/requeue/removal helpers, keeps single-owner capability processes on CPU0, prefers local work before bounded stealing, preserves direct IPC preference, and removes stale runnable entries for process/thread exit. Review fixes track live run-queue reservations, reserve all per-CPU queues to that count before publishing a new runnable thread, and release reservations on process/thread exit or pre-publication rollback, keeping timer and unblock requeue paths allocation-free after cross-CPU steals. Verification covered run-spawn, run-smp2-smokes, and controlled benchmark-VM 1/2/4/8-thread diagnostics. The default workload and total-case 64 MiB rows remain accepted=false, so this is structure evidence, not milestone closeout.

  • Finish PerCpuRunQueue ownership invariants as a documented contract. Completed 2026-05-01 02:13 UTC in docs/architecture/scheduling.md: a live generation-checked ThreadRef has at most one runnable dispatch owner across current slots, per-CPU run queues, and the direct IPC target; migration is a scheduler-lock-contained remove-before-publish transfer; local-first stealing is bounded by the scheduler CPU slots; and live run-queue reservations keep timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths allocation-free.

  • Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation. Completed 2026-05-01 04:22 UTC in commit d7221648: Scheduler::processes remains the shared process/thread metadata table, while SchedulerDispatch now owns per-CPU run queues, current and handoff slots, idle slots, the direct IPC target, run-queue reservation count, pending process drops, and pending thread stack releases. The existing global scheduler lock and generation checks are unchanged, and the dispatch split keeps the pre-reserved run-queue capacity model for timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths. Verification passed make fmt-check, cargo build --features qemu, a cached make run-spawn rerun, and make run-smp2-smokes in target/smp2-smokes/20260501T042343Z/. Controlled benchmark-VM timing after merge 56458b12 stayed accepted=false:

    | Pinning | Workers | Work Median | Total Median | Work Speedup | Total Speedup |
    | --- | ---: | ---: | ---: | ---: | ---: |
    | physical `0-3` | 1 | `56275842` | `140953762` | `1.000x` | `1.000x` |
    | physical `0-3` | 2 | `56290542` | `153327094` | `1.000x` | `0.919x` |
    | physical `0-3` | 4 | `56315094` | `237018874` | `0.999x` | `0.595x` |
    | SMT `0-7` | 1 | `56258010` | `140620194` | `1.000x` | `1.000x` |
    | SMT `0-7` | 2 | `56313324` | `153367860` | `0.999x` | `0.917x` |
    | SMT `0-7` | 4 | `56352472` | `237971426` | `0.998x` | `0.591x` |
    | SMT `0-7` | 8 | `169006414` | `727393630` | `0.333x` | `0.193x` |
    
  • Add a bounded timer continuation fast path before a broader scheduler lock split. Completed 2026-05-01 10:29 UTC: user-mode LAPIC timer ticks can continue the current non-idle thread without calling sched::schedule() only when a previous locked timer slow path published a clean hard-work summary, the current CPU is a valid active scheduler slot, no reschedule IPI is pending for that CPU, and the per-CPU one-skip budget is not exhausted. Dirty producers still force at least one locked pass before bypass, but the 2026-05-01 11:40 UTC follow-up lets that pass classify remaining nonempty run queues and handoff-current markers as fairness/protection-only state. Direct IPC targets, deferred termination/drop/stack cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set; ordinary ring SQEs and indefinite cap wait scans are still serviced by forced slow-path ticks. This is a correctness-first split-prep slice, not a replacement for narrower scheduler metadata locks or accepted thread-scale evidence. Controlled benchmark-VM physical-core 0-3 before/after runs for the initial strict-clean version retained accepted=false: baseline target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/ recorded work speedups 0.998x and 0.998x plus total speedups 0.907x and 0.620x; after-change target/thread-scale/timer-fastpath-after-physical-20260501T104700/ recorded work speedups 1.001x and 0.999x plus total speedups 0.909x and 0.602x. Controlled benchmark-VM physical-core 0-3 before/after runs for the fairness-only follow-up stayed accepted=false: baseline target/thread-scale/20260501T120224Z/ recorded work speedups 1.001x and 0.999x plus total speedups 0.913x and 0.587x; after-change target/thread-scale/20260501T120709Z/ recorded work speedups 1.001x and 1.000x plus total speedups 1.125x and 0.828x.

  • Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions. Completed 2026-05-01 03:06 UTC: queued wakeups now target the selected per-scheduler-CPU FIFO owner instead of scanning all idle scheduler CPUs.

  • Add explicit placement evidence and placement policy for newly runnable same-process worker threads. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Measure builds emit aggregate and per-phase thread_placement lines with single-owner publish buckets, normal publish buckets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPU buckets, first-selected CPU buckets, and migration totals/targets for CPU slots 0-3. publish_created_thread() receives the caller thread from ThreadSpawner.create, keeps single-owner processes on CPU0, and avoids the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load. On equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning; if the caller slot is unknown or ineligible, publication falls back to the least-loaded active scheduler CPU behavior. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior.

    The earlier avoid-caller policy passed the old spinning-parent 1-to-2
    gate but failed the repaired blocking-parent shape: before the strict-load
    fix, controlled capOS evidence regressed to 1-to-2 work/total speedups
    `0.886x`/`0.928x` because children were biased onto the non-caller queue
    even when the caller CPU had equal load. The repaired benchmark shape uses
    blocking parent join, 262,144 blocks (16 MiB), and `work_rounds=64`. The
    matching Linux baseline scales on the selected physical CPU set with
    1-to-4 work/total speedups `3.958x`/`3.834x`. Controlled capOS evidence
    on the same CPU set passed the enforced 1-to-2 work/total gates with
    `1.828x`/`1.687x`; the unsuppressed 1-to-4 diagnostic recorded
    `3.029x`/`2.386x`, and scheduler-switch-log-suppressed diagnostics
    recorded `3.272x`/`2.303x`. Remaining four-worker limits are now
    scheduler implementation issues, not benchmark-shape excuses: serial
    switch logging, global `Scheduler` lock contention, total-time
    exit/join/block/schedule overhead, and the temporary four-owner CPU mask.
    
  • Add bounded reschedule IPI behavior for idle-to-runnable transitions. Completed 2026-05-01 03:06 UTC: queued wakeups target at most one queue-owner CPU, direct IPC targets at most one eligible idle scheduler CPU, and measure builds emit wake scan, eligible idle CPU, target, sent, pending-skip, not-ready-skip, missing-target, and failure counters.

  • Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks. Completed 2026-05-01 03:06 UTC: direct IPC still uses the single preference slot when available and falls back to the normal queued owner path when the target cannot run directly.

  • Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue. Completed 2026-05-01 03:14 UTC: process termination, current-process exit, and ThreadControl.exitThread cleanup now assert under the scheduler lock that the exiting process or thread no longer appears in any per-scheduler-CPU FIFO or in the direct IPC target slot. The focused spawn smoke asserts the serial proof markers emitted by the exercised process/thread exit paths.

  • Rerun make run-thread-scale, make run-smp2-smokes, ordinary smoke, spawn/thread, park, ring, and process-exit focused proofs. Completed 2026-05-01 04:18 UTC: local serial reruns passed normal make run-thread-scale in target/thread-scale/scheduler-phaseb-rerun-local-normal-20260501T034800Z/ and make run-smp2-smokes in target/smp2-smokes/20260501T034414Z/. Controlled benchmark-VM reruns at main commit 87be6e25 pinned QEMU to physical-core logical CPUs 0-3 and SMT logical CPUs 0-7; all rows remained accepted=false, so this closes the Phase B rerun-evidence gate but not the selected in-process speedup milestone.

Phase C: CPU Accounting

  • Add monotonic runtime charge points when a running thread leaves the CPU at context switch, preemption, blocking syscall, direct IPC handoff, and thread exit. Completed 2026-05-01 05:08 UTC: running intervals are charged with crate::arch::context::monotonic_ns() when a current thread stops running through timer preemption, blocking cap_enter/ParkSpace, thread/process exit, and direct switch or handoff paths that select the next current thread.
  • Observe blocked runtime stability at unblock without charging non-running time. Completed 2026-05-01 05:08 UTC: unblock paths check the blocked runtime snapshot before making the thread ready.
  • Track per-thread runtime, virtual runtime seed, context switches, preemptions, voluntary blocks, and migrations. Completed 2026-05-01 05:08 UTC: ThreadCpuAccounting is stored on each Thread record and updated under the scheduler/process lock. Context switch counters increment when a thread is selected, preemptions increment only for timer-driven running-to-ready requeue, voluntary blocks increment for blocking cap_enter and ParkSpace waits, and migrations increment when a thread runs on a different scheduler CPU than its previous run.
  • Add process/session/service aggregation only after the per-thread record has a single ledger of record. Completed 2026-05-22 13:50 UTC: a per-Process ProcessCpuAccounting ledger sums runtime_ns and a process-level context_switches dispatch count incrementally at the same scheduler/process-lock charge points that update ThreadCpuAccounting, so it captures exited threads’ contributions. Only the always-present (non-measure) per-thread quantities are rolled up; the measure-gated preemptions/voluntary_blocks/migrations counters are intentionally not aggregated so the default-build proof stays meaningful. The kernel emits a sched: process_cpu_accounting pid=... runtime_ns=... context_switches=... line at per-process exit and make run-spawn asserts a nonzero aggregate. Session/service aggregation remains a stretch follow-on.
  • Add tests or QEMU diagnostics proving runtime increases while running and stops while blocked. Completed 2026-05-01 05:08 UTC: make run-spawn now asserts a compact scheduler proof line that requires nonzero runtime, context switches, preemptions, and voluntary blocks, plus stable blocked and exited runtime observations.
  • Keep runtime accounting independent of tickless idle by using the monotonic clocksource layer. Completed 2026-05-01 05:08 UTC: normal accounting uses monotonic_ns() and does not read kernel/src/measure.rs cycle counters.

Phase D: Best-Effort Fair Scheduling

Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d). The first Phase D policy is weighted fair queueing on top of the existing per-thread runtime_ns / virtual_runtime_ns accounting, with a capability-authorized SchedulingPolicyCap for weight and latency-class mutation. The controlled Task 6 benchmark pair passed the harness-enforced 1-to-2 work/total gates; capOS recorded 1-to-4 work/total diagnostics 3.088x / 2.700x at 4 workers versus the prior single-global-queue baseline 1.566x / 1.538x, and that 1-to-4 row was manually accepted for Phase D closeout. The matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. EEVDF is now a follow-on policy evaluation, not a Phase D blocker. The design content is in docs/proposals/scheduler-evolution-proposal.md “Phase D first-policy decision”, “Phase D capability surface”, “Phase D migration fairness sketch”, “Phase D test matrix”, and “Phase D overload behavior” sections. The completed implementation plan is archived at docs/backlog/scheduler-evolution.md.

The bullets below retain the closed acceptance gates and the Phase D follow-ons that should be selected explicitly. Phase E SchedulingContext is the next scheduler authority phase, followed by Phase F auto-nohz / SQPOLL / tickless idle; generic full-nohz remains deferred behind those prerequisites.

  • Choose initial weighted-fair or EEVDF-like policy based on accounting and queue data. Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred. See docs/proposals/scheduler-evolution-proposal.md “Phase D first-policy decision”.
  • Add scheduler entity weights and latency class metadata through a capability-authorized policy path, not ambient process fields. Closed by docs/backlog/scheduler-evolution.md Tasks 1-2: SchedulingPolicyCap schema + kernel cap, per-thread weight/latency_class fields, weighted vruntime, and caller-thread cap binding.
  • Preserve fairness across CPU migration. Implementation tracked in docs/backlog/scheduler-evolution.md Task 4 (vruntime travels with the thread, virtual_finish_ns recomputed at destination enqueue, bounded steal targets the queue whose head has the lowest virtual_finish_ns, matching the local pick rule of taking the front of the ascending per-CPU queue). Closed 2026-05-08 00:53 UTC: invariants made explicit on refresh_virtual_finish_ns_locked and at the steal-insert site; the cfg(feature = "measure")-gated ThreadCpuAccounting.migrations counter moved from the dispatch-time scheduled_measure path to enqueue-time record_placement_spread_migration_locked and record_steal_migration_locked arms; weight-change-while- enqueued contract proved by construction with a debug_assert! reinforcement in Process::refresh_thread_virtual_finish_ns.
  • Test CPU hogs, short sleepers, direct IPC server/client pairs, multi-process load, and same-process sibling load. Implementation tracked in docs/backlog/scheduler-evolution.md Task 5 (test matrix smokes) and Task 6 (the controlled make run-thread-scale evidence pair: harness-enforced 1-to-2 gates plus a manually accepted 1-to-4 diagnostic closeout row). Closed 2026-05-10 19:46 UTC: the benchmark-VM Task 6 run at commit 76025f0963a4 recorded capOS 1-to-4 work/total diagnostics 3.088x / 2.700x; the 1-to-2 gate stayed green at 1.809x / 1.774x. The matching Linux pthread baseline on the same physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x.
  • Define overload behavior when runnable entities exceed the selected CPU set or when migration cannot keep up. Resolved at the design level 2026-05-05 19:00 UTC: soft overload uses vruntime ordering (no entity is starved); hard overload defers to Phase F CpuIsolationLease and Phase G RealtimeIsland. See docs/proposals/scheduler-evolution-proposal.md “Phase D overload behavior”.
  • Phase D follow-on: EEVDF migration. Once the WFQ slice has accepted thread-scale evidence, evaluate replacing the bucketed per-CPU VecDeque with an EEVDF eligibility set (BTreeMap-by-virtual-deadline) plus per-thread request size and lag accounting. The accounting fields, capability surface, and migration contract carry directly; the change is localized to the dispatch ordering structure. Promote to its own design slice if and when selected; do not bundle it into the WFQ first-slice plan.

Phase E: SchedulingContext Capability

Phase E policy follow-ups are closed. Local owner-shell logout propagation is recorded in scheduler-phase-e-local-owner-shell-logout-propagation. Endpoint donation/return, timeout/depletion notifications, and the scheduler-observable session lifecycle hook are recorded on main: scheduler-phase-e-endpoint-donation, scheduler-phase-e-timeout-depletion-notifications, and scheduler-session-lifecycle-hook. The donated-context logout policy is also closed as a conservative counted/skipped return-path proof: scheduler-phase-e-session-logout-donated-context-policy. Timeout/depletion notifications now use fixed per-context notification cells allocated at context creation/bootstrap. The ordinary non-donated session-logout stale-context proof is complete through the UserSession.logout() hook. In-flight endpoint donation uses the conservative counted/skipped policy during logout and relies on endpoint RETURN/cancel to finish the in-flight transfer/clear without returning donor budget early. Local owner-shell exit now calls the same UserSession.logout() path on clean REPL exit or terminal-close completion; the shell proof observes the scheduler hook with no bound local shell SchedulingContext, while the focused session-context proof remains the ordinary bound-context stale evidence.

  • Phase E preflight: retire the transitional CAPOS_SCHED_DISABLE_WFQ=1 / WakePolicy::QueueAny single-global-queue fallback that Phase D kept for one bisect cycle. This is a scheduler-surface cleanup before SchedulingContext claims budget/period authority; do not treat it as an EEVDF blocker. Completed 2026-05-10 22:20 UTC: the source-level opt-out, queue-0 enqueue funnel, and QueueAny wake policy are gone.
  • Define the first SchedulingContext object shape. Phase E Task 1 adds the minimal schema/control-plane cap shape: SchedulingContextSpec carries budget, period, relative deadline, byte-oriented CPU mask, and overrun policy; SchedulingContextInfo is a read-only snapshot with remainingBudgetNs as derived info-only state; and the kernel/runtime expose an info-only SchedulingContext.info() cap stub for focused grant/discovery and client decode coverage. The cpuMask field is a canonical little-endian bitset: CPU n is bit n % 8 of byte n / 8, empty means no CPUs selected, producers omit trailing zero bytes, and non-empty canonical masks end in a nonzero byte. Dispatcher budget enforcement, replenishment, bind/revoke rules, donation/return, depletion notifications, realtime islands, SQPOLL, and nohz remain deferred.
  • Add capability creation/bind/revoke rules and generation identity. The second Phase E control-plane slice keeps info() method id 0 stable, adds same-interface context creation as a bounded result-cap transfer, records at most one caller-thread binding per context generation, and revokes by advancing the context generation and clearing the matching thread metadata binding. Bootstrap grants and created contexts use the same non-wrapping context-id allocator so distinct caps cannot alias the (contextId, generation) binding key. The focused make run-scheduling-context QEMU smoke proves distinct bootstrap identities, create result-cap adoption, bind/revoke, stale-generation calls, release cleanup, and the explicit infoOnlyNoDispatchChange dispatch-effect marker. Stale caps report staleGeneration and cannot mutate scheduler metadata; revoked contexts report revoked. Dispatch selection, WFQ ordering, runtime charging, replenishment, donation/return, timeout/depletion notification, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future work.
  • Enforce budget and replenishment in the kernel dispatcher. First Phase E budget enforcement landed 2026-05-11 08:38 UTC: bindCallerThread() now installs a fixed per-thread budget ledger under the scheduler/process locking model, runtime charge decrements the bound context budget at the existing dispatch charge points, runnable selection replenishes elapsed periods without allocation, and exhausted contexts stay queued but RetryLater until their next period. Deadline-driven accounting closed the previous periodic-tick granularity caveat on 2026-06-04: the ordinary dispatch path arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next scheduler tick, kernel-mode one-shot fires restore a live periodic timer, nohz re-arm folds the leased thread’s budget deadline into its existing nearest deadline, and nohz budget depletion restores the periodic tick with reason=scheduling-context-budget-throttled. make run-scheduling-context proves visible charge, replenishment to full budget, stale/revoked fail-closed behavior, and a throttled wall-clock window with dispatch_effect=budgetEnforced; the representative 5 ms deadline marker recorded elapsed_since_arm_ns=5474819, overshoot_ns=474819, remaining_after_ns=0, and bounded_charge=true. At that slice’s landing, donation/return, depletion notifications, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remained future work.
  • Add endpoint donation/return semantics for synchronous calls and passive services. Completed 2026-05-11 10:51 UTC: endpoint in-flight call state now carries a bounded internal donation token when a caller with a bound SchedulingContext delivers a synchronous CALL to a receiver thread without its own context. The scheduler charges pre-donation caller runtime before moving the ledger, charges passive-server runtime before returning the ledger, and returns the remaining budget to the caller before waking it when RETURN commits, commits an application exception, or fails with an invalid caller result buffer. RETURN preflight failures keep the in-flight donation intact; delivery/return cancellation paths return or clear the donation without allocating. A donor with an in-flight token is blocked from returning to userspace until the endpoint call returns or is canceled. Nested donation of an already donated context is rejected until stacked return tokens have a dedicated design. The focused make run-scheduling-context smoke now includes a same-process endpoint round trip with endpoint_donation=ok, endpoint_return=ok, endpoint_exception_return=ok, endpoint_invalid_return=ok, and endpoint_nested_rejected=ok, plus an endpoint_donor_block=ok delayed-server cap_enter(0, 0) proof, an endpoint_donor_fast=ok fast-return race proof, and remaining-budget fields for successful RETURN, application-exception RETURN, invalid-result RETURN, nested-donation rejection, donor blocking, and fast donor return. This is synchronous endpoint donation/return only; depletion notifications, realtime islands, SQPOLL, auto-nohz, CPU placement enforcement, and session-logout stale-context coverage remain future work.
  • Add a scheduler-observable session lifecycle hook from UserSession.logout() into scheduler-owned SchedulingContext stale-marking. The hook covers explicit logout plus the remote DTO gateway logout/connection-teardown paths that already call UserSession.logout(): after the liveness cell flips to logged out, the scheduler scans process/thread metadata for the same session liveness cell, removes non-donated matching bindings from its ledger, and advances the bound context generation as revoked so ordinary old grants become stale. The hook preserves the scheduler as the binding authority and avoids scheduler-lock to context-record-lock inversion by taking one binding under the scheduler lock, dropping that lock, and then marking the context stale through its cleanup token. In-flight endpoint donation bindings are explicitly skipped because returning donor budget before endpoint cancellation would violate the donor-blocking invariant. This hook unblocks focused stale-context proofs: ordinary non-donated logout, donated-context policy, and local owner-shell propagation are now closed by their dedicated task records.
  • Add timeout/depletion notifications with preallocated emergency-path storage. Completed in the timeout/depletion notification slice: every SchedulingContext owns a fixed notification cell allocated at context creation/bootstrap, with coalescing slots for budget depletion and deadline/timeout, sequence counters, bounded coalesced-event counts, holder identity, donated-holder marking, remaining budget, and next timestamp snapshots. Scheduler charging, timeout/deadline observation, donation-return, and cancellation paths update only that fixed state; they do not allocate, publish result caps, append unbounded queues, or require hard-path logging. SchedulingContext.drainNotifications() exposes typed ok, revoked, and staleGeneration observer results, plus explicitRevoke lifecycle state. The focused make run-scheduling-context smoke proves repeated budget-depletion coalescing, deadline notification, explicit revoke, stale observer labels, and endpoint-donated notification accounting. A pre-armed observer waiter/wakeup path remains a separate follow-up.
  • Extend stale-context proofs beyond the first revoke/generation contract to process and thread exit. The focused SchedulingContext smoke now proves that a context bound by an exiting thread becomes unbound without minting fresh budget on rebind, while process-exit and explicit process-termination children bind contexts and run the process cleanup path before cap-table release.
  • Extend stale-context proofs to session logout. Completed for ordinary non-donated contexts at 2026-05-11 17:44 UTC. This remains separate from process/thread exit because logout propagation is owned by the session lifecycle surface, not the scheduler dispatch loop. The focused session-context smoke now binds a SchedulingContext in a session-owned child, calls UserSession.logout(), observes the scheduler hook line, and proves the old cap is stale before budget refresh, caller-thread rebind, result-cap publication, or metadata mutation. Process/thread exit cleanup remains covered by make run-scheduling-context.
  • Prove donated receiver logout policy. Completed at 2026-05-11 18:19 UTC. Logout keeps the existing conservative counted/skipped behavior for receiver threads holding endpoint-donated SchedulingContext bindings. The focused session-context smoke has a donor call a guest-session receiver, the receiver logs out while holding the donated binding, the scheduler hook reports stale_marked=0 donation_inflight_skipped=1, the donor remains blocked in cap_enter(0, 0) until endpoint RETURN, and the donor context returns bound with reduced remaining budget rather than a refreshed or minted budget. Local owner-shell lifecycle propagation was closed separately by scheduler-phase-e-local-owner-shell-logout-propagation.
  • Propagate local owner-shell exit to session logout. Completed at 2026-05-11 19:36 UTC. Clean local REPL exit and terminal-close completion now call the held UserSession.logout() before process exit, so the session liveness cell is marked logged out through the same kernel hook used by explicit logout and the remote DTO gateway. The shell smoke asserts the scheduler-observable hook line with stale_marked=0 donation_inflight_skipped=0; ordinary bound SchedulingContext stale behavior remains proven by the focused session-context smoke through the same hook. Process/thread-exit cleanup remains separate and unchanged.

Phase F: CPU Isolation Lease and SQPOLL

The Phase E gates and the first Ring/SQPOLL ownership prerequisite are now closed. Dispatch through scheduler-phase-f-auto-nohz-sqpoll only through its own Phase F authority, telemetry, rollback, and nohz/SQPOLL tasks; this backlog entry does not implement Phase F behavior. The concrete ring prerequisite is scheduler-phase-f-one-sq-consumer-ring-ownership, closed on 2026-05-11: ring endpoints now have generation-checked syscall-mode SQ-consumer leases, duplicate future SQPOLL acquisition is rejected while that owner is live, stale owner generations cannot advance SQ head, teardown releases the owner without clearing accepted completions, and bounded SQPOLL admission metadata exists without starting a poller. The first executable Phase F child task, scheduler-phase-f-cpu-isolation-lease-scaffold, closed on 2026-05-12 12:02 UTC. It is limited to CpuIsolationLease authority, activation preflight telemetry, and rollback scaffolding. It does not enable SQPOLL, automatic nohz, tick suppression, automatic CPU isolation, or generic full-nohz behavior. The second executable child task, scheduler-phase-f-nohz-activation-telemetry, closed on 2026-05-12 14:18 UTC. It turns the disabled preflight into observable activation/deactivation and rollback decisions while still leaving tick suppression, SQPOLL, automatic CPU isolation, and generic full-nohz disabled. The housekeeping/deferred-work placement child closed on 2026-05-12 18:36 UTC by scheduler-phase-f-housekeeping-deferred-work-placement: the scheduler now records an explicit online housekeeping CPU placement input, selected housekeeping mask, deferred cleanup/timer/network/IRQ/accounting placement or rejection labels, and bounded revoke, process-exit, service-replacement, and session-logout cleanup placement while ticks remain periodic. The bounded SQPOLL ring-mode child closed on 2026-05-12 20:29 UTC by scheduler-phase-f-sqpoll-ring-mode-bounded-poller: ring endpoints now transition explicitly through syscall, SQPOLL starting, running, sleeping, stopping, and rollback modes; a kernelSqpoll CpuIsolationLease admits one bounded periodic-tick poller for the caller thread’s ring; producer wakeups use NEED_WAKEUP; stale SQ owners fail before SQ-head consumption; and poller stop/revoke preserves accepted CQEs while releasing SQ ownership. Actual tick suppression is blocked until the SQPOLL progress path no longer depends on periodic scheduler ticks. The clockevent/deadline substrate child closed on 2026-05-12 23:07 UTC by scheduler-phase-f-clockevent-deadline-substrate: normal QEMU/x86_64 monotonic_ns() is backed by the calibrated TSC rather than TICK_COUNT, the periodic LAPIC tick disciplines the TSC epoch while nohz is disabled, Timer.sleep, finite cap_enter, and park waiters store absolute monotonic deadlines, and the LAPIC clockevent backend can program a bounded one-shot deadline and restore periodic mode. The substrate’s firing precision is now proven, not only its programming: the scheduler-lapic-oneshot-subtick-firing-precision child (closed 2026-06-04 03:26 UTC, commit 49b36129) arms a TICK_NS/2 one-shot over the live periodic timer during boot and measures the actual countdown-to-fire instant, asserting via make run-scheduling-context that it fires sub-tick (~5 ms for a 5 ms request, well under the 10 ms tick) with the current-count correctly reset to the sub-tick value – ruling out the suspected “INITIAL_COUNT write does not reset the running countdown” root cause – and that the kernel-mode-fire periodic restore leaves a live timer (no lost-timer hang). Automatic nohz, tick suppression, SQPOLL nohz, generic full-nohz, and production realtime admission remain disabled. Known pre-existing gate flake (independent of the firing-precision proof, which passed in 100% of measured boots): the scheduling-context-smoke budget-timing proof exited early in ~20% of boots on both main and this branch under host load – its wall-clock budget-throttle assertions are sensitive to host scheduling jitter. Run make run-scheduling-context on an otherwise-idle host until the budget proof is stabilized (own follow-up); it is orthogonal to the clockevent firing assertions. A second substrate prerequisite surfaced 2026-06-04 from scheduler-deadline-driven-budget-accounting’s Attempt 2: even with the LAPIC one-shot firing precisely sub-tick, the monotonic clocksource discipline floored a sub-tick interval to a full tick. A boot probe measured a real 5.0 ms interval advancing monotonic_ns by 10.0 ms after one discipline_clocksource_tick step (monotonic_delta_ns=10000020 for real_ns=5000118, floored=true), because discipline_clocksource_tick took max(tsc_interpolated, epoch + TICK_NS) on every fire. That was the real cause of that task’s Attempt 1 “9.85 ms” – not the LAPIC firing (fixed) and not the ordinary-path timer-ISR rechecks (which provably no-op when no nohz/idle window is active). The prerequisite scheduler-monotonic-clocksource-subtick-discipline closed it (2026-06-04): discipline_clocksource_tick now trusts the TSC interpolation at sub-tick granularity, falling back to the TICK_NS floor only when the interpolated advance is below MIN_DISCIPLINED_ADVANCE_NS (TICK_NS / 8) so a degenerate (stalled/backward/mis-calibrated-slow) TSC still keeps a minimum forward rate; the tick-derived fallback is unchanged. A boot proof (context::qemu_clocksource_subtick_discipline_proof, emitted on make run-scheduling-context) runs one real TICK_NS / 2 discipline step and asserts monotonic_ns() tracked the sub-tick delta – measured monotonic_delta_ns=5055612 for real_ns=5000474 (floored=false, subtick_tracked=true). Deadline-driven budget accounting and generic full-nohz can now observe a sub-tick deadline through the accounting clock. The SQPOLL nohz-progress child closed on 2026-05-13 00:06 UTC by scheduler-phase-f-sqpoll-nohz-progress: cap_enter now has a bounded current-thread SQPOLL service entry for producer wakes and syscall kicks that borrows the SQPOLL owner lease, charges the admitted accounting target, and reports non-periodic progress evidence while ordinary periodic service remains active. Automatic policy-service nohz issuance and production realtime admission remain future work; generic SQPOLL nohz for explicitly leased caller-thread rings landed in the later Step 14 slice. The tickless-idle child closed on 2026-05-23 09:12 UTC by scheduler-tickless-idle-step6: the CPL0 idle loop now admits an idle-only tickless window when no non-idle work is runnable, no nohz lease is active, no local deferred cleanup is pending, no cap-enter polling dependency is present, and the LAPIC one-shot clockevent plus monotonic clocksource are available. The periodic tick is restored before non-idle dispatch and on rollback. Legacy cap-enter polling surfaces, including the terminal shell path, remain periodic until they gain explicit deadline or housekeeping placement.

  • Define CpuIsolationLease authority separately from CPU-time budget. Completed 2026-05-12 12:02 UTC by docs/tasks/done/2026/scheduler-phase-f-cpu-isolation-lease-scaffold.md.
  • Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, live accounting target, one-SQ-consumer state, and revocation latency. The scaffold reports blocked eligibility and leaves ticks/nohz/SQPOLL disabled.
  • Enforce one live SQ consumer per ring before SQPOLL. Completed 2026-05-11 by docs/tasks/done/2026/scheduler-phase-f-one-sq-consumer-ring-ownership.md.
  • Integrate SQPOLL ring mode only after this ownership prerequisite and docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md have landed. Completed 2026-05-12 20:29 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md.
  • Add lease revocation on explicit revoke, process exit, service replacement, and session close. Completed by the focused make run-scheduler-cpu-isolation-lease proof.
  • Add nohz activation/deactivation telemetry. Completed 2026-05-12 14:18 UTC by docs/tasks/done/2026/scheduler-phase-f-nohz-activation-telemetry.md. The proof records active-candidate rejection, stale/revoked rollback, ready housekeeping CPUs under -smp 4, exactly-one-runnable target CPU evidence, deferred cleanup/timer/network/IRQ labels, valid accounting targets, explicit clocksource/accounting readiness or refusal, live syscall SQ-consumer state, revocation-latency policy, and disabled tick/SQPOLL/full-nohz guardrails.
  • Assign housekeeping and deferred-work placement before behavior. Completed 2026-05-12 18:36 UTC by docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md. The proof keeps periodic ticks, SQPOLL, automatic CPU isolation, and generic full-nohz disabled.
  • Add bounded SQPOLL ring mode only after housekeeping/deferred-work placement. Completed 2026-05-12 20:29 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md. The proof covers one poller owner, bounded polling, stale queue-owner rejection, wake/sleep ordering, and teardown without losing completions while periodic ticks remain active.
  • Add clockevent/deadline substrate before automatic nohz activation. Completed 2026-05-12 23:07 UTC by docs/tasks/done/2026/scheduler-phase-f-clockevent-deadline-substrate.md. It split clocksource reads from clockevent programming, added a one-shot/restore timer backend, and converted tick-count waiters to absolute monotonic deadlines while ordinary scheduling remains periodic.
  • Add SQPOLL nohz progress that does not depend on periodic scheduler ticks. Completed 2026-05-13 00:06 UTC by docs/tasks/done/2026/scheduler-phase-f-sqpoll-nohz-progress.md. The proof preserves the one-SQ-consumer, NEED_WAKEUP, bounded polling, stale-owner rollback, and teardown/completion invariants while keeping periodic fallback service active.
  • Add automatic nohz activation only after placement, bounded SQPOLL behavior, the deadline substrate, and non-periodic SQPOLL progress. Completed 2026-05-14 09:01 UTC by docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md. The CpuIsolationLease activation preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (namedRing = none compute lease on the preflight CPU): it masks the periodic LAPIC tick and arms a bounded one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). Network polling and IRQ affinity stay read-only fail-closed admission gates – any ring-coupled or device-owning mode keeps the conservative refusal. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic tick first. The make run-scheduler-cpu-isolation-lease proof asserts the activation and rollback log lines. Generic full-nohz and the broader SQPOLL-driven nohz state machine landed in later slices.
  • Measured suppressed-tick proof on the lease path (harness-hardening). Completed 2026-06-02 19:53 UTC by docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md. Closes the review-identified honesty gap that the lease path proved suppression only by the tick_suppression=active periodic_tick=masked marker plus a no-hang progress loop, never that periodic timer interrupts actually stopped arriving. The kernel now counts genuine periodic LAPIC fires per CPU (account_timer_fire in the timer ISR increments only when neither the lease-backed nor idle tick-suppression bit is set, so the one-shot replacement is never miscounted), snapshots the count at activation, and on rollback emits cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>; a bounded post-rollback cpu-isolation: nohz restored-rate line proves the periodic rate returns. The demo holds a childless compute lease on CPU 0 across a ~150 ms masked window, then a busy restore window; the harness asserts a masked window with actual_periodic near zero (expected_periodic >= 10, suppressed >= 8) and a restored window with actual_periodic tracking expected_periodic (>= 8). No activation behavior changed; the mask/one-shot mechanism is untouched. A durable ticks_suppressed{cpu,mode} telemetry field on a monitoring/status surface remains future work.
  • Timeout-based auto-revoke primitive on CpuIsolationLease. Landed via docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md. Adds leaseLifetimeNs @6 to CpuIsolationLeaseSpec (0 = no expiry, preserving every existing producer); read_spec clamps to a one-hour ceiling and rejects a non-zero lifetime below maxRevocationLatencyNs (invalidSpec). A lease records expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired, registry unregister, SQPOLL stop, rollback_nohz_for_lease) and every subsequent info/activationPreflight/revoke reports staleGeneration. The nohz activation record carries the lifetime deadline so a tickless CPU under a lease that crosses its lifetime rolls back at the next timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by maxRevocationLatencyNs. make run-scheduler-cpu-isolation-lease asserts the expiry release line, the post-expiry staleGeneration, and the invalidSpec rejection.
  • Enable tickless idle only when there is no runnable non-idle work and no cap-enter polling dependency. Completed 2026-05-23 09:12 UTC by docs/tasks/done/2026/scheduler-tickless-idle-step6.md. The idle path masks the periodic LAPIC tick only for true idle, arms a bounded one-shot at the nearest Timer/ParkSpace deadline or 100 ms housekeeping floor, and restores periodic mode before ordinary work. Ready-but-budget-throttled SchedulingContext retry windows remain periodic so budget replenishment and deadline notification timing stay on the existing scheduler accounting path.
  • Keep automatic full-nohz behind the completed one-SQ-consumer ownership prerequisite and the narrower CpuIsolationLease telemetry/rollback proof. Generic full-nohz is not the first Phase F implementation task.

Phase F.5: Full-SMP Hardware Scalability

This phase is the planning slot for the next visible SMP milestone when the project is ready to answer whether capOS uses 16/32-core machines well. It does not replace the current Installable System selected milestone and should not be dispatched as a QEMU-only benchmark cleanup. QEMU remains regression infrastructure; the primary performance record should come from direct capOS execution on a dedicated high-core perf runner or bare-metal/cloud-bare-metal machine.

  • Replace temporary four-owner scheduler assumptions with dynamic CPU topology: discovered scheduler CPU set, physical-core versus SMT sibling labeling, APIC id mapping, per-CPU allocation sizing, and boot/status output that makes the selected CPU set auditable.
  • Add or select the APIC backend needed for high-core machines. xAPIC MMIO can remain the current low-core path, but x2APIC selection is the likely larger-APIC-id follow-up from docs/research/x2apic-and-virtualization.md.
  • Shrink scheduler shared-state serialization. Local pick/requeue should avoid one global scheduler-lock critical section where possible, while shared process/thread metadata, blocking waiters, direct IPC handoff, timers/deadlines, and cleanup keep explicit ownership and rollback rules.
  • Add topology-aware placement and observable migration policy. The record should distinguish local enqueue, cross-core wake, steal, SMT sibling placement, failed placement, reschedule IPI, and TLB-shootdown costs.
  • Build the hardware benchmark profile from existing benchmark proposals: static map/reduce, uneven dynamic task pool, barrier phase loop, independent processes, same-process threads, and one capability-call/service-bound workload. Each workload reports work-window and total-time rows at 1/2/4/8/16/32 workers when hardware exists.
  • Record matching native Linux rows on the same machine, plus capOS raw artifacts with source commit, toolchain, topology, frequency/isolation policy, run count, warmup policy, verifier output, medians, variance, speedup, efficiency, and scheduler counters.

Phase G: Realtime Islands

  • Define RealtimeIsland admission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy.
  • Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads.
  • Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
  • Record deadline misses and overrun handling as observable output.

Phase H: Policy Service

  • Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics.
  • Keep kernel fallback scheduling independent of policy-service liveness.
  • Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
  • Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
  • Define how stateful task/job graph assignment metadata maps into scheduler policy inputs: graph priority to weight/latency class, graph deadline to request freshness or admission input, graph budget to SchedulingContext reference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself.
  • Design the user-space policy-service AutoNoHz placement heuristic for ordinary threads that appear capable of utilizing a full CPU core. The policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision from a future monitoring/status surface and issues a bounded CpuIsolationLease against a pre-authorized account or session CPU pool. The lease is placement only; it does not mint CPU-time authority. Required bounds on every auto-issued lease: lifetime shorter than admin-issued leases by default and renewable only by re-observing the signal; max_revocation_latency_ns bounded by NoHzEligibility; accounting target a live SchedulingContext or coarse ResourceLedger; CPU set restricted to the operator-declared auto-claim pool; priority-aware fairness preemption that terminates the lease (not just rolls back tick suppression) on arrival of an equal-or-higher priority runnable entity. Prerequisites: (a) a timeout-based auto-revoke primitive on CpuIsolationLease – LANDED 2026-05-30 as leaseLifetimeNs @6 (0 = no expiry) with enforced first-observation auto-revoke and a lease-lifetime-expired nohz rollback; the auto-claim placement lease can now be granted with a bounded lifetime. The bounded renew half LANDED as CpuIsolationLease.renew @4, which pushes the deadline forward by at most the original lifetime while keeping the lease’s identity / accounting / nohz state, leaving only the renewal-by-re-observation heuristic (when to call renew) to Phase H; (b) the monitoring/status surface that exports per-thread saturation observation – LANDED 2026-05-30 as the non-measure per-thread saturation status surface. voluntary_blocks and preemptions were promoted out of cfg(feature = "measure"), an always-built runnable_accumulated_ns runnable-but-not-running accumulator was added (stamped at the run-queue enqueue chokepoint, accumulated at selection), and all three plus runtime_ns are exported through SchedulingPolicyCap.snapshot @2 (proof make run-thread-fairness: hog voluntary_blocks=0 with live preemptions/runnable_ns). migrations stays measure-gated. This read-side surface exports raw cumulative counters only; windowing and the saturation decision remain policy-service work; (c) the pool-grant authority shape that lets an operator pre-authorize an account’s auto-claim pool. Declared-pool descriptor LANDED 2026-05-30: the CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: default pool 0 plus one declared non-default pool 1 over a single CPU), and read_spec admits a lease only when its poolId is declared and its allowedCpuMask is a subset of the pool’s CPU mask – echoing the admitting pool’s id/mask through CpuIsolationLeaseInfo (proof make run-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec (undeclared id), declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0). Manifest-sourced pool table LANDED 2026-05-30: the declared-pool registry is sourced from the boot manifest SystemConfig.cpuIsolationPools @14 (each entry a CpuIsolationPoolDescriptor), with the in-kernel constant as the fail-closed default when the manifest omits/empties the list; the kernel validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected) and emits cpu-isolation: declared-pools source=manifest count=3 ... (proof make run-scheduler-cpu-isolation-lease; kernel-default fallback proven by cargo test-config decode/empty assertions). Per-pool live-lease capacity bound LANDED 2026-05-31: CpuIsolationPoolDescriptor carries poolMaxLeases @2 (0 = unbounded); a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted. The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), and reclaims after a revoke (pool_capacity_reclaimed=ok) – live-count, not cumulative. This is the count+reject mechanism the per-account N policy keys onto. Account identity + per-account N LANDED 2026-05-31: CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (admitted_pool_id AND account_id both matching) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted, not yet authenticated. Bootstrap pool-grant authentication LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp, source cpu_isolation_pool_grant, kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant binding one authenticated account to one declared pool. createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the exact lease-create admission path (cpu_isolation::create_lease_for_caller), so the per-account bound is unforgeable: a holder can no longer assert another account to evade poolMaxLeasesPerAccount. The initial proof used one account-7/pool-2 grant; the current manifest-sourced proof below exercises multiple seeded grants. Manifest-declared multi-account grant table LANDED 2026-06-01: the grant binding is now operator-declared via SystemConfig.cpuIsolationPoolGrants (schema/capos.capnp, decoded in capos-config, seeded at boot by cpu_isolation_pool_grant::seed_pool_grants after seed_declared_pools), mirroring the manifest-sourced cpuIsolationPools table; the cpu_isolation_pool_grant / cpu_isolation_pool_grant_secondary sources stage seeded binding index 0 / 1, so a manifest can pre-authorize multiple distinct (account, pool) grants, each staged as its own bootstrap cap. An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, so manifest-sourced pool tables that omit pool 1 still stage a usable default grant. Proof make run-scheduler-cpu-isolation-pool-grant now boots a two-entry grant table (account 5/pool 1, account 8/pool 2), holds both grant caps, and proves each stamps its OWN bound account (pool-grant: create ok bound=A stamped_account_id=5 ... / bound=B stamped_account_id=8 ...) with the per-account bound still enforced fail-closed under the manifest-sourced path; boot evidence cpu-isolation: pool-grants source=manifest count=2. Fallback proof make run-scheduler-cpu-isolation-pool-grant-default boots a manifest-sourced pool table that declares pool 2 and omits pool 1 plus an empty grant list; the kernel stages one default grant as (account 7, pool 0) and the smoke proves it can mint a stamped lease. Runtime grant minting landed (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist mint is refused unauthorized, so it is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same run-scheduler-cpu-isolation-pool-grant smoke now also mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist (account 99, pool 2) mint is refused; boot evidence cpu-isolation: grant-minter-allowlist source=manifest count=1. Grant-revocation lifecycle landed (CpuIsolationGrantMinter.revokeGrant): a runtime-minted grant gets a revocable (grantId, generation) identity; revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration, and cascades to every live lease minted through it – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) so the per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed. The same run-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. This closes Track C (prerequisite (c)) – operator grant authority is now mint + revoke complete. Detailed design in docs/proposals/tickless-realtime-scheduling-proposal.md “Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads”.

AutoNoHz Decomposition: Roadmap to Full Auto-NoHz

The status bullet above narrates what landed. This subsection is the discrete dispatchable decomposition from the current landed state to full operator-driven auto-nohz, so the path is written as concrete slices rather than “future work” prose. Grounding: the proposal’s “Policy-Service Userstories: AutoNoHz Placement”, “Bounds the policy service must enforce”, “Telemetry Requirements”, and Implementation Sequence steps 7/14/17.

Landed substrate (not repeated below): the narrow manual per-CPU LAPIC tick-mask for the single-runnable compute window and the SQPOLL-coupled window, tickless idle, prerequisite (a) leaseLifetimeNs @6 timeout auto-revoke, prerequisite (b) the SchedulingPolicyCap.snapshot @2 saturation observation surface, and prerequisite (c) pool-grant authority now mint + revoke complete (the manifest-declared multi-account cpuIsolationPoolGrants @15 table, runtime grant minting through CpuIsolationGrantMinter, and the grant-revocation lifecycle that cascades to minted leases). Fairness lease termination (Track D) and a measured suppressed-tick proof have also landed, as have network-poll and IRQ-affinity housekeeping routing, kernel-side generic full-nohz admission for ordinary budgeted compute threads, and generic SQPOLL nohz admission for explicitly leased caller-thread rings. What the name “auto nohz” still oversells today: there is no production policy service, and broader userspace-poller/device-queue issuance remains future work. Each remaining slice below closes one of those.

Conflict-domain note: every kernel slice here shares resource:scheduler-cpu-isolation and writes kernel/src/cap/cpu_isolation* or kernel/src/sched.rs, so they serialize against each other – dispatch the chain head first; the rest convert from this list into docs/tasks/ records as their depends_on closes. Slices marked ready have a task record under docs/tasks/; the rest stay here until their prerequisite lands.

Next increment (decomposed 2026-06-04 00:18 UTC; updated 2026-06-07 after generic SQPOLL nohz landed): Track C, Track D, and the measured suppressed-tick proof are all landed, and the ordinary-thread and SQPOLL-ring kernel admission leaves are now done. Records under docs/tasks/ capture: scheduler-cpu-isolation-lease-renewal-on-reobservation (renewal residual), scheduler-nohz-irq-affinity-housekeeping-routing, scheduler-nohz-network-poll-housekeeping-routing, scheduler-deadline-driven-budget-accounting, and scheduler-generic-full-nohz-arbitrary-threads as done. The remaining operator-driven AutoNoHz capstone is the policy service. These scheduler CPU-isolation slices serialize against each other on resource:scheduler-cpu-isolation but are parallel-safe against the in-flight Phase C network-stack lane, so the scheduler lane stays runnable whenever Phase C 7c holds the kernel cap/ surface.

Track C – complete operator grant authority (prerequisite (c) residual):

  • scheduler-cpu-isolation-runtime-grant-minting – behavior, normal, LANDED 2026-06-02 22:24 UTC. One cap (CpuIsolationGrantMinter) mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist pair is refused unauthorized), instead of only the boot-seeded table. The minted grant reuses the same unforgeable createLease admission path. Proof make run-scheduler-cpu-isolation-pool-grant. depends_on: manifest-multi-account grant table (landed).
  • scheduler-cpu-isolation-grant-revocation-lifecycle – behavior, normal, LANDED 2026-06-03 17:11 UTC. CpuIsolationGrantMinter.revokeGrant revokes a runtime-minted grant by advancing its (grantId, generation) so later createLease through the stale handle fails staleGeneration and mints nothing; revocation cascades to every live lease minted through that grant, driving the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease so per-pool/per-account capacity frees immediately (a fresh grant’s lease is admitted into the reclaimed slot in the proof). Double-revoke is alreadyRevoked, unknown grantId is unknownGrant, seeded grants stay un-revocable. Closes Track C. Proof make run-scheduler-cpu-isolation-pool-grant. depends_on: scheduler-cpu-isolation-runtime-grant-minting (landed), scheduler-cpu-isolation-priority-aware-lease-termination (landed).

Track D – fairness preemption (proposal fairness_preemption):

  • scheduler-cpu-isolation-priority-aware-lease-termination – behavior, normal, LANDED 2026-06-02 21:17 UTC. On arrival of an equal-or-higher policy-priority runnable on the leased CPU when no other CPU authorized by both the admitted pool and the lease allowedCpuMask is eligible, the kernel now terminates (revokes) the lease itself at the existing nohz rollback site (fairness-preempted ... result=lease-terminated), not just restores the periodic tick, bounded by maxRevocationLatencyNs. The recheck compares the static WFQ policy priority (latency_class, weight) of the arriving entity against the captured leased thread; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The termination runs the same generation-advancing cleanup leaseLifetimeNs expiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequent info/revoke reports staleGeneration and placement/account capacity is freed without waiting for the holder’s next cap call. Proven in make run-scheduler-cpu-isolation-lease (default pool 0 with allowedCpuMask=0x01: an equal-priority sibling terminates and capacity is reclaimed, a strictly-lower sibling restores only). Out: no re-placement onto an eligible sibling CPU (the “no sibling eligible” condition is recorded; actual migration is generic-full-nohz work). depends_on: auto-nohz-activation (landed).

Lease lifetime renewal (proposal lifetime_ns renewal residual):

  • scheduler-cpu-isolation-lease-renewal-on-reobservation – behavior, normal, landed. CpuIsolationLease.renew @4 pushes expires_at_ns forward to now + leaseLifetimeNs (clamped to the same one-hour ceiling read_spec enforces), keeping the same (leaseId, generation), accounting binding, and nohz activation state. Callable only before expiry: a revoked, auto-revoked, or past-deadline lease stays stale (staleGeneration) and is not resurrected, and an unbounded leaseLifetimeNs = 0 (or factory) lease reports notRenewable. The renewed deadline is propagated to a tickless CPU’s nohz activation record (renew_nohz_lifetime_deadline_for_lease) so the lease-lifetime-expired disqualifier no longer rolls it back at the old deadline. CpuIsolationLeaseInfo.expiresAtNs echoes the deadline read-only. The kernel primitive the policy service uses to renew an auto-issued lease by re-observing the saturation signal; the re-observation heuristic itself stays Phase H policy-service work. Proof make run-scheduler-cpu-isolation-lease. depends_on: timeout-auto-revoke (landed).

Honesty / telemetry (proposal Telemetry ticks_suppressed{cpu,mode}):

  • scheduler-cpu-isolation-measured-suppressed-tick-proof – harness-hardening, normal, LANDED 2026-06-02 19:53 UTC (docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md). A kernel expected-vs-actual periodic-tick counter (account_timer_fire, counted only when no tick-suppression bit is set) over a bounded nohz window is asserted in make run-scheduler-cpu-isolation-lease (cpu-isolation: nohz suppressed-ticks ... plus a restored-rate line), so the proof shows the periodic tick actually stopped firing, not only that the mask write was issued and the CPU made progress. Closed the review-identified honesty gap. A durable ticks_suppressed{cpu,mode} telemetry field on a monitoring/status surface remains future work. depends_on: auto-nohz-activation (landed).

Step 7 – network poll housekeeping/deadline routing:

  • scheduler-nohz-network-poll-housekeeping-routing – behavior, normal, landed 2026-06-04 04:48 UTC. The in-kernel virtio-net poll (virtio::poll_scheduler) now routes off a lease-isolated (tickless) CPU: it consults sched::current_cpu_lease_nohz_active() and skips, emitting a bounded cpu-isolation: network-poll routed ... result=skipped-on-isolated-cpu record, while the always-ticking housekeeping CPU the admission requires keeps the poll progressing. The network_polling admission gate flips from the hard rejected-periodic-network-polling-not-routed-to-housekeeping refusal to a housekeeping-conditioned routed-periodic-network-polling-to-housekeeping-cpu admit (eligibility accepts the routed- prefix), and fails closed (rejected-network-polling-no-housekeeping-cpu-to-relocate) when no housekeeping CPU exists. The admitted named_ring=None lease carries the routed label tick-suppressed; the CallerThread compute-with-ring lease’s network refusal is removed but it stays ForcedPeriodic because IRQ affinity routing is the separate slice below. Proof make run-scheduler-cpu-isolation-lease; regression make run-net. depends_on: housekeeping-deferred-work-placement (landed), auto-nohz-activation (landed).
  • scheduler-nohz-irq-affinity-housekeeping-routing – behavior, normal, landed (docs/tasks/done/2026-06-04/). The activation path reroutes an opting-in leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal for a ring-coupled IRQ dependency that cannot be safely rerouted. Proof make run-scheduler-cpu-isolation-lease (irq-affinity ok ... routed_admitted=true restored_on_revoke=true residual_forced_periodic=true); DDF run-interrupt-grant / run-devicemmio-grant stay green. Scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto an actively-scheduling CPU stalls that CPU’s forward progress, so the live reroute is gated to a focused proof lease (reroute sentinel maxRevocationLatencyNs) whose destination is idle. A general busy-destination reroute remains future work behind a destination-quiescence gate or a non-KVM-irqchip delivery backend. depends_on: auto-nohz-activation (landed).

Step 14 – generic SQPOLL nohz for arbitrary rings:

  • scheduler-generic-sqpoll-nohz-arbitrary-rings – behavior, normal, done 2026-06-07. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, the ring is running/sleeping with a non-stale owner, exactly one SQ consumer is present, and producer wake/deadline rollback are bounded. The focused make run-scheduler-generic-sqpoll-nohz proof drives eligible entry, producer wake, SQPOLL service, rollback, and stale-owner rejection. Broader AutoUserspacePoller userspace-poller/device-queue issuance remains future policy-service work. depends_on: auto-nohz-sqpoll (landed), scheduler-nohz-network-poll-housekeeping-routing.

Generic full-nohz for arbitrary threads (the kernel half of “auto”):

  • scheduler-generic-full-nohz-arbitrary-threads – behavior, normal, done 2026-06-06. Ordinary budgeted compute threads can now enter full-nohz through an explicit SchedulingContext-targeted CpuIsolationLease when the single-runnable, budget-deadline, housekeeping, network-poll, IRQ-affinity, timer, lifetime, and rollback gates all pass. Missing thread budget, multiple runnable work, revoked or expired leases, unrouted dependencies, and no-housekeeping cases still fail closed. Issuance is still policy-service future work; this is only the kernel admission half. depends_on: scheduler-cpu-isolation-priority-aware-lease-termination, scheduler-nohz-network-poll-housekeeping-routing, scheduler-nohz-irq-affinity-housekeeping-routing.

Step 17 – user-space AutoNoHz policy service (capstone):

  • scheduler-autonohz-policy-service-saturation-local-proof – behavior, normal, done 2026-06-07. A userspace AutoNoHz policy-service smoke now holds an operator-declared CpuIsolationPoolGrant, consumes SchedulingPolicyCap.snapshot @2 runtime / runnable / voluntary-block / preemption counters, denies a voluntarily blocking worker, issues a bounded full-nohz lease only after a local saturation window, renews only after re-observing saturation, and proves stopped-renewal expiry leaves fallback periodic scheduling intact. The proof records the grant-stamped account/pool and the single allowed CPU mask that the kernel admitted. depends_on: scheduler-cpu-isolation-runtime-grant-minting, scheduler-cpu-isolation-lease-renewal-on-reobservation, scheduler-cpu-isolation-priority-aware-lease-termination.
  • scheduler-autonohz-production-policy-daemon – behavior, normal, blocked. Replace the local smoke’s fixed single-process proof with a privileged reusable policy daemon: profile-driven smoothing/window selection, cross-process target discovery, operator policy plumbing, structured observability, and revocation/non-renewal decisions for multiple accounts and pools. The landed local proof keeps this future work replaceable without ABI churn. depends_on: scheduler-autonohz-policy-service-saturation-local-proof.

Independent hardening (makes auto-nohz budget-safe):

  • scheduler-deadline-driven-budget-accounting – behavior, normal, done 2026-06-04. Charge SchedulingContext budget at monotonic-deadline granularity rather than per-periodic-tick so an auto-nohz thread cannot overshoot its budget by a full tick quantum while the tick is masked. Closes the “enforcement remains periodic-tick granularity” caveat that auto-nohz made load-bearing; the task ledger is docs/tasks/done/2026-06-04/scheduler-deadline-driven-budget-accounting.md. depends_on: Phase E budget enforcement (landed), scheduler-lapic-oneshot-subtick-firing-precision (done), scheduler-monotonic-clocksource-subtick-discipline (done).

Cleanup: Retire Benchmark-Driven Scaffolding Before Phase E

This section captures simplification work identified during the post-thread-scale SMP/threading architecture review on 2026-05-01 23:20 EEST. None of these items are regressions: the affected code is correct, gated behind the measure feature where it should be, and was added intentionally during attribution and placement slices that closed the In-Process Threading Scalability milestone. They are recorded here so the next selected scheduler milestone does not extend or formalize speculative SMP scaffolding that the current per-CPU WFQ scheduler does not need.

The cleanup is subordinate to the current selected milestone and to already-open review-finding task records. Pick it up as Phase E preflight work before SchedulingContext claims the scheduler surface. Each removal must preserve the documented runnable-ownership invariants from docs/architecture/scheduling.md (single dispatch owner per live ThreadRef across per-CPU current/handoff_current slots, the per-CPU WFQ run queues, and the direct IPC target; scheduler-lock-contained migration; allocation-free timer/unblock/direct-IPC-fallback/requeue/steal-requeue paths) and the recorded benchmark-only counter policy. The 2026-05-02 per-CPU run-queue collapse and the accepted 2026-05-10 Phase D WFQ reintroduction are now both historical evidence: the single-global-queue shape had accepted 1-to-2 evidence but a 1-to-4 diagnostic gap (capOS 1.566x/1.538x vs Linux 3.963x/3.858x), and Phase D manually accepted the 2026-05-10 per-CPU WFQ 1-to-4 diagnostic (capOS 3.088x/2.700x; matching Linux 3.974x/3.850x on the same pin set) after the harness-enforced 1-to-2 gates stayed green.

Grounding read before any slice:

  • docs/architecture/scheduling.md
  • docs/proposals/scheduler-evolution-proposal.md
  • docs/proposals/smp-proposal.md
  • docs/backlog/smp-phase-c.md
  • kernel/src/sched.rs
  • kernel/src/process.rs
  • kernel/src/measure.rs
  • kernel/src/arch/x86_64/{smp.rs,lapic.rs,percpu.rs,tlb.rs}

Acceptance rule for every slice below: each removal must land with a host or QEMU test that fails without it, so a future reintroduction is explicit authority work rather than silent regression of an undocumented feature.

  • 2026-05-02 08:07 UTC: Retired the timer continuation fast path, its per-CPU skip budget, and the slow-path-required mirror flags. Deleted try_continue_current_on_timer_tick, mark_timer_slow_path_required, reset_current_cpu_timer_fast_path_skip_count, note_timer_slow_path_completed_locked (both feature variants), scheduler_has_hard_timer_slow_path_work_locked_excluding_endpoint_queue, scheduler_timer_slow_path_reasons_locked, the TimerBlockedWaiterKind / blocked_thread_* helpers, and the four atomic mirrors TIMER_SLOW_PATH_REQUIRED, TIMER_FAST_PATH_SKIP_COUNTS, CURRENT_NON_IDLE_CPUS, and TIMER_FAST_PATH_MAX_CONSECUTIVE_SKIPS. set_current_thread_locked no longer publishes CURRENT_NON_IDLE_CPUS. The timer interrupt entry in kernel/src/arch/x86_64/context.rs now always calls crate::sched::schedule(context) instead of trying the lock-free fast path. Eight mark_timer_slow_path_required() call sites in kernel/src/sched.rs (run-queue publish, pending process drop, park-with-deadline, process termination queue, direct-IPC handoff, timer sleep enqueue, cap-enter-with-deadline, pending thread stack release, pending endpoint cancellation push) also dropped — they are no-ops once the fast path no longer exists. Verified that make run-spawn exits cleanly ([init] Spawn cap-table exhaustion check ok., proc: process 2 exited with code 0, sched: last process exited, halting) and make run-smoke runs the scripted login flow to operator session. cargo build --features qemu is warning-free (project rule). Reintroduce the fast path only if a future Phase D or Phase F slice ships an evidence pair where it measurably reduces scheduler-lock hold time on a contended SMP run.

    Follow-up partial 2026-05-02 08:39 UTC: `kernel/src/measure.rs`
    lost the eight public API entry points (`timer_fast_path_attempt`,
    `timer_fast_path_continue`,
    `timer_fast_path_slow_required_fallback`,
    `timer_fast_path_skip_budget_fallback`,
    `timer_fast_path_pending_reschedule_fallback`,
    `timer_fast_path_no_current_non_idle_fallback`,
    `timer_fast_path_inactive_invalid_cpu_fallback`, and
    `timer_slow_summary`) plus the now-orphaned `TimerSlowSummaryReasons`
    struct and its `requires_slow_path` impl. `cargo build --features
    qemu,measure` is back to warning-free.
    
    Follow-up complete 2026-05-02 21:00 UTC: the deeper deletion slice
    removed the seven `TIMER_FAST_PATH_*` static counters, the
    `TimerCounter::FastPath*` enum variants, the
    `TimerSlowSummaryCounter` enum, the `TIMER_SLOW_SUMMARY_*` counter
    arrays (`TIMER_SLOW_SUMMARY_COUNTER_VALUES`,
    `CASE_START_TIMER_SLOW_SUMMARY_COUNTERS`,
    `PREVIOUS_TIMER_SLOW_SUMMARY_COUNTERS`,
    `PHASE_TIMER_SLOW_SUMMARY_COUNTERS`), the
    `(TimerSlowSummaryCounter, &str)` reporting table, the
    `Snapshot.timer_slow_summary_counters` field, and the matching
    reset/diff/print helpers and accessors. `TIMER_COUNTER_COUNT`
    shrank from 11 to 4 (interrupts, user_scheduler, kernel_only,
    bsp_tick_advances). The `measure: timer ...` line is now compact
    and the `measure: timer_slow_summary ...` line is no longer
    emitted at all. `tools/qemu-thread-scale-harness.sh` dropped the
    `fast_path_*` clauses and the `timer_slow_summary` aggregate /
    per-phase grep checks in the same slice, satisfying the
    "removal must land with a host or QEMU test that fails without it"
    acceptance rule. Verified with `make fmt-check`,
    `cargo build --features qemu` (warning-free),
    `cargo build --features qemu,measure` (warning-free),
    `cargo test-lib` (171 passed), `make run-spawn`, and `make
    run-measure` (proof line emitted, exit 0). A local one-iteration
    `CAPOS_THREAD_SCALE_RUNS=1 CAPOS_THREAD_SCALE_GUEST_MEASURE=1 make
    run-thread-scale` was used solely as functional verification of
    the harness parser against the new measure-output shape (no CPU
    pinning, single iteration; the run reported `qemu taskset cpus:
    none` and the resulting medians/speedups are diagnostic only).
    This slice is a measure-output cleanup, not a scheduler-structure
    change, so it does not require controlled benchmark-VM timing
    evidence under the Phase A "before/after each scheduler structure
    change" rule; the harness fail-without-the-kernel-change pairing
    is the acceptance gate.
    
  • 2026-05-01 22:01 UTC: Collapsed the asymmetric scheduler CPU sizing. MAX_SCHEDULER_CPUS = 64 was deleted, MAX_SCHEDULER_CLEANUP_CPUS = 4 was renamed to a single SCHEDULER_CPUS = 4, and SchedulerDispatch.current[] resized from 64 to SCHEDULER_CPUS to match run_queues, handoff_current, idle_pids, idle_threads, pending_thread_stack_release, TIMER_FAST_PATH_SKIP_COUNTS, and SCHEDULER_CPU_MASK. The dual current_cpu_slot() / current_cleanup_slot() helpers collapsed into a single current_cpu_slot() that bounds-checks against SCHEDULER_CPUS and panics on overflow with "scheduler: CPU id {} exceeds scheduler-owned mask". scheduler_cpu_slot(cpu_id) -> Option<usize> retained for the non-panicking lookup. The earlier “raw CPU id 0..63 vs scheduler slot 0..3” indexing distinction is gone. Reintroduce a wider id-to-slot mapping only when a Phase D/F slice grows the scheduler-owned mask beyond the current four. Verified with cargo build --features qemu and cargo build --features qemu,measure (both warning-free) plus make run-smoke and make run-spawn on 2026-05-01.

  • 2026-05-02 09:26 UTC: Replaced the per-CPU run-queue array with a single global run_queue: VecDeque<ThreadRef>. SchedulerDispatch keeps run_queue_live_reservations as a single counter; the reserve_run_queue_capacity_for_thread_locked / release_run_queue_capacity_reservations_locked / push_reserved_run_queue_locked triple still bounds growth but operates on the single queue. enqueue_ready_thread_on_cpu_locked, run_queue_target_cpu_locked, the created_thread_target_cpu_locked placement chain (active_ready_scheduler_cpu_mask, non_idle_dispatch_load_locked, least_loaded_scheduler_cpu_*, caller_current_scheduler_cpu_slot_locked), the CreatedThreadPublishPolicy / CreatedThreadTarget types, the scheduler_cpu_scan_order helper, and the crate::measure::thread_placement_publish_caller_* reporting surface are all gone. WakePolicy::QueueCpu(usize) collapsed to WakePolicy::QueueAny. wake_idle_scheduler_cpus_locked walks eligible idle scheduler CPUs and stops only after the first one that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking, so a burst of ready work cross-wakes more than one neighbor for both queue and direct-target wakes. publish_created_thread no longer takes a caller_thread argument and no longer emits a per-CPU placement record: under the single global queue there is no per-CPU publish target, and hard-coding CPU0 misclassified normal worker publishes as single-owner-CPU0. Phase D later reintroduced the per-CPU split without restoring those publish counters; reintroduce them only through a separate operator-observability slice.

    Verified with `cargo build --features qemu` and `cargo build
    --features qemu,measure` (both warning-free) plus `make run-spawn`
    and `make run-smoke`. A post-collapse 3-run diagnostic
    `make run-thread-scale` on the benchmark VM (`taskset 0,1,2,3`,
    enforcement disabled) on 2026-05-02 10:42 UTC measured
    1-to-2 work/total `1.890x`/`1.792x` (slight improvement over the
    pre-collapse 1-to-2) and 1-to-4 work/total `1.504x`/`1.436x`
    (clear regression vs the pre-collapse 1-to-4): single-queue
    scheduler-lock contention dominates at 4 workers. The numbers
    live in `docs/benchmarks.md` as diagnostic. Phase D later
    brought per-CPU queues back with a fair-share enqueue policy and
    formal accepted evidence (capOS plus Linux baseline, full
    enforcement, multiple runs, recorded host caveats).
    
  • 2026-05-02 07:00 UTC: Lifted endpoint-cancellation retry storage out of the scheduler lock. The pending_endpoint_cancellations: VecDeque field is gone from Scheduler; it now lives in a dedicated static PENDING_ENDPOINT_CANCELLATIONS: Lazy<Mutex<VecDeque<...>>> with bounded try_reserve_exact(MAX_PENDING_ENDPOINT_CANCELLATIONS) reservation, eagerly forced in init_idle via Lazy::force so the allocation never lands in a timer/exit cleanup path. The queue’s len() under its own mutex is the single source of truth for pending_endpoint_cancellations non-emptiness. Producers (queue_pending_endpoint_cancellation, remove_pending_endpoint_cancellations_for_pid, remove_pending_endpoint_cancellations_for_thread) and the drain (drain_pending_endpoint_cancellations) take only the queue mutex; the scheduler lock is acquired only briefly inside queue_pending_endpoint_cancellation to validate the target thread is live and has a ring scratch. defer_endpoint_cancellation previously re-acquired the scheduler lock just to push to the fallback queue; that re-acquisition is gone.

    `note_timer_slow_path_completed_locked` (consumer) holds the queue
    mutex across both the `!is_empty()` check and the
    `TIMER_SLOW_PATH_REQUIRED.store`, and the producer
    `queue_pending_endpoint_cancellation` stores
    `TIMER_SLOW_PATH_REQUIRED = true` inside the queue lock alongside
    its push, so a concurrent producer cannot push between the
    consumer's read and store and have its slow-path mark be overwritten.
    
    The functional contract is preserved: a cancellation that cannot
    deliver immediately because the target ring scratch is contended
    still falls back to the bounded retry queue, still raises
    `TIMER_SLOW_PATH_REQUIRED`, and is still drained on the next
    scheduler tick. Bound is unchanged
    (`MAX_PENDING_ENDPOINT_CANCELLATIONS = MAX_CAP_SLOTS *
    MAX_ENDPOINT_CANCELLATION_OBJECT_SWEEPS *
    MAX_ENDPOINT_CANCEL_NOTIFICATIONS_PER_ENDPOINT * SCHEDULER_CPUS`);
    the open size-tightening question (whether the `SCHEDULER_CPUS`
    multiplier is still load-bearing now that producers no longer hold
    the scheduler lock) is deferred to a future slice with bench evidence.
    
    A possible follow-on slice would move retry storage to per-endpoint
    bounded slots so each endpoint object owns its own queue, but that
    requires reshaping the `(thread, user_data)` payload to be addressable
    from an endpoint object and is non-trivial. The current move is
    sufficient to get the storage out of the scheduler lock and unblock
    future scheduler-lock-hold-time analysis.
    
    Verified with `cargo build --features qemu` and
    `cargo build --features qemu,measure` (both warning-free) plus
    `make run-spawn` and `make run-smoke` on 2026-05-02. Review found and
    fixed a Lazy-init in interrupt paths and a slow-path-clearing race
    against producer publication.
    
  • 2026-05-01 21:38 UTC: Feature-gated the first ThreadCpuAccounting experiment end-to-end behind cfg(feature = "measure"). That slice temporarily compiled the whole accounting record, its accessors, and scheduler call sites only when the feature was enabled. Phase D later superseded this temporary shape: runtime_ns, virtual_runtime_ns, and last_started_ns are now unconditional normal-build fields because WFQ ordering, SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend on them. The remaining diagnostic counters (context_switches, preemptions, voluntary_blocks, migrations, last_cpu, blocked/exited stability observations, placement buckets, and per-phase attribution counters) stay behind cfg(feature = "measure"). The 2026-05-01 slice was verified with cargo build --features qemu and cargo build --features qemu,measure (both warning-free) plus make run-spawn (non-measure default) on 2026-05-01. make run-measure was broken on main at the time of this slice for unrelated reasons; that regression was repaired on 2026-05-02 20:23 UTC (see docs/backlog/scheduler-evolution.md and the docs/changelog.md Measure Mode Repair entry).

  • 2026-05-01 21:02 UTC: Retired the RUNNABLE_PROCESS_EXIT_CLEANUP_PROOF_PRINTED, RUNNABLE_THREAD_EXIT_CLEANUP_PROOF_PRINTED, and CPU_ACCOUNTING_PROOF_PRINTED once-flag log lines along with their Atomic* gating booleans, the three print_*_once / maybe_print_*_for_thread_locked helpers in kernel/src/sched.rs, and their four call sites. The runnable-cleanup invariants remain enforced by the unconditional assert_no_runnable_pid_entry_locked and assert_no_runnable_thread_entry_locked panics already in kernel/src/sched.rs; a regression that leaves stale runnable owner state still panics the kernel and fails make run-spawn. The tools/qemu-spawn-smoke.sh harness lost its three matching grep -Fq lines for the same reason. The orphaned Process::account_thread_exited_stable_observed / ThreadCpuAccounting::observe_exited_stable helpers were deleted with the print; the remaining ThreadCpuAccounting writes stay untouched for the upcoming feature-gate slice. The pub fn thread_cpu_accounting accessor moved behind cfg(feature = "measure") because its only remaining caller is the measure-gated account_thread_selected_locked placement counter bridge.

  • Cache the active CPU id in the per-CPU GS-relative slot. arch::percpu::current_cpu_id reads the LAPIC ID MMIO register and then linearly scans CPU_LAPIC_IDS[0..64] on every call. The timer fast-path consumer was retired on 2026-05-02 (see the “Retired the timer continuation fast path” entry above), but the function still runs from the syscall path and from non-syscall kernel contexts: arch::context::advance_bsp_tick, the scheduler’s CPU-slot accounting and dispatch lookups in sched.rs, arch::tlb::flush_pending_for_current_cpu, and mem::paging invalidation paths. The hot caller is the syscall entry path; the non-syscall callers are why a drop-in GS-relative replacement is harder than the cleanup item first suggested. The single-mov lookup conceptually wants mov %gs:offset, %eax, but the slice is blocked on a kernel-mode GS-base invariant: today the kernel sets KernelGsBase via set_kernel_gs_base and only the syscall assembly does swapgs to make gs:0..16 resolve at PerCpu while handling a syscall. In normal kernel context (timer ISR, scheduler from non-syscall paths, paging init, AP bring-up), the active GS base is whatever Limine left, not the PerCpu address. A drop-in replacement of current_cpu_id with gs:[offset] therefore faults outside syscall context (verified 2026-05-02: reordering init_bsp to set KernelGsBase before set_kernel_entry_stack is necessary but not sufficient because the active GS base is still not the PerCpu address). The enabling work is establishing a kernel-mode invariant that GS_BASE = PerCpu in CPL0 (typically by swapgs-ing on every kernel entry/exit, including interrupt handlers), or by adopting a hybrid: GS-relative read in the syscall path plus the existing LAPIC-based path everywhere else. Both paths are larger than a single retirement slice and should land with their own gates. Until then this item stays open and current_cpu_id keeps the LAPIC MMIO + CPU_LAPIC_IDS scan.

  • Reassess the scheduler-lock-site instrumentation breadth. SchedulerLockSite, the SchedulerLockGuard/measured_lock wrappers, the dual cfg(feature = "measure") scheduler_lock/scheduler_lock_site paths, and the eight per-site counter axes in kernel/src/measure.rs were added when the global scheduler lock was the suspected scaling bottleneck. After the runqueue/dispatch split landed and the documented per-CPU ownership invariants stabilized, decide which sites still justify dedicated counters and which should fold back into the aggregate scheduler_lock line. Keep the cfg(feature = "measure") gating; reduce the surface so reading the scheduler still reads as one lock acquisition path under non-measure builds.

  • Reassess single_cpu_owner_pids, direct_ipc_target, and handoff_current before Phase E starts. The single-owner pinning policy, the one-slot direct-IPC handoff, and the per-CPU handoff guard each special-case a small subset of the dispatch flow; document or delete each one against the accepted Phase D fair-policy behavior before SchedulingContext work depends on it. Do not delete them speculatively: the cross-process IPC and process/thread exit cleanup proofs depend on the current direct-IPC and handoff invariants.

  • Keep an honest scaling proof when scheduler work resumes. Completed 2026-05-02 21:38 UTC on the benchmark VM against main commit 374f8556. Five-run controlled paired evidence, both runs pinned to physical-core logical CPUs 0,1,2,3 on a 4-core/8-thread n2-highcpu-8 host with KVM:

    | Comparison | capOS | Linux pthread | capOS gate | capOS verdict |
    | --- | ---: | ---: | ---: | --- |
    | 1→2 work  | `1.883x` | `1.988x` | ≥ `1.6x` | accepted |
    | 1→2 total | `1.787x` | `1.987x` | ≥ `1.6x` | accepted |
    | 1→4 work  | `1.566x` | `3.963x` | ≥ `1.6x` | diagnostic |
    | 1→4 total | `1.538x` | `3.858x` | ≥ `1.6x` | diagnostic |
    
    Linux scales near-linearly on the same physical CPU set (1-to-2
    `1.99x`, 1-to-4 `3.96x`), so the workload shape is sound and the
    capOS 1-to-4 gap is a scheduler bottleneck, not a benchmark
    artifact. The 1-to-2 result was the formal accepted gate against
    the single-global-queue scheduler. The 1-to-4 result became the
    bottleneck-attribution diagnostic that justified Phase D's fair-share
    enqueue policy; Phase D later manually accepted the `2026-05-10` WFQ
    1-to-4 diagnostic pair recorded above while the harness-enforced gates
    remained the 1-to-2 work/total speedups.
    
    Benchmark shape: blocking parent join, 262,144 blocks (16 MiB),
    `work_rounds=64`, 5 runs per case (the capOS harness default is 3
    runs; this collection explicitly set `CAPOS_THREAD_SCALE_RUNS=5`
    for parity with the Linux baseline default). Host caveats:
    internal benchmark VM in a single GCP zone, status `RUNNING`
    during collection, machine `n2-highcpu-8` with nested
    virtualization enabled, `/dev/kvm` readable+writable without
    sudo, SSH operator account, kernel `Linux 6.17.0-1012-gcp
    x86_64`, CPU `Intel(R) Xeon(R) CPU @ 2.80GHz`, distinct
    physical-core layout (logical CPUs 0-3 are core IDs 0-3 thread
    0; logical CPUs 4-7 are the SMT siblings), `qemu-system-x86_64
    8.2.2`, `rustc 1.97.0-nightly (c935696dd 2026-04-29)`.
    
    Exact commands:
    
    ```sh
    # capOS
    PATH="$HOME/.cargo/bin:$PATH" \
      CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
      CAPOS_THREAD_SCALE_RUNS=5 \
      CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 \
      CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 \
      CAPOS_THREAD_SCALE_TIMESTAMP=20260502T213544Z \
      make run-thread-scale
    
    # Linux pthread baseline
    PATH="$HOME/.cargo/bin:$PATH" \
      LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
      LINUX_THREAD_SCALE_RUNS=5 \
      LINUX_THREAD_SCALE_TIMESTAMP=20260502T213445Z \
      make run-linux-thread-scale-baseline
    ```
    
    Raw artifacts on the benchmark VM at
    `target/thread-scale/20260502T213544Z/` and
    `target/linux-thread-scale/20260502T213445Z/`. The instance was
    stopped after collection.
    

Research And Design Gaps Backlog

This file tracks important OS design, development, and user-story areas that are absent, thinly covered, or only indirectly owned by existing capOS proposals. It is a triage register, not an execution queue. Listing a gap here does not change the selected milestone in docs/tasks/state.toml and does not mean the project should immediately create a full proposal.

Promote an entry out of this file only when a visible milestone, paper evidence gap, review finding, or explicit user direction makes the area actionable. Promotion targets:

  • docs/research/ for prior-art survey or external precedent.
  • docs/proposals/ for a concrete reviewed design direction.
  • A focused docs/backlog/ file when the design is accepted enough to decompose implementation.
  • docs/design-risks-register.md when the gap is an active architectural risk with an owner.

Status Vocabulary

  • Uncovered: no owned design exists yet.
  • Thin: mentioned indirectly, but no coherent owner or decision record.
  • Backlog-only: task decomposition exists without a full proposal.
  • Research-needed: design should not start before prior-art review.
  • Ready-for-proposal: enough constraints exist to draft a proposal.
  • Deferred: intentionally future work, not a near-term blocker.
  • Rejected: considered and explicitly not pursued.

Promotion Checklist

Before creating a proposal from an entry here:

  • Identify the visible user or operator outcome the work would enable.
  • List existing capOS docs that already partially cover the area.
  • List the docs/research/ files actually read, or explain why no research file applies.
  • Decide the first capability boundary or trust boundary that must be designed.
  • Define one QEMU, host-test, documentation, or review gate that would prove the proposal made progress.

Display, GUI, And Input

Status: Uncovered.

User story: a user boots capOS on a desktop, laptop, or remote graphical session and uses multiple graphical apps with keyboard, pointer, clipboard, accessibility, and app isolation.

Current coverage: browser, browser/WASM, agent, GPU, and shell proposals point toward future visual sessions, but there is no native display server, compositor, input routing, window authority, clipboard, screenshot, or accessibility model.

Missing decisions:

  • Display ownership and framebuffer/GPU authority.
  • Compositor trust boundary and per-window capability model.
  • Keyboard, pointer, touch, IME, and focus authority.
  • Clipboard and drag/drop data-transfer policy.
  • Screen capture and remote desktop authority.
  • Accessibility-service authority and privacy boundaries.

Research needed:

  • Genode GUI/session routing and report-ROM style composition.
  • Wayland compositor security model and clipboard limitations.
  • Fuchsia Scenic/input pipeline if the native GUI track becomes near-term.
  • seL4/CapROS precedents for trusted path or secure attention, if applicable.

Promote when: native graphical sessions, browser UI, desktop app isolation, or rich web/agent interaction becomes a selected milestone.

Driver Framework And Hotplug

Status: Thin.

User story: an operator plugs in a device, capOS identifies it, starts or restarts the correct isolated driver, and exposes only the intended typed capabilities.

Current coverage: docs/dma-isolation-design.md, docs/backlog/hardware-boot-storage.md, networking, storage, cloud, and GPU proposals cover pieces of device work. There is no general driver framework for discovery, binding, isolation, recovery, firmware, or hotplug.

Missing decisions:

  • Device discovery authority and driver matching policy.
  • Driver process lifecycle, crash restart, and stale handle behavior.
  • Firmware loading and firmware provenance.
  • Hotplug attach/detach semantics.
  • Interrupt, MMIO, DMA, and power authority handoff.
  • User-space driver SDK boundaries and test harnesses.

Research needed:

  • Genode driver components and session routing.
  • Zircon/Fuchsia driver framework concepts.
  • Linux VFIO/uio and userspace-driver isolation tradeoffs.
  • seL4 device-driver partitioning examples.

Promote when: userspace NIC, block-device, USB, GPU, or real hardware bring-up requires reusable driver lifecycle rules.

Power, Suspend, Resume, And Thermal Policy

Status: Uncovered.

User story: a laptop or VM can sleep, wake, preserve sessions, and report power or thermal limits without leaking stale authority or corrupting timers.

Current coverage: tickless scheduling covers timer cleanup and idle mechanics, but not power management as an OS product area.

Missing decisions:

  • Suspend/resume authority and system-wide quiesce protocol.
  • Wake-source capabilities and audit.
  • Battery, charger, lid, and thermal sensor surfaces.
  • CPU frequency, C-state, and thermal-throttling policy.
  • Timer and network behavior across sleep.
  • Session and service liveness after resume.

Research needed:

  • ACPI power-state model and Linux suspend blockers/wakeup sources.
  • Fuchsia power framework if relevant.
  • Genode power-management patterns for component systems.

Promote when: laptop hardware, cloud hibernation, low-power idle, or interactive remote-shell reliability needs sleep/resume semantics.

Time, Clock, And Trusted Timestamp Services

Status: Promoted to proposal (2026-05-22). See Time and Clock Authority and the prior-art note Time and Clock Authority research. Residual research (servo/loop-filter, holdover/error-bound, suspend recovery) is noted in that proposal. Original gap status was Thin.

User story: services can distinguish monotonic time, wall-clock time, and trusted audit time, and cannot silently forge system time.

Current coverage: scheduler and tickless proposals mention clocks, timers, deadlines, and clocksource/clockevent split. There is no user-facing time authority model.

Missing decisions:

  • Monotonic, boot, realtime, and coarse clock capability surfaces.
  • Who can set wall-clock time and how changes are audited.
  • NTP/PTP/cloud-metadata time synchronization authority.
  • Timezone and locale data ownership.
  • Leap-second and clock-step behavior.
  • Timestamp trust level carried into audit records.

Research needed:

  • Linux clock ids, adjtimex/NTP discipline, and time namespaces.
  • Fuchsia clock objects and UTC maintenance.
  • Cloud metadata time and attestation interactions.

Promote when: audit log completion, TLS certificate validation, distributed services, or durable storage needs trusted timestamp semantics.

Software Installation, Packages, And Rollback

Status: Thin.

User story: a user or operator installs an app, inspects requested authority, updates it, rolls it back, and removes its state without ambient filesystem assumptions.

Current coverage: repository composition, storage/naming, userspace binaries, live upgrade, cloud deployment, and public-release proposals cover adjacent pieces. There is no package/app distribution model.

Missing decisions:

  • Package manifest schema and authority-request review.
  • Signed repositories, update channels, and revocation.
  • Dependency resolution and build provenance.
  • App install/remove lifecycle and state ownership.
  • Rollback, staged rollout, and compatibility policy.
  • Vulnerability advisory and emergency update workflow.

Research needed:

  • Nix/Guix, OSTree, Flatpak portals, Android package permissions, and Fuchsia package/update system.
  • Supply-chain signing systems such as TUF/in-toto/Sigstore if this becomes release-critical.

Promote when: capOS needs installable demos, sibling repositories, public release, or cloud image update flow.

Crash Recovery, Supervision, And Diagnostics

Status: Promoted to proposal (2026-05-22). See Crash Recovery and Supervision and the prior-art note Crash Recovery and Supervision research. Residual research (Fuchsia component-manager escrow semantics) is noted in that proposal. Original gap status was Thin.

User story: a service crashes; init or an authorized supervisor restarts it or enters a known degraded mode without leaking authority, hiding the cause, or looping forever.

Current coverage: service architecture already sketches SpawnRequest restart policy, supervisor-owned respawn, and always/on-failure restart modes; capos-service covers service lifecycle pieces; live-upgrade planning ties fault containment to supervisor respawn; and system monitoring covers logs/metrics/crash records at a high level. Crash-loop budgets, core/minidump capture, degraded-mode semantics, watchdog policy, and stale/in-flight cleanup are still not owned as one recovery design.

Missing decisions:

  • Restart policy authority and failure budget.
  • Crash-loop backoff and operator override.
  • Core dump or minidump capture with capability redaction.
  • Watchdog and health-check capabilities.
  • Degraded boot and emergency shell semantics.
  • Stale capabilities and in-flight calls after service death.

Research needed:

  • Erlang/OTP supervision trees, systemd restart policy, Kubernetes probes, and Fuchsia component lifecycle.
  • Capability-system precedent for crash propagation and service replacement.

Promote when: shared services, remote shell, storage, or agent workloads need production-grade recovery behavior.

Backup, Restore, Snapshots, And Migration

Status: Thin.

User story: an operator loses a disk or VM and restores users, services, keys, and app state while avoiding stale authority and accidental data disclosure.

Current coverage: storage/naming, cloud deployment, and the hardware/boot/ storage backlog already cover narrower pieces: user-owned encrypted save transport, fake Drive/Firebase restore rejection tests, rollback/stale handling, and cloud-backed snapshot material. System-wide disaster recovery for users, services, keys, machine identity, and authority state is still not owned as one design.

Missing decisions:

  • Snapshot capability boundary and consistency protocol.
  • Encrypted export/import and restore identity.
  • Key recovery and disaster recovery drills.
  • Partial restore and per-service state ownership.
  • Backup retention, deletion, and privacy policy.
  • Migration between machines or cloud instances.

Research needed:

  • ZFS/Btrfs snapshot semantics, Borg/Restic encrypted backup models, and cloud snapshot/key-management practices.
  • Capability-specific concerns from EROS/CapROS persistence if applicable.

Promote when: writable storage, durable local accounts, volume encryption, or cloud deployment becomes near-term.

Human-Facing Administration And Explainability

Status: Thin.

User story: an operator can answer who has access to a service, why, since when, what will happen if access is revoked, and why a request was denied.

Current coverage: shell, system info, local users, system monitoring, configuration, and security proposals cover pieces. There is no unified administrator UX or policy explainability track.

Missing decisions:

  • Account and role management commands or UI.
  • Grant inspection, diff, revoke, and dry-run behavior.
  • Denial explanation format across kernel, broker, and services.
  • Audit search and incident timeline views.
  • Diagnostics bundle generation and redaction.
  • Safe repair workflow for broken configuration or policy.

Research needed:

  • Kubernetes RBAC can-i/audit practices.
  • Cloud IAM policy simulators and access-analyzer tools.
  • Genode configuration/reporting UX for component graphs.

Promote when: local users, ABAC/MAC, remote shell, or operator configuration needs day-2 administration rather than proof-only commands.

Developer Debugging, Profiling, And Tooling

Status: Partially promoted to proposal (2026-05-22). The debug/trace/profile authority slice is now Debug and Trace Authority with the prior-art note Debug, Trace, and Profiling Authority research. The broader developer-tooling surface (service templates, local SDK, schema explorer, request-replay) remains Thin and is not yet owned by a proposal.

User story: a developer writes a capOS service, runs it locally, debugs a failed capability call, profiles it, and ships it with reproducible evidence.

Current coverage: harness engineering, benchmarks, generated-code checks, run-targets, and the paper evidence track cover pieces. There is no full debugger/profiler/developer-tooling proposal.

Missing decisions:

  • Debug authority and process attach policy.
  • Symbols, stack traces, crash dumps, and source maps.
  • Ring/syscall/capability-call tracing.
  • Service schema explorer and request replay tooling.
  • Guest profiling, flamegraph, and benchmark attribution workflow.
  • App/service templates and local developer SDK.

Research needed:

  • GDB remote protocol, Linux perf/eBPF-style tracing boundaries, Fuchsia diagnostics, and seL4 debug authority practices.

Promote when: non-trivial third-party services, public release, or performance claims need repeatable developer workflows.

Compatibility And App Porting Strategy

Status: Thin.

User story: a developer ports a small existing CLI or server to capOS and knows which Unix assumptions work, fail, or require explicit capability adapters.

Current coverage: userspace binaries, Go, Lua, POSIX adapters, WASI, C/C++, and language-runtime proposals mention porting targets. There is no concrete compatibility profile matrix.

Missing decisions:

  • Minimal libc/POSIX surface and unsupported-call policy.
  • Filesystem, environment, argv, signal, pipe, socket, and process semantics.
  • Dynamic linking and shared-library policy.
  • WASI adapter authority model.
  • Build recipes and package corpus selection.
  • Porting report template and acceptance tests.

Research needed:

  • WASI preview models, CloudABI history, Redox, Hermit, Fuchsia POSIX layer, and Genode libc/VFS integration.

Promote when: a language runtime, POSIX adapter, or real application corpus becomes a selected milestone.

Accessibility And Internationalization

Status: Uncovered.

User story: non-English users and assistive-technology users can operate capOS shells, graphical sessions, and web/agent surfaces without privileged workarounds.

Current coverage: none beyond general shell/browser surface discussions.

Missing decisions:

  • Unicode, locale, collation, and timezone data ownership.
  • Input methods and keyboard layout authority.
  • Screen reader or accessibility tree service boundary.
  • High-contrast, font scaling, and reduced-motion policy.
  • Translation and message-catalog strategy.
  • Accessible denial/audit messages and setup flow.

Research needed:

  • Web accessibility platform architecture, Wayland accessibility status, Fuchsia accessibility manager, and terminal accessibility conventions.

Promote when: graphical sessions, public demos, web shell, or production interactive setup becomes user-facing beyond developer/operator proof flows.

Fleet Operations And Remote Management

Status: Thin.

User story: an operator manages many capOS nodes and can prove which version, policy, keys, services, and update state each node is running.

Current coverage: cloud deployment, cloud metadata, system monitoring, configuration, hosted agents, and public release cover adjacent concerns. There is no fleet-management design.

Missing decisions:

  • Node enrollment and identity bootstrap.
  • Remote attestation and inventory reporting.
  • Configuration rollout and drift detection.
  • Remote logs, metrics, and audit aggregation.
  • Staged update and rollback policy.
  • Break-glass access and emergency revocation.

Research needed:

  • Kubernetes node/bootstrap models, cloud instance identity, SPIFFE/SPIRE, TPM/measured boot attestation, and OSQuery-style inventory.

Promote when: cloud deployment, hosted agent swarms, public release, or remote administration becomes more than a single-node proof.

Privacy And Data Governance

Status: Thin.

User story: a user can see and revoke what data a service can access, and deleted data does not unintentionally persist in logs, backups, or derived indexes.

Current coverage: capability authority, session privacy, audit redaction, identity policy, storage, monitoring, and browser/agent proposals cover parts of the problem. There is no explicit data-governance design.

Missing decisions:

  • Data classification and purpose-bound access metadata.
  • Retention, deletion, and legal-hold semantics.
  • Derived data, indexes, caches, and backup deletion behavior.
  • User consent and service data export.
  • Audit redaction versus forensic retention.
  • Cross-service data-sharing policy and review UX.

Research needed:

  • Object-capability privacy patterns, GDPR-style data lifecycle controls, browser permission UX, and cloud DLP/data catalog practices.

Promote when: persistent user data, browser/agent activity, hosted services, or public release introduces real privacy expectations.

Security And Verification Backlog

Detailed decompositions for security and verification work. docs/tasks/README.md links here but should not inline these subtasks.

Stage-6 Trust-Boundary Refresh

  • Refresh trust-boundary docs after Stage 6 IPC/capability-transfer work.

Untrusted-Service Hardening Pass

Cover unmapped pointers, kernel-half pointers, invalid capability IDs, corrupted rings, SQ/CQ overflow behavior, and a service without Console authority. Audit manifest, ELF, SQE, params, and result-buffer paths so untrusted input fails closed instead of reaching kernel panic paths.

Completed context:

  • Panic-surface inventory: audited panic!, assert!, unwrap, and expect reachable from manifest, ELF, SQE, params, result-buffer, IPC, and spawn inputs.
  • Ring/user-pointer hostile demos: added unmapped params/result-pointer, kernel-half params-path, invalid-capability-ID, corrupted RETURN call_id, corrupted SQ/CQ head, undersized-params, undersized-result, and SQ/CQ overflow coverage.
  • No-authority smoke: empty-CapSet service verifies expected cap lookups fail and invalid-cap CALLs return controlled CQEs; after removal of syscall 0, it proves a no-authority process cannot write and can only exit/cap_enter.

Remaining decomposition:

  • Quota and exhaustion smokes (make run-untrusted-exhaustion, two QEMU passes; covered 2026-05-25 06:42 EEST):
    • Cap-table and endpoint-queue exhaustion fail closed without corrupting existing calls. Endpoint-queue is proven by the small-scratch core pass (per-owner queue ceiling -> Overloaded, then a held console call still completes). Cap-table is proven by the small-scratch core pass and the default-profile *-captable companion pass: single-frame MemoryObject allocations first return bounded FrameAllocator success replies, then continue until the per-process cap-slot ledger fails closed (Overloaded: failed to reserve MemoryObject cap slot); a held console call still completes after the boundary.
    • Scratch/result-buffer pressure returns controlled errors and later valid calls still complete (core pass: ring-scratch oversize CALL rejected with CAP_ERR_INVALID_REQUEST, reply-scratch clamp returns a serialized exception, then a valid console write completes).
    • Repeated invalid submissions stay bounded: each structurally invalid SQE returns a controlled per-SQE error CQE and the ring stays usable (a recovery NOP completes). Note: the per-key token-bucket log aggregation in docs/authority-accounting-transfer-design.md §3 (D1/D2 suppressed- count summary line) is still a design target, not implemented; the smoke asserts bounded per-SQE rejection, not the summary line.
    • Frame-grant-page exhaustion: not cleanly reachable from a smoke. For single-page allocations the cap-slot ceiling (PROCESS_CAP_SLOT_LIMIT, 256) is reached far before the frame-grant ledger (PROCESS_FRAME_GRANT_PAGE_LIMIT, 4096 pages), and reaching 4096 grant pages needs large contiguous allocations whose failure mode is physical fragmentation, not the grant ledger. The cap-table pass exercises the same fail-closed preflight-reserve path. Remaining gap.
  • Fail-closed cleanup: the FrameAllocator success-path result serialization now honors the caller’s effective reply-scratch capacity, so small-scratch processes can receive bounded MemoryObject result caps before cap-slot exhaustion fails closed. Closed by security-reply-scratch-success-path-limit-local-proof.

Kani Harness Bounds Refresh

  • Revisit Kani harness bounds and proof shape once capability transfer, resource accounting, or user-buffer validation has more concrete proof obligations. Keep current bounds practical for make kani-lib; expand only when the added verifier cost buys a specific kernel invariant.

DMA Assurance Model Operationalization

dma-assurance-model-v0 (2026-05-24) landed the accepted proposal (docs/proposals/dma-assurance-model-proposal.md) and inspectable-only TLA+/Alloy skeletons (models/dma/), but stopped there: no run target, no CI gate, no reconciliation with DMA code landed since. Kickoff task: dma-assurance-model-operationalization (decomposition — reconciled the v0 model with landed code and emitted the per-tool slices below).

  • Reconcile models/dma/ with landed invariants (ownership-generation on recycle, map-record-before-PTE-install ordering, drive-pin, epoch fence, scrub-before-free): gap table in models/dma/README.md grounded against the landed symbols, done 2026-06-04.
  • make model-dma-tla — bounded TLC run of dma_authority.tla (pinned TLC 2.19 / tla2tools 1.7.4 + pinned Temurin JRE 17.0.19), lifecycle ordering plus generation-keyed stale completion, record-before-PTE-install split, drive-pin/quarantine, and queue-enable epoch-fence interleavings, checked clean at 2 devices / 2 domains / 2 pages / 2 iovas, generations 0..1, done 2026-06-04: dma-assurance-model-tla-checked-gate.
  • make model-dma-alloy — Alloy analysis of dma_authority.als (pinned Alloy Analyzer 6.2.0), device/domain/IOVA/page/alias authority graph plus the ownership-generation stale-handle gate, checked at scope for 4, done 2026-06-04: dma-assurance-model-alloy-checked-gate.
  • make kani-dma-authority — bounded Kani over an extracted pure DMA-authority core (capos_lib::dma_authority: ownership-generation bump on recycle, stale-handle rejection without mutation, no-re-expose before completion), make kani-lib style, done 2026-06-04. Faithful extraction of the device_dma.rs authority arithmetic; routing the kernel call site through the core is a tracked follow-up (kernel is no_std/no_main, not host-built): dma-assurance-model-kani-authority-core.
  • make model-dma-deferred-completion-loom — focused Loom (pinned 0.7.2) over the DeferredCompletionQueue reservation budget and the multi-CPU TLB shootdown generation re-read (deferred-EOI / completion concurrency the ring Loom does not cover), done 2026-06-04: dma-assurance-model-deferred-completion-loom.
  • CI wiring into the GitHub gate and local aggregate now that each target has a checked result. make dma-assurance-model-check runs Alloy/TLA+/Loom/ Kani locally when cargo-kani is installed; GitHub CI runs model-dma-alloy, model-dma-tla, and model-dma-deferred-completion-loom in dma-assurance-models, and kani-dma-authority in kani-proofs. Done 2026-06-05: dma-assurance-model-ci-wiring.

Scheduler & IRQ Assurance Models

The scheduler is the densest unmodeled concurrency surface in the kernel (per-CPU atomics read lock-free from ISR context while another path holds the scheduler mutex via try_lock, plus IPI cross-CPU activation) and has zero formal coverage today (smoke + measured suppressed-tick counters only). The IRQ MSI-X waiter race was fixed by reproduction, not a model. Mirrors the DMA operationalization pattern; tasks reuse the TLC/Alloy/Kani pins that track lands.

  • S1 scheduler-nohz-activation-model (done 2026-06-04 09:00 UTC) – TLA+/TLC for the nohz activation/rollback lifecycle + a focused Loom for the lock-free NOHZ_ACTIVE_CPUS bit vs locked nohz_activation[slot] record race. make model-scheduler-nohz-tla checks no timer-less CPU (NoTimerlessStall + EventuallyReArmed), bit/record agreement (EventuallyConsistent), and that a staled remote activation is dropped not applied to a newer lease (NoStaleActivation + StaledRecordEventuallyCleared); make model-scheduler-nohz-loom checks the lock-free-bit ↔ locked-record reconciliation keeps the timer armed. Checked results + mutation/non-vacuity evidence in models/scheduler/README.md.
  • S2 scheduler-lapic-oneshot-timer-model (done 2026-06-04) – Kani over the extracted pure count/clamp arithmetic (capos_lib::clockevent) + a TLA+ mode-transition lemma pinning the halt-first reprogram ordering. make kani-lapic-oneshot proves the clamp window is well-formed, the armed count is in [1, u32::MAX] with no u128 overflow, and the count round-trips to the request within one LAPIC count (3/3 SUCCESSFUL). make model-scheduler-lapic-oneshot-tla checks that after the halt-first reprogram the next fire is the one-shot at the armed count, never the periodic reload (OneshotModeBoundedCount + HaltedDisarmed + the periodicFiredInOneshotMode sentinel), and that every fire path (user- and kernel-mode consumption) restores a timer source (NoTimerlessStall + EventuallyReArmed / FiredEventuallyRestored liveness). Checked results + mutation/non-vacuity evidence in models/scheduler/README.md.
  • S3/S4 scheduler-cpu-isolation-lease-authority-model (done 2026-06-04 07:04 UTC) – Alloy for the lease/grant relational invariants + TLA+ for the two-lock teardown and the documented non-atomic createLease-vs-revokeGrant SMP window. make model-scheduler-lease-alloy checks the unforgeable grant->lease binding, no live lease through a revoked grant outside the explicitly modeled bounded window, capacity never undercounting a live lease, and the stale-handle generation gate; make model-scheduler-lease-tla checks generation advances exactly once per termination, no capacity double-free, the single chokepoint always runs unregister + SQPOLL-stop + nohz-rollback before recycle, no stranded generation (liveness), and that the renew deadline branch never resurrects. Checked results + non-vacuity evidence in models/scheduler/README.md.
  • IRQ irq-msix-waiter-determinism-model (done 2026-06-04 06:10 UTC) – TLA+/TLC for the waiter <-> delivery <-> deferred-EOI ordering the RX MSI-X waiter fix established. make model-irq-waiter-tla checks no spurious/early injection (NoCompletedEarly), exactly-once delivery/EOI/completion accounting, EOI drain before route re-arm (EpochDrainSound), and the NoLostWake liveness property; checked result + mutation evidence in models/irq/README.md.

Preserved Completed Security Context

These are completed and should not be re-read by default. They remain here so future work can find their design context without bloating docs/tasks/README.md.

  • Authority graph and resource accounting design for transfer model: docs/authority-accounting-transfer-design.md.
  • Supply-chain and generated-code TCB hardening: pinned Limine and external build downloads, generated-code drift checks, dependency policy, pinned Cap’n Proto compiler, shared tools/capnp-build, and deterministic generated-binding comparisons.
  • DMA isolation model before PCI/virtio/user-driver work: docs/dma-isolation-design.md defines short-term QEMU bounce-buffer decision, DMAPool, DeviceMmio, Interrupt invariants, and the userspace-driver transition gate.
  • ELF parser arbitrary-input coverage: proptest coverage plus a bounded cargo-fuzz target.
  • Telnet IAC filter fuzz coverage: TelnetFilter extracted to capos-lib::telnet, fuzz/fuzz_targets/telnet_filter.rs exercises the state machine with structural assertions (Normal/AfterIac emission rules, monotonic emit count). Will travel with the parser when networking moves to userspace per docs/proposals/networking-proposal.md.
  • Telnet IAC filter differential round-trip fuzzing (fuzz/fuzz_targets/telnet_filter_roundtrip.rs): synthesize arbitrary RFC 854 event streams (Data, WILL/WONT/DO/DONT, SB blocks with payload), encode to wire bytes, and assert that filter output equals the concatenated Data payloads. Found a real EXOPL handling bug in the original filter – the option byte right after IAC SB was being interpreted as the start of an IAC IAC escape, leaving the filter stuck in the subnegotiation state and silently dropping all subsequent data bytes. Fixed via a new AfterSb state that consumes the option byte unconditionally before entering payload parsing.
  • Line discipline extraction and fuzz coverage: pure LineDiscipline lives in capos-lib::line_discipline, returning LineStep { outcome, echo } descriptions. The kernel transport drives it and translates Echo::Byte/Echo::Backspace/Cancelled/ Submitted/Reprompt into the existing send_*_track_cr calls. Backed by fuzz/fuzz_targets/line_discipline.rs with structural invariants (line_len <= max_bytes, ±1 line_len delta per Pending step, Cancelled clears, echo only when buffer grows/shrinks).
  • Future: differential fuzzing against an external Telnet library (libtelnet or a Rust port) to catch RFC conformance bugs the structural and round-trip targets cannot express. Tracked as a follow-up to Track S.14.
  • Ring SQE wire validation extraction and fuzz coverage: lifted the per-opcode *_sqe_has_unsupported_fields predicates from the kernel into capos_config::ring, exposed a unified sqe_wire_validation_error entry point, and reroute the kernel through it. Added fuzz/fuzz_targets/sqe_validation.rs plus 12 host unit tests covering the classification boundaries each opcode imposes. Closes the originally planned three-parser fuzz set (elf::parse, manifest::decode, ring SQE decoder).
  • Well-formed SQE generator oracle (docs/tasks/done/2026-06-06/security-sqe-well-formed-generator-fuzz-local-proof.md): the test/fuzz-only sqe-validation-oracle feature exposes capos_config::ring::sqe_oracle, which generates validator-accepted SQEs for every opcode accepted by the current build, plus one-field rejecting mutations for reserved fields, unsupported flags, constrained cap/pointer/size fields, opcode-specific constraints, and session-disclosure reserved bits. The existing sqe_validation fuzz target keeps arbitrary-byte coverage and runs the positive/negative oracle on each input. This closes the shared wire-validator oracle gap only; it does not claim cap-table lookup, userspace pointer mapping, transfer-descriptor loading, or full kernel ring semantic coverage.
  • Track S.17 – sanitizers on host tests – partially landed. make sanitizer-host-tests runs AddressSanitizer over the capos-lib and capos-config host suites (crate set / features mirror the test-lib / test-config aliases). Outcome: zero findings – both suites pass clean under ASan, including the named unsafe suspects (FrameBitmap slot indexing, CapTable generation counters, lazy_buffer raw &mut [u8]). The “cheap to add” claim holds for ASan only: it needs no -Zbuild-std because its libc interceptors cover the uninstrumented precompiled std. - ThreadSanitizer (make sanitizer-host-tests-tsan) is blocked upstream, not by a capOS defect. TSan changes the crate ABI, so rustc refuses to link sanitized code against the uninstrumented precompiled std (mixing -Zsanitizer will cause an ABI mismatch). Instrumenting std needs -Zbuild-std, which then fails with duplicate lang item in crate core: sized for build-script-bearing dependencies (typenum / libc / cfg-if / subtle) when the sanitizer target equals the host triple – reproduced four ways (plain -Zbuild-std, renamed --target JSON spec, target-applies-to-host=false, and -Zhost-config). The TSan target is kept wired so it starts passing once the upstream -Zbuild-std + build-script issue is fixed. capOS concurrency invariants are meanwhile covered by the dedicated Loom model (cargo test-ring-loom); these host unit tests spawn no threads, so TSan’s marginal value here is low.

Hardware, Boot, And Storage Backlog

Detailed decompositions for hardware, boot packaging, block devices, and local storage. docs/tasks/README.md links here but should not inline these subtasks.

This is a forward-decomposition reservoir: it carries the open frontier (explicit DDF follow-up tasks, cloud/network next gaps, and the DMA-authority invariants that constrain new slices). Landed proof-by-proof chronology lives in docs/tasks/done/, docs/changelog.md, and git history; this file keeps only one-line “Landed:” pointers to it where a reader needs to know a capability exists.

DDF Dispatch Budget

Device Driver Foundation was the previously selected milestone and its production-authority closeout is recorded for the current brokered-bounce path. Future DDF slices should not reopen the retained review finding as a generic blocker; they should advance explicit follow-up tasks such as direct-remapping/vIOMMU production hardware support, device-autonomous MSI-X delivery, broader writable-DeviceMmio region selection, or follow-on provider/device variants. Harness-only updates should protect one of those authority steps rather than add another standalone proof layer.

Landed: the IOMMU/remapping groundwork and its disabled scaffold (DRHD/source/ domain records, MMIO-status diagnostics, disabled IOVA ledger, mapping-lifecycle preflight) through the bounded QEMU Intel path; see the IOMMU section below and docs/tasks/done/2026-05-12/ .. done/2026-05-23/.

docs/proposals/device-manager-refactor-proposal.md core refactor has landed: the device manager is the kernel/src/device_manager/ module tree. Remaining refactor work is optional risk reduction only: run behavior-preserving registry, ledger, or proof-internal splits when they reduce the risk of upcoming DeviceMmio, Interrupt, or DMAPool authority work or unblock that work’s review. Those slices remain subordinate to behavior-moving DDF authority slices and to scheduler SMP/nohz prerequisites.

Landed local follow-up: multi-PRP brokered NVMe BlockDevice windows (ddf-nvme-multiprp-blockdevice-window-local-proof). Landed local follow-up: the read-cap reply-scratch fail-closed clamp (storage-file-read-reply-scratch-clamp). Landed local follow-up: DeviceMmio map/unmap stale-generation proof (ddf-devicemmio-map-unmap-stale-generation-local-proof). Landed local follow-up: production DeviceMmio teardown transaction manager hold proof (ddf-devicemmio-production-teardown-transaction-local-proof). Landed local follow-up: production DMAPool buffer lifecycle over the manager ledger (ddf-dmapool-production-buffer-lifecycle-local-proof). Landed local follow-up: manager-owned DMABuffer free/reuse generation (ddf-dmabuffer-free-reuse-generation-local-proof). Landed local follow-up: Interrupt waiter reset-generation (ddf-interrupt-waiter-reset-generation-local-proof). Landed local follow-up: production Interrupt routed waiter / deferred-EOI lifecycle over the manager ledger (ddf-interrupt-production-waiter-lifecycle-local-proof). Landed local follow-up: provider IRQ/MSI stale-notification hostile lifecycle proof (ddf-provider-interrupt-stale-notification-hostile-local-proof). Landed closeout: the retained DDF production-authority review finding is closed by ddf-production-authority-closeout. Keep direct-remapping/vIOMMU and broad umbrella tasks blocked until their named gates are actually satisfied.

Growing the inline AttachedDmaPoolRecord::proof_buffers slot count beyond three slots is blocked on a prerequisite refactor: boot-time proof emissions pass AttachedDmaPoolRecord by value through nested paths starting at validate_dmapool_budget_policy_for_record (kernel/src/device_manager/dma_pool.rs) and the descriptor lifecycle emissions in kernel/src/device_manager/proofs.rs. A direct slot bump to four double-faulted make run-net with the BSP boot stack exhausted. Prerequisite: ddf-attached-dmapool-record-by-ref is done; any future proof-buffer growth should verify it still avoids by-value stack expansion before increasing the inline slot count.

Device Manager Refactor Track

The refactor keeps the kernel device manager as the single authoritative ledger for claimed devices. It must preserve the same ownership transactions across DMAPool, DMABuffer, DeviceMmio, and Interrupt; it should not create independent managers or move authority decisions into userspace.

Landed: proof split, handles/errors split, domain modules (mmio.rs/dma_pool.rs/dma_buffer.rs/interrupt.rs), and the transaction-helper cleanup, all while PciDeviceRecord remains the aggregate ledger owner. See ddf-device-manager-proof-split-closeout, ddf-device-manager-handles-errors-split, ddf-device-manager-domain-modules, and ddf-device-manager-transaction-helper-cleanup.

Open:

  • Optional follow-up splits. Further registry, ledger, or proof-internal splits may run when they are behavior-preserving and reduce near-term DDF review risk. They must preserve cap semantics, audit labels, proof labels, QEMU smoke output, lock ordering, and the single aggregate PciDeviceRecord ownership ledger.

Conflict guidance: treat this as part of the DDF kernel-core serial surface. It owns kernel/src/device_manager/ and overlaps with any DDF slice touching kernel/src/cap/device_mmio.rs, kernel/src/cap/interrupt.rs, kernel/src/cap/dma_pool.rs, kernel/src/cap/dma_buffer.rs, kernel/src/device_dma.rs, kernel/src/device_interrupt.rs, or DDF QEMU smoke assertions. Do not run it in parallel with scheduler SMP/nohz kernel slices that need kernel/src/process.rs or kernel/src/sched.rs review capacity if those prerequisites are the selected blocker.

Bootable Disk Image

Landed (complete track): make image raw hybrid BIOS+UEFI disk image, make run-disk (OVMF) and make run-disk-bios boot proofs, and provider packaging helpers (make package-cloud-image / package-gcp-image / package-aws-image) plus the import notes in docs/backlog/cloud-image-import.md. See docs/tasks/done/ (disk-image-*, closed 2026-05-25). Cloud NIC/storage driver ownership remains a separate, blocked track below.

Serial Diagnostics Console

Visible outcome: before cloud NIC/storage drivers are trusted, a cloud VM can boot to a COM1 diagnostics prompt and expose enough state to debug ACPI, PCI, interrupt, DMA, storage, and NIC bring-up through the provider serial console.

Landed: the COM1 diagnostics mode (no network/disk), the bounded command set (help/status/reboot/halt/cpu/mem/acpi/pci/irq/timers/ devices/logs; reboot is a recognized placeholder), the ACPI/PCI and virtio-net/DMA-ledger/interrupt-route dump slices, and scripted QEMU coverage.

Open:

  • Keep the serial path for command/control and bounded diagnostics only. Do not require large binary upload, in-place kernel replacement, or high-volume tracing over provider serial consoles.

ACPI And PCIe Discovery

Landed: Limine RSDP map, MADT LAPIC/I/O APIC enumeration, and MCFG parse with PCIe ECAM config-space access beside legacy QEMU I/O-port access.

Interrupt Infrastructure

Depends on ACPI and SMP Phase C LAPIC timer/IPI.

The MSI-X proof is kernel-owned: virtio-net config/RX/TX sources are recorded in the device interrupt registry against a bounded first-fit LAPIC device MSI vector pool, programmed through the typed PCI MSI-X table helper, claimed and unmasked by the in-kernel virtio-net owner, assigned to virtio vector registers, and proved by the TX source’s dispatch counter. A metadata-only QEMU virtio-rng function reuses the same path with a distinct claimed-masked owner. That virtio-rng function is a QEMU-only proof fixture, not a production driver and backs no userspace-facing capability (see virtio-rng); the entropy service is the separate RDRAND-backed EntropySource cap. Legacy I/O APIC routes have a bounded QEMU proof through the same registry.

Landed (kernel-side proof evidence, docs/tasks/done/2026-05-* / done/2026/; also make run-net, make run-interrupt-grant, make run-hardware-audit*): masked I/O APIC routing foundation, MSI/MSI-X capability discovery, the static and registry-backed virtio-net source-route proofs, the device MSI vector pool + exhaustion policy, claimed-route lifecycle / vector reassignment / stale-route rejection, driver-owned mask/unmask, the second-device (virtio-rng) proof, the first device-manager ownership and interrupt-source handoff proofs, the bounded teardown-trigger contract (seven object-backed rows), cap-specific release/process-exit/driver-crash/reset-disable/interrupt-waiter teardown hooks for DeviceMmioCap/InterruptCap/DmaPoolCap/DmaBufferCap, the read-side HardwareAuditLog.snapshot coverage, pending-IRQ token validation through capos-lib::device_authority, and bounded Interrupt wait/acknowledge/mask/unmask admission promoted to bounded route-state control plus one manager-grant-source routed waiter / deferred-EOI lifecycle proof (make run-interrupt-grant).

Open:

  • Continue real interrupt-source teardown beyond the manager-grant-source routed waiter proof: provider-driver IRQ/MSI waiters now have a local hostile stale-notification proof for reset/release/provider-death/waiter- cancel boundaries, but broader process-exit/driver-crash/reset-disable smoke coverage must keep using the proven ownership lifecycle rather than a separate route cleanup path.
  • Expose userspace Interrupt authority only after source ownership, generation checks, broader stale-notification lifecycle wiring, and the S.11.2 hostile IRQ smokes are implemented.
  • Add a selected-mode x2APIC QEMU proof over the landed x2APIC MSR backend (kernel/src/arch/x86_64/lapic.rs): make run-interrupt-grant-x2apic boots with -cpu qemu64,+smep,+smap,+rdrand,+x2apic, asserts LapicMode::X2Apic, and reuses the routed Interrupt.wait / Interrupt.acknowledge proof. This remains a bounded local proof, not a high-core hardware readiness claim.

PCI/PCIe Infrastructure

Promotes PCI enumeration from a networking substep to a reusable subsystem consumed by all device drivers.

Landed: PCI config access via legacy I/O ports and PCIe ECAM, the ECAM function mapping cache/ledger, full Q35 bus enumeration (scanned_buses=256), BAR parsing + reusable kernel MMIO subregion mapping, MSI/MSI-X metadata discovery, the second-device (virtio-rng) PCI proof, and the metadata-only QEMU NVMe PCI proof (make run-pci-nvme). See docs/tasks/done/.

NVMe userspace-bind chain (forward-relevant; landed steps with their successor gaps preserved):

  • Landed: the Model B kernel on-notify DMA validator (nvme-doorbell-dma-validator, kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan / completion_wakes_waiter): provider-writes / kernel-validates, fails closed outside the owner’s granted DMA window. Synthetic owner windows stand in for the live grant ledger; wiring the validator into a live NVMe DeviceMmio doorbell claim is valid only on a verified direct-remapping/vIOMMU or synthetic-address lane. The current no-IOMMU QEMU/GCP lane must use brokered queue-base/PRP materialization instead. Design: docs/proposals/nvme-model-b-doorbell-dma-validator.md; provenance: docs/devices/nvme.md; reconciliation: docs/dma-isolation-design.md (Provider-Written Addresses And No-IOMMU Brokered Bounce).
  • Landed: the read-only userspace NVMe bind (nvme-bind-claimed-mmio-read), userspace NVMe controller reset (nvme-controller-reset-selected-write, CC-scoped fail-closed selected write), and the no-IOMMU brokered controller enable (DeviceMmio.brokeredNvmeControllerEnable, schema @6; kernel-authored AQA/ASQ/ACQ from the live DMAPool ledger, no provider-supplied CC bits or host-physical/device-visible address). Proof make run-pci-nvme; provenance docs/devices/nvme.md §§5-6.
  • The provider-written Model B enable (nvme-userspace-bind-and-controller-bringup) remains a separate direct-remapping/vIOMMU lane (still open / blocked).

Open:

  • Add userspace DeviceMmio authority and ownership boundaries for out-of-kernel drivers only after the device-manager and DMAPool gates below are in place.
  • Extend beyond metadata-only discovery to virtio and NVMe driver binding as those reusable driver paths land.

Device Authority And Userspace Driver Gate

Ordered after the generic MSI/MSI-X dispatch table and second-device proof. The current brokered-bounce provider paths have landed their local/GCE evidence; future direct-remapping/vIOMMU, provider-written-address, hostile-hardware, or broader device-owner paths remain gated by the selected backend contract and Security Verification Track S.11.2 in docs/dma-isolation-design.md.

DMA authority invariants (settled; these constrain every new slice — do not weaken them). Per docs/dma-isolation-design.md (accepted): backend selection is a runtime, fail-closed kernel decision — direct IOMMU remapping only when a probe verifies usable hardware, otherwise kernel-owned bounce buffers. On the no-IOMMU lane the manager is the single owner of every bounce page’s host-physical address and IOVA: host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only, real DMA not-attempted. Pool/buffer/handle lifecycle is generation-checked and fail-closed on stale/freed/wrong-owner/wrong-state; pages stay committed, resident, and unswappable while device-visible and are scrubbed before release; quiesce + scrub precede free; stale completions and stale IRQs after reset must not wake a waiter or mutate accounting. The device-manager ledger is the single record of DMA pool bytes, buffer count, descriptor/ring depth, page-rounded MMIO mappings, interrupt holds, in-flight DMA submissions/completions, ownership generations, budget/OOM policy, and teardown state.

Landed (prerequisite proofs and the first production userspace surface; docs/tasks/done/, make run-net / make run-dmapool-grant / make run-dmapool-grant-exit / make run-devicemmio-grant / make run-interrupt-grant / make run-hardware-grant-cycle / make run-hardware-audit* / make run-ddf-provider-consumer / make run-iommu-remapping):

  • the in-kernel device-manager object model, interrupt-source attach/detach, the kernel-owned DMAPool accounting / budget / OOM / tamper / over-budget proofs bound to attached records, the imported-live-accounting record over the device_dma ledger, and the device_dma zero-live / stale-handle / stale-completion / publication scratch proofs routed through the pure capos-lib::device_authority validators;
  • the documented production handle epoch invariants plus their pure validator and host tests; the manager-attached DMA-buffer record proof;
  • the production DMAPool.allocateBuffer result-cap method and its manifest grant, plus admission/typed DMABuffer submitDescriptor / completeDescriptor / map coverage, the userspace-VMA bounce-buffer map + protection hardening, the shared descriptor validator and manager-inflight accounting, the userspace-visible completion effect, and the provider-visible shadow-descriptor / selected-queue-entry side effects feeding the provider-consumer gate; the selected virtio-net TX backend + notify-offset claim policy;
  • the bounded sequential DeviceMmio/Interrupt grant-cycle reuse proof; admission + shared-validator + real-effect DeviceMmio map / read32 / write32 coverage; cursorable/edge read-side audit snapshots; and the first production DeviceMmioCap/InterruptCap cap-release + process-exit hooks; the exposed DeviceMmio user-map path records a manager-owned user hold (borrowed VMA, page-rounded BAR window, mapping generation, and selected-write policy label) and explicit unmap, cap release, and process exit clear it before detaching the reusable mapping generation; driver-crash and reset/disable hook markers remain bounded no-userspace-MMIO proofs that assert no user hold is live before detach;
  • the manager-handle identity fields carried into the result-only DMAPool/DMABuffer/DeviceMmio/Interrupt info surfaces;
  • the real pinned-page DMAPool page-lifecycle slice (ddf-real-dmapool-pinned-page-realness, done 2026-05-26): the kernel ledger owns real scrubbed frame::alloc_frame_zeroed pages and the manager imports a live snapshot on the honest bounce-buffer run-net path;
  • the S.11.2 hostile smokes (stale DMA handles, descriptor abuse, revoke/reset races, stale IRQ after reset, stale DMA completion after reset, exit-under-DMA; S.11.2.7/8 over real free/realloc on make run-net; the IOMMU-backed production matrix on make run-iommu-remapping); see docs/tasks/done/2026-05-26/ and the IOMMU section;
  • the first exposed userspace DeviceMmio + Interrupt surface (ddf-userspace-writable-devicemmio-interrupt, done 2026-05-26): read-only BAR map + brokered read32 + a real write32 on a claimed register, manager-cap wait/mask/unmask with deferred delivery and no-stale-wake-after-revoke, and real-route userspace wait/acknowledge with deferred LAPIC EOI through the provider tx_interrupt/rx_interrupt caps driven by a userspace process (make run-ddf-provider-consumer), plus the non-implication negative-authority assertions on both grant smokes.

Open:

  • Require DDF authority-surface hazard preflight before new behavior slices. The slice handoff/review prompt should state the relevant paging/MMIO, DMA, IRQ, ABI, and docs-authority invariants before code changes start. This is a workflow gate for avoiding bounded-proof overclaims and late review discovery of known infrastructure hazards.
  • Broader writable-DeviceMmio region selection remains out of scope until a separate manager-selected register-window design lands.
  • Direct-remapping/vIOMMU, provider-written device addresses, and hostile bus-mastering hardware isolation remain future work. The current no-IOMMU cloud path stays on brokered bounce-buffer authority.
  • Physical Store-backed hardware-audit local persistence, keyed segment seals, and runtime subscriber refusal are closed by hardware-audit-physical-persistence-signing-local-proof: the QEMU proof reuses one persistent_store disk across two boots, recovers pass-1 audit segment blobs through Store inventory before pass-2 drain, verifies development-source RAM-local HMAC segment seals, reports key lifecycle caveats, and refuses runtime reader admission until an authority-broker path exists. External verifier key custody, production rotation/revocation, rollback resistance, and broader runtime admission remain future; audit is observer evidence and does not grant DMA/MMIO/IRQ authority.
  • Device-autonomous MSI-X local APIC delivery is closed by cloud-prod-qemu-kvm-virtio-net-msix-apic-delivery-resolution and the dependent RX waiter proof cloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof. The current provider path can still use polled completion when interrupt delivery is not required, and live-GCE device-autonomous interrupt evidence remains future work.

IOMMU/DMAR/AMD-Vi Staging

Deferred-with-known-dependency planning gate. capOS has a bounded QEMU Intel remapping implementation for the selected smoke path, not a general hardware isolation claim for production NIC or storage ownership. The selected QEMU Intel path programs manager-owned per-device domains for two claimed DMA-capable functions, exports only domain-scoped IOVAs, hides host physical addresses, and fails closed for stale or wrong-owner domain assignment; it emits an honest direct-DMA posture (real_dma=attempted, direct_dma=enabled, remapping_tables=programmed) over the real ledger, with mappings installed before the doorbell and invalidated/IOTLB-flushed before reuse, while hostile_hardware_isolation stays not-claimed (QEMU-emulator evidence).

Current no-IOMMU cloud/user-provider paths use brokered bounce-buffer authority, not direct DMA. Direct-remapping/vIOMMU work, trusted sharing groups, and hostile-hardware isolation remain blocked on their own future gates in docs/dma-isolation-design.md.

Landed (umbrella + children, docs/tasks/done/2026-05-12/ .. done/2026-05-26/; make run-iommu-acpi, make run-iommu-remapping): the IOMMU dependency record, bounded Intel DMAR / AMD-Vi IVRS ACPI discovery, DMA-capable-function attach + uncovered marking, the per-device DMA domain policy and its pure fail-closed admission helper, the COM1 diagnostics mirror, the disabled table scaffold + MMIO-status diagnostics + disabled IOVA ledger + mapping-lifecycle preflight, the first real QEMU Intel table-programming smoke (real VT-d table programming, hardware-DMA translation, two-phase invalidation/IOTLB-flush revocation, IOMMU-backed hostile stale-DMA smokes), production DMAPool ledger integration, domain-scoped IOVA export discipline, fault recording/diagnostics, per-device domain granularity, the no-usable-IOMMU fallback policy, the IOMMU-production teardown/bounce-buffer S.11.2 matrix, and the honest direct-DMA posture line (ddf-iommu-remapping-production-closeout).

Open (future, not on the bounce-buffer critical path): AMD-Vi programming, scalable-mode / interrupt-remapping / device-IOTLB, aw-bits=48 4-level tables, trusted multi-device sharing groups, and production cloud NIC/storage driver ownership remain separate future tasks. kernel/src/iommu.rs stays cfg(feature = "qemu")-gated as a separate verified-remapping lane.

Reusable Block-Device Path

Landed: the device-generic virtio queue/transport helpers factored into kernel/src/virtio.rs pub(crate) mod transport (ddf-virtio-transport-helper-factor), the device-agnostic VirtqueueDma DMA/notify seam + seam-driven Virtqueue/DmaPage + parameterized discover_modern_transport (ddf-virtio-driver-foundation-boundary), the virtio-blk sector read/write smoke (make run-virtio-blk, ddf-blockdevice-boundary-virtio-blk-smoke), the first BlockDevice trait/CapObject boundary (kernel/src/cap/block_device.rs), and multi-device virtio-blk support + a target-disk grant source (make run-multi-virtio-blk, KernelCapSource.blockDeviceTarget @44, ddf-multi-virtio-blk-device-support). Landed: block_device_target now resolves by manifest PCI segment:bus:device:function identity and fails closed when the selector is absent, mismatched, or names the resolved boot disk; proof make run-blockdevice-target-identity. See docs/tasks/done/2026-05-25/, done/2026-05-26/, and done/2026-06-05/.

Open:

  • Add storage services behind userspace ownership: storage-userspace-persistent-store-namespace-service-local-proof moved Store/Namespace serving onto a persistent userspace service (make run-storage-persist-service), and storage-userspace-directory-file-service-local-proof followed with Directory/File serving and result-cap transfer from userspace (make run-userspace-directory-file-smoke).
  • Retire the ambiguous kernel-owned Store/Namespace/Directory/File production storage routes: storage-legacy-kernel-storage-cap-backer-retirement gated the RAM-backed file/directory/store/namespace kernel grant sources behind qemu (fail-closed in the default production kernel, joining the already-gated virtio read_only_fs_root/persistent_store/ writable_fs_root mount sources) and named all remaining kernel storage backers as proof/fixture surface in code and docs. Production storage is userspace-served; the default system.cue boot grants no kernel storage caps.
  • Retire the transitional kernel virtio-blk production owner: storage-legacy-kernel-virtio-blk-path-retirement ratified that the kernel-owned virtio-blk driver, its BlockDevice cap arm (BlockDeviceBackend::Virtio), and its PCI discovery (diagnose_qemu_virtio_blk) are all qemu-feature-gated; the default production kernel never binds virtio-blk and resolves block_device to the userspace-brokered NVMe arm (BlockDeviceBackend::NvmeBrokered, fail-closed without a verified controller and a live device_mmio grant), with block_device_target fail-closed (requires the qemu feature). virtio-blk is named as a qemu fixture / regression in the device doc, smoke scripts, and fixture manifests; the production-storage gate is the run-cloud-provider-nvme-blockdevice-* chain. The kernel broker responsibilities (PCI claim arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation rejection, and revocation) stay kernel-owned and are the same surfaces the userspace storage driver binds into.

Local Disk Storage Milestone

Visible outcome: default storage-focused QEMU boots from a disk image, exposes a read-only directory from local disk, and proves one capnp object can be persisted and read back after reboot. Milestone complete.

Landed (docs/tasks/done/2026-05-14/ .. done/2026-05-25/): the Store/Namespace + file-I/O schema slices and RAM-backed naming round-trip proof (make run-storage-naming); virtio-blk wired into BlockDevice (make run-virtio-blk); the read-only filesystem service over BlockDevice (kernel/src/cap/readonly_fs.rs, CAPOSRO1, make run-storage-fs); and the disk-backed persistent Store with a two-boot reboot proof (kernel/src/cap/persistent_store.rs, CAPOSST1, make run-storage-persist). Disk-backed delete tombstones entries in place; a later put that would hit the entry-table or data-cursor limit now compacts live CAPOSST1 store entries through a shadow generation before recommitting the canonical front generation (make run-storage-persist). Store/persistent durability across passes rests on host page-cache coherence; a virtio FLUSH for write-back-cache media durability is deferred to the Writable milestone.

Writable Local Storage Milestone

Visible outcome: a storage-focused QEMU image can create, overwrite, truncate, rename, and remove files through capability-scoped Directory/File caps, persist both file and store mutations across reboot, and recover to a consistent state after an unclean shutdown test. Milestone complete.

Landed (docs/tasks/done/2026-05-26/; make run-storage-writable, make run-storage-writable-recovery): the fail-closed single-writer policy (documented in the storage proposal); directory mutation (create/mkdir/remove/rename, additive Directory.create @5/rename @6) and writable File paths (overwrite/append/truncate/sync/close, bounded by MAX_FILE_BYTES 64 KiB) over kernel/src/cap/writable_fs.rs; disk-backed write-through persistence of the CAPOSWF1 sub-volume co-located with the CAPOSST1 Store in one combined image (now produced by tools/mkstore-image --writable); real File.stat created/modified timestamps with internal ClockProvenance labels carried from the same WallClock source in the CAPOSWF1 node record; and one forced-poweroff unclean-shutdown recovery proof (proof-only storage_writable_recovery feature) verifying the superblock-commit-ordering invariant.

Bounded-proof caveat: the recovery proof exercises one record-vs-commit window under host-page-cache durability (no VIRTIO_BLK_F_FLUSH; kill -9 preserves the host page cache); it proves the kernel’s superblock-commit-ordering invariant, not general media crash-consistency against host power loss. The co-located CAPOSST1 Store now has bounded tombstone reclamation through make run-storage-persist; writable-file extent reclamation remains future work.

Managed Cloud Store Bridge

Visible outcome: application services can persist bounded Cap’n Proto records through a cloud-backed capability while local QEMU tests exercise the same semantics through a fake bridge.

Open gates:

  • Define a provider-neutral CloudStoreBridge or app-specific SaveStore interface with put/get/compare-and-set/append operations, explicit size limits, profile or tenant scoping, schema version, and stale-write rejection.
  • Add a local fake-cloud bridge used by host tests and QEMU smokes. It must reject wrong-profile loads, stale mutable writes, oversized records, and ledger rewrites.
  • Add a GCP deployment note for Cloud Run bridge service, Firestore Native mode mutable indexes/profile summaries, Cloud Storage versioned blobs, and Secret Manager credentials.
  • Add Cloud KMS keying notes for managed game-world storage: key ring/key per world or shard, narrow encrypt/decrypt IAM authority, rotation, retired world revocation, and audit logging.
  • Keep provider credentials outside ordinary capOS clients. Only the bridge service receives cloud credentials; game/storage clients receive narrow capabilities.
  • Add lifecycle/retention/cost controls before writing real snapshots or evidence blobs to Cloud Storage.
  • Treat local disk-backed Store as the offline/QEMU baseline even when cloud persistence is available.

User-Owned Browser Save Transport

Visible outcome: private user data can be backed up through the user’s browser to Google Drive or Firebase as encrypted capsules while capOS never receives provider tokens.

Landed (policy + host-test gates): the provider-neutral browser transport policy for opaque encrypted save capsules / opaque provider handles / capsule + wrapped-DEK metadata; fake Drive and fake Firebase host-test adapters modeling deletion / duplicate writes / stale versions / rollback / missing network / non-opaque handles / authenticated-user mismatch / Firebase auth UID path injection; the Drive appDataFolder (drive.appdata) and Firebase/Firestore per-user-capsule notes; and the KMS / token / key-capability boundary records (browser transports ciphertext + handles only).

Open (future real-provider integration):

  • Implement real Google Drive and Firebase browser-companion adapters after the provider-token boundary is exercised outside ordinary capOS clients.
  • Reuse the existing save-capsule restore rejection tests as the acceptance gate for real provider adapters: tampered, wrong-profile, stale, oversized, unknown-content, and unsigned capsules must still fail before provider bytes can mutate save state.
  • Add real-provider failure-mode coverage for deletion, duplicate writes, stale versions, rollback attempts, offline cache/sync replay, and missing network using the same semantics as the fake adapters.

Boot Binary ISO Layout

Move ELF payloads out of the Cap’n Proto manifest blob and into explicit boot package sources. The CD-ROM path uses ISO 9660 files read on demand through a minimal kernel ISO driver; the raw disk and cloudboot paths use Limine-loaded modules staged on the FAT ESP. Both keep the manifest as topology and decouple ordinary service binary bytes from NamedBlob.data. capOS remains Limine-backed for the current boot line; Limine supports FAT and ISO9660/CD-ROM media, so CD-ROM/ISO is a planned boot/install variant rather than a path to delete.

Landed (docs/tasks/done/2026-05-24/; make run-boot-iso-read, make run-boot-iso; producer guard added 2026-06-06): the minimal ATA PIO CD-ROM read_sectors reader (boot_iso_read), the read-only ISO 9660 driver (open_file(name) -> (lba, size), fail-closed bounds), mkmanifest --copy-bins (names-only manifest, empty NamedBlob.data) with producer-side rejection for names whose ISO 9660 d-character form exceeds the level-3 31-character limit or collides after normalization, the opt-in make capos-name-only.iso, the kernel run_init() on-demand-read switch + BootBinary registry behind the boot_iso feature (make run-boot-iso), and the BOOT_MANIFEST_MAX_BYTES doc + the -iso-level 3 name-only ISO build recipe. Landed 2026-06-07 21:36 UTC in commits 22320411 and f0695442: the default make image raw disk and make capos-cloudboot-image cloudboot targets now use a name-only manifest plus Limine module payloads staged under /boot/bins/; see boot-limine-disk-boot-binary-source-local-proof. Landed 2026-06-07 21:59 UTC: the default make, make run, and make run-smoke ISO paths now use name-only manifests plus boot_iso on-demand reads from /boot/bins/, so ordinary service ELF bytes no longer ride in NamedBlob.data for the default bootable ISO paths. The generic embedded ISO rule remains available for focused fixtures that have not moved to a name-only boot source.

Closed:

  • After that source is proven, boot-embedded-data-retirement-and-atapi-userspace-serving retires the embedded-data branch for ordinary service binaries. The retained ATAPI/ISO Directory/File cap is explicitly a QEMU install-source fixture over the early boot reader, not a general kernel filesystem service; broader post-bootstrap package browsing remains a userspace-service concern outside this fixture.

Cloud Device Tracks

These are portability notes, not implementation evidence. The first cloud milestone is imported-image serial-console boot; provider NIC/storage drivers are later usable-instance work and remain blocked by cloud-provider binding, DMA/IOMMU or explicitly accepted bounce-buffer policy, interrupt, teardown, and network/storage evidence gates above. Local implementation and *-local-proof records in this track run under host tests, QEMU, or local cloudboot-image QEMU unless their acceptance explicitly says otherwise; they must not be blocked on cloud access. The local bounded provider-consumer closeout does not implement a cloud-ready userspace virtio-net, virtio-blk, virtio-scsi, NVMe, gVNIC, ENA, or cloud storage/NIC driver. The GCP-first usable-instance provider rollup is closed by cloud-usable-instance-provider-nic-storage; future public ingress, AWS, Azure, broader storage, high-throughput NIC, and direct-remapping lanes remain separate work.

Access correction (2026-05-27, updated 2026-06-06). The GCP cloud tracks are NOT prefix-blocked on cloud access. Local implementation and *-local-proof tasks stay dispatchable once their local prerequisites are satisfied, including tasks named cloud-prod-* that only boot the production cloudboot kernel under QEMU. Only live/billable proof tasks that cross a provider API, provider hardware, public ingress, public CA/DNS, or explicit make cloudboot-test acceptance require access authorization. GCE access is provisioned for the configured cloud sandbox project: tools/cloudboot/run-test.sh is hardcoded to it (no public IP, no service account/scopes), the 2026-05-24 GCE live probes recorded n1-standard-1, e2-small, c3-standard-4, and n2d-standard-2 Confidential shapes (IOMMU disabled → SWIOTLB → labeled bounce-buffer) in Cloud DMA Provider Evidence Inventory, and Cloud Build runs Kani proofs (tools/cloudbuild-kani.yaml). The local QEMU virtio-net/NVMe foundations and the local production cloudboot bind markers exist. The GCP live NVMe Persistent Disk read proof is now closed by cloud-gcp-storage-driver; remaining live driver slices are blocked only by their own local authority, product-scope, and real-provider evidence gates. Slices that only need structured serial evidence from already-production code (for example cloud-network-terminal-access path 1) are runnable on real GCE.

Cloud-Leg Decomposition Track (2026-05-24)

The cloud-usable-instance-provider-nic-storage umbrella was decomposed into discrete slices and is now closed as the GCP-first provider rollup.

Landed foundation (docs/tasks/done/2026-05-24/ .. done/2026-05-30/): the cloud DMA provider-evidence inventory, the runtime fail-closed DMA backend selection mechanism (cloud-dma-backend-selection: probe → fail-closed select → manifest override; authoritative contract in the “Cloud DMA Backend” section of docs/dma-isolation-design.md), the local-QEMU GCP virtio-net binding precursor

  • cloud-shape classification, the production (non-qemu) cloudboot-evidence: dma-backend / device-class / device-inventory markers, the minimal read-only production PCI enumeration surface, and the production DDF/PCI bind-stack decomposition.

Landed production bind-stack children (terminal local-bind markers settled with a kernel-side dispatch-slot proxy where the userspace driver authority surface is cfg(feature = "qemu")-gated out of the non-qemu build): cloud-prod-pci-claim-inventory, the DeviceMmio BAR-readback grant (make run-cloud-devicemmio-grant), the DMAPool bounce-buffer grant (make run-cloud-dmapool-grant), the interrupt route-alloc + live-delivery proofs, the terminal provider-nic-bound / storage-bound proxy markers, the three production userspace-provider grant-source proofs (DeviceMmio/DMAPool/Interrupt), the aggregate grant-surface closeout, and the real provider-cap-side Interrupt.wait/acknowledge cap-waiter proof (cloud_provider_cap_waiter_proof, make run-cloud-provider-cap-waiter). See docs/tasks/done/2026-05-28/ and the task-graph reconcile cloud-live-driver-task-graph-reconcile.

Landed virtio-net userspace-provider chain (the stale parent is closed by its child sequence, make run-cloud-provider-virtio-net*; docs/tasks/done/2026-05-28/ .. done/2026-06-07/): the non-qemu-buildable virtio modern-transport host surface (kernel/src/virtio_transport.rs), the device bring-up proof, the same-BDF DeviceMmio+DMAPool+Interrupt authority bundle, TX and RX queue materialization, MSI-X function-enable, TX submit/doorbell + polled completion, the userspace DMABuffer map/submit live-publish path, TX and RX MSI-X wait/ack, RX userspace-submit, the production-IDT real-interrupt-gate dispatch wiring, the RX polled-completion-no-inject proof, the always-built polled provider graduated off the per-proof feature, the real-polled-driver provider-nic-bound re-point (removing the proxy as source), the polled teardown + driver-death/process-exit stale-authority discipline, the legacy/transitional virtio 0.9 PIO + INTx local bind, and the real-GCE legacy-polled provider-nic-bound run.

Landed NVMe brokered userspace-provider chain (the parent is closed by its child sequence; make run-cloud-provider-nvme-*; docs/tasks/done/2026-05-29/ .. done/2026-06-05/): read-only bind → controller reset (selected CC-clear write) → admin queue materialization → brokered controller enable (manager-op DeviceMmio.brokeredNvmeControllerEnable @6, manager-authored AQA/ASQ/ACQ; raw CC.EN-set fails closed) → admin IDENTIFY (@7, then split SUBMIT @8 / COMPLETE @9) with the admin-completion Interrupt.wait/acknowledge handoff over the cap-waiter MSI-X route → I/O queue-pair create (@10/@11) → I/O READ (@12/@13) → WRITE (@14/@15, read-back match) → arbitrary/second LBA (@16/@17) and multiblock (@18/@19) → single-call synchronous poll-read (@20/@21, no Interrupt.wait on the data path) and inline read-bytes (@22) → the BlockDevice.readBlocks-shaped fixed-LBA then arbitrary-LBA read arm (BlockDeviceBackend::{Virtio,NvmeBrokered}) → readonly_fs over the NVMe BlockDevice (single-file then multi-file dir-walk) → writeBlocks @1 durability + real FLUSH @3 (opcode 0x00) + clean-reboot persistence + forced-poweroff crash-consistency → persistent_store and writable_fs (plus recovery) over the NVMe write arm → File.sync/Store-commit routed to a real NVMe FLUSH → the capstone read-arm graduation into always-built production (fail-closed runtime capability probe kernel/src/nvme_storage_backend.rs) and the always-built device_manager::nvme_sync_io_state sync-I/O state seam → dedicated data-path completion interrupts for BlockDevice.writeBlocks @1 and readBlocks @0 (make run-cloud-provider-nvme-io-completion-interrupt).

All brokered NVMe steps hold the no-IOMMU discipline: PRP1/queue-base addresses are manager-owned bounce buffers, never exported; no provider-written queue-base/PRP/SGL address, no host-physical or IOVA export, no direct-DMA claim, no cloud/guest IOMMU assumption. QEMU caveat: “an unflushed write rolls back” is not provable under QEMU’s -device nvme cache=writeback model (unflushed_rollback=not-provable-under-qemu-nvme-model).

Open / blocked:

  • cloud-usable-instance-provider-nic-storage (done 2026-06-07) — closeout-only rollup over the landed GCE evidence: serial-console operator access (1779868872-2424), live legacy virtio-net raw-frame provider-nic-bound (1780412056-e1cb), live NVMe Persistent Disk brokered READ (1780806087-bf69), and the separate gVNIC raw-frame / typed-Nic portability runs (1780794927-1aa9, 1780796615-decc). This closes the GCP-first provider NIC/storage bar without claiming public L4 ingress, AWS/Azure, broader storage variants, direct DMA/remapping, or high-throughput NIC readiness.
  • cloud-gcp-nic-enumeration-evidence (blocked/decomposed 2026-05-27) — coupled honest production-path enumeration markers to a provider-nic-bound + --require-provider-nic-proof gate the harness reserves for the driver slice, plus a billable real-GCE run an autonomous worker cannot self-authorize. The honest production-marker slice landed; the provider-nic-bound + real-GCE proof folds into cloud-gcp-virtio-net-nic-driver.
  • cloud-prod-virtio-net-userspace-provider-local-proof (done/closed 2026-06-07 02:54 UTC) — this stale parent is closed by the landed child chain above. The local non-qemu cloudboot/QEMU path has the modern TX/RX provider proofs, always-built polled provider, honest provider-nic-bound marker sourced from real polled TX+RX progress, and clean-release plus process-exit teardown. The GCE-compatible legacy-polled path also passed real GCE through the billable cloud-prod-gce-billable-boot-real-polled-nic-bound run. Remaining future lanes are L4 socket/smoltcp relocation, literal system.cue provider fold, reusable full-NIC/multiqueue readiness, and live-provider device-autonomous MSI-X evidence.
  • cloud-prod-nvme-brokered-userspace-provider-local-proof (done/closed 2026-06-07 02:08 UTC) — this stale parent is closed by the landed child chain above. The local non-qemu cloudboot/QEMU path has the brokered controller/admin/I/O provider proof, BlockDevice read/write/flush and filesystem consumers, dedicated data-path completion interrupts, and NLB > 8 multi-PRP windows with manager-authored PRP lists. Remaining future lanes are a second namespace, FUA/DSM, live GCP evidence, device-autonomous MSI-X completion delivery, and any direct-remapping/vIOMMU/provider-written-address model.

Production Bind-Stack Port (qemu-gate dissolution)

The cloud-prod-*-local-proof chain proved each behavior behind a focused per-proof Cargo feature (cloud_*_proof) that compiles a kernel-side cap::*_proof module into the non-qemu build only when its feature is on. Those proofs are correct but do not graduate the underlying device surface to always-built production code. The qemu feature conflates three jobs: (1) test-harness affordances (isa-debug-exit shutdown, self-tests, diagnostics/measure/debug_tap/boot_iso/storage_writable_recovery, the VT-d smoke) that must stay compile-gated; (2) unproven-on-hardware device surface kept dormant; (3) genuine host capabilities that should be runtime-probed, not compile-gated. The unlock is not removing the cfg and not an “am-I-QEMU” runtime branch (it links unproven MMIO/DMA into production = fail-open against the brokered-DMA discipline, and forfeits dead-code elimination as a TCB property). The unlock is to dissolve the gate per-piece: port each dormant capability into always-built production code as it is proven, fronting hardware-dependent behavior with a fail-closed runtime capability probe (the kernel/src/dma_backend.rs probe → fail-closed → manifest-override pattern). Hard caveat: the no-IOMMU bounce-buffer discipline is preserved (host_physical_user_visible=0, direct_dma=blocked, iova_export=disabled-future-only), and kernel/src/iommu.rs stays cfg(feature = "qemu")-gated as a separate future verified-remapping lane.

Umbrella: cloud-prod-ddf-bindstack-qemu-gate-dissolution (done 2026-05-30).

Landed children (docs/tasks/done/2026-05-29/ / done/2026-05-30/): the RX MSI-X waiter-determinism fix (the provider-consumer flake was a synthetic-RX-dispatch delivery-ordering race; gating injection on the waiter thread being parked in cap_enter, 28/28 green), grant-source despecialization (stage_with_class + ProdGrantClass), ECAM/MCFG enumeration graduation (fail-closed runtime MCFG probe), MSI-X programming graduation (cap::interrupt_programmed::program_attach_arm_unmask + device_interrupt::wait_kernel_injected_dispatch now always-built), the device-manager backend port (always-built ProductionDeviceTable device-record / bounce-DMA / interrupt-route backend replacing the device_manager::stub slot), and the qemu/test_harness feature split.

Open:

  • [~] ddf-provider-consumer-dmabuffer-page-fault-baseline (blocked/premise-refuted) — the reported deterministic DDF/QEMU DMAPool/DMABuffer PAGE FAULT did not reproduce (0/28 on d2a342d2, byte-identical kernel to 45c4beb9). Keep historical unless new evidence re-establishes the original fault.

The local virtio-net and NVMe userspace-provider parents are both closed by their child chains, so the live provider tasks now sit behind their own real-cloud evidence and product-scope gates rather than stale local-parent blockers. The cloud/GCP track stays brokered bounce-buffer authority; this does not reopen direct DMA, guest IOMMU, or direct-remapping assumptions.

  • cloud-gcp-virtio-net-nic-driver (DONE/superseded 2026-06-02 by the slice-6 billable run, see the GCE Polling Path track below) — the live legacy virtio 0.9 NIC was bound through the kernel-brokered legacy polled path, passing --require-provider-nic-proof. Honest scope: userspace_driver_authority=kernel-brokered-legacy-polled, so this closes the real-GCE bind bar without claiming L4 socket reachability, reusable multiqueue/full NIC readiness, or live-provider device-autonomous MSI-X delivery.
  • cloud-gcp-storage-driver (done 2026-06-07) — the live GCE NVMe Persistent Disk path passed make cloudboot-gcp-storage-nvme-io-read-test on run 1780806087-bf69 at source commit 28518165518c29a48633682f4a6d9b5844c43335. Evidence identified storage_interface=nvme, vendor.1ae0, device.001f, c3-standard-4, europe-west3-a, one brokered 512-byte READ, no public IP, no service account, and complete teardown. The selected GCP path remains brokered-bounce queue-base/PRP materialization; provider-written Model B is reserved for a direct-remapping/vIOMMU or synthetic-address lane. This does not claim the older virtio-scsi PD path, Local SSD, a gVNIC datapath, or full filesystem integration.
  • cloud-network-terminal-access (done 2026-05-27; path 1, serial-console shell, needs no NIC driver) — proved a reviewed cloud operator access path beyond capos kernel starting over the GCE serial console (cloudboot-evidence: access-path serial-console-shell; real-GCE run 1779868872-2424, no public IP, no service account). Paths 2/3 (TCP/Telnet) depend on cloud-gcp-virtio-net-nic-driver; path 4 (SSH) is a separate milestone.
  • cloud-launch-teardown-policy-hardening (done) — hardened the cloudboot harness into the usable-instance gate: --require-provider-nic-proof, structured provider.json evidence, fail-closed launch-policy read-back, and nonzero exit on teardown failure or incomplete evidence.

Future provider slices (not required for the initial GCP usable-instance gate). The AWS and Azure tracks are split by proof surface: standard storage controllers (NVMe / virtio-scsi) are QEMU-emulable now, while the vendor-custom NICs (ENA, MANA) get host-conformance gates plus a deferred live proof because QEMU does not emulate them. The NVMe path’s shared GCP storage-provider foundation has landed via nvme-io-queue-and-read, so the NVMe-only AWS (Nitro EBS) and Azure (managed-disk) tracks re-scoped to a small cloud-shape classification delta and landed (both done 2026-05-28). The virtio-scsi alternative is not a shortcut: capOS has no userspace virtio-scsi provider driver, and make run-virtio-blk proves the kernel-owned virtio-blk driver, which leaves the hidden kernel DMA ownership the provider-authority acceptance forbids — so the older-family SCSI path stays out of scope.

AWS:

  • cloud-aws-nvme-storage-driver (done) — the AWS Nitro EBS NVMe cloud-shape classification delta on the shared NVMe foundation (make run-pci-nvme; docs/devices/aws-nvme.md). Live AWS EBS evidence is the deferred cloud-aws-storage-live-proof.
  • cloud-aws-ena-nic-protocol-conformance (done) — ENA protocol encode/decode in capos-lib/src/ena.rs with a host conformance suite vetted against the ENA spec / Linux driver headers. Gate: cargo test-lib (deliberate QEMU-exception; QEMU has no ENA device).
  • cloud-aws-ena-nic-live-proof (blocked on conformance + cloud-gcp-virtio-net-nic-driver; deferred until AWS access) — end-to-end ENA bind/send/receive/teardown on real AWS hardware.

Azure:

  • cloud-azure-disk-storage-driver (done) — the Azure Boost managed-disk NVMe cloud-shape classification delta on the shared NVMe foundation (make run-pci-nvme; docs/devices/azure-disk.md). The older-family Hyper-V/virtio-scsi path is out of scope (azure_scsi_path=no-userspace-provider-driver-out-of-scope). Live Azure evidence is the deferred cloud-azure-storage-live-proof.
  • cloud-azure-mana-nic-protocol-conformance (done) — MANA/GDMA protocol encode/decode in capos-lib/src/mana.rs with a host conformance suite vetted against the MANA Linux driver headers; provenance docs/devices/azure-mana.md. Gate: cargo test-lib (QEMU has no MANA device).
  • cloud-azure-mana-nic-live-proof (blocked on conformance + cloud-gcp-virtio-net-nic-driver; deferred until Azure access) — end-to-end MANA bind/send/receive/teardown on real Azure hardware, including SR-IOV VF revocation with fallback-to-synthetic.

Superseded umbrella records (do not dispatch):

Cloud milestones and per-provider paths:

  • First cloud milestone: imported-image serial-console boot. Closed for GCP by run 1778230874-715a (2026-05-08) against source commit 3951e275: make cloudboot-test imported the capos-cloudboot-image tarball, started an e2-small with no public IP and no service account, observed capos kernel starting on serial, and tore down cleanly. Does not require or prove cloud NIC/block-device drivers beyond the boot path.
  • Second cloud milestone: GCP-first usable instance provider rollup. The selected operator path, provider storage, and provider NIC data path are closed by cloud-usable-instance-provider-nic-storage: serial-console shell access on real GCE, live legacy virtio-net raw-frame provider-nic-bound, live NVMe Persistent Disk brokered READ, and separate live gVNIC raw-frame / typed-Nic portability evidence. Scope split (decided 2026-06-02, network-reachable-datapath-scope-decision): the network data-path reachability sub-requirement is raw-frame TX/RX over the live NIC (GCE polling-path slices 1-4 + slice 6); the SSH/WebShell / network terminal access sub-requirement is L4 and is deferred to networking-proposal Phase C.
  • Add NVMe controller init (brokered admin queue pair + identify on no-IOMMU). Closed by the brokered enable / admin / IDENTIFY / interrupt-wake child chain ending 2026-05-28.
  • Add NVMe I/O queue pair (submission/completion rings + doorbell writes). Closed by nvme-io-queue-and-read on 2026-05-28.
  • [~] Add NVMe read/write commands with PRP-based DMA transfers; no-IOMMU PRPs are manager-materialized from live buffer authority. READ and WRITE are done (see the NVMe chain above); multi-block PRP-list (count > 8) remains.
  • Implement BlockDevice for NVMe. Done via the BlockDeviceBackend:: NvmeBrokered read/write/flush arms (still per-proof-feature-gated for activation pending the capstone graduation).
  • Add QEMU NVMe metadata-only PCI testing via -device nvme.
  • [~] Extend QEMU NVMe testing to cover controller init, queues, PRP DMA, and BlockDevice behavior. Controller/admin, I/O queue, READ/WRITE/FLUSH, and BlockDevice read/write/flush plus dedicated data-completion interrupts over -device nvme are covered; NLB>1 PRP-list and always-built graduation remain.
  • [~] GCP storage path: NVMe Persistent Disk on a third-generation GCE shape has one live brokered READ proof (cloud-gcp-storage-driver, run 1780806087-bf69). The older virtio-scsi Persistent Disk path, Local SSD, and reusable filesystem-backed storage provider remain future work. Keep virtio-blk as a local/QEMU block-driver proof only unless a provider target explicitly exposes it.
  • GCP NIC path: virtio-net first where supported, then gVNIC for newer machine families, Confidential VM paths, generation-3-or-later shapes, and higher network performance tiers. The virtio-net raw-frame provider gate passed on live GCE, and the gVNIC portability lane below now has live raw-frame and typed Nic evidence. High-throughput, multiqueue, public ingress, and first-public-Web-UI productization remain future tasks.
  • AWS storage path: NVMe on Nitro-backed EBS instances. Treat AWS Nitro as an NVMe storage dependency rather than a virtio-blk path.
  • AWS NIC path: ENA driver, including ENA queue setup, MSI-X routing, and Nitro generation/version expectations. Do not claim AWS network support from QEMU virtio-net evidence.
  • Azure NIC path: MANA driver and Mellanox mlx4/mlx5 accelerated-networking fallback awareness where Azure exposes SR-IOV VFs. Driver lifecycle must tolerate dynamic VF binding and revocation by falling back to the synthetic interface rather than assuming the VF is permanent.

Cloud Benchmark Reruns

Visible outcome: once capOS reaches a first real cloud-VM boot, rerun the current benchmark profiles on that boot path and separate cloud evidence from local QEMU/KVM evidence.

Open gates:

  • Define the first supported cloud benchmark profile after the booted cloud hardware surface is known. At minimum, rerun boot/session smokes and any CPU-only benchmark such as run-smp-process-scale, and later run-thread-scale, that does not depend on missing cloud NIC or block drivers. A GCE n2-highcpu-8-class nested-KVM host is a reasonable first CPU-only benchmark target if /dev/kvm is usable by the benchmark user.
  • Record provider, region, instance type, CPU topology, cloud image id, firmware/device model, nested-KVM state, QEMU CPU pinning/isolation policy, and serial-console collection method in the benchmark artifact.
  • Retain provenance for the exact disk/cloud image, kernel, manifest, embedded binaries, host toolchain, and cloud image import path.
  • Compare cloud-VM results with local QEMU/KVM results only as separate environments; do not replace the selected local proof gate with a cloud result unless the milestone explicitly changes.

Cloud Device Tracks – Real GCE Polling Path (decoupled from MSI-X)

Decision (2026-06-01): the real-GCE-boot milestone (userspace virtio-net driver binding a real GCE NIC plus a reachable network data path) is decoupled from device-autonomous MSI-X interrupt delivery. The production data path uses polling the used ring, which already works on the non-qemu cloud kernel: the landed cloud_virtio_net_rx_userspace_submit_proof does a real device->host RX DMA (used_len=76) with zero interrupts, via the always-built virtio_transport + poll_used_idx. Every TX/RX data movement and completion in the repo is already polled; device-autonomous MSI-X remains a parallel efficiency follow-up, not a boot blocker. The local MSI-X track is now closed: the missing precondition was explicit PCI COMMAND memory-space/bus-master enablement in the proof path. With pci_command=0x0107, local QEMU/KVM delivers virtio-net RX MSI-X vector 0x50 through the guest IDT path with int_injected=0, idt_handler_observed=true, and one deferred-EOI acknowledgement. Live-GCE interrupt evidence remains outside the polling-path critical path.

Production-kernel ground truth (verified): PCI/ECAM enumeration, device_manager, the bounce-buffer DMA backend, MSI-X programming, and all three DDF grant sources are already always-built. Still cfg(feature = "qemu")-stubbed in production (the real gap): kernel/src/virtio.rs (legacy driver + smoltcp + cap/network.rs TCP/UDP socket caps) → virtio_stub.rs returns DeviceUnavailable.

Ordered slices (only the last is billable; none require interrupt delivery). Slices 1-5d are done; the legacy real-GCE blockers found in flight are all closed locally:

  1. RX polled-completion-no-inject local proof (done 2026-06-01) — flipped the RX-submit proof’s completion observation from the kernel-injected dispatch proxy to the already-latched polled used-ring state (make run-cloud-provider-virtio-net-rx-polled-completion).
  2. Polled provider default manifest (done 2026-06-01) — graduated the polled RX+TX provider off the per-proof feature into always-built cap::virtio_net_polled_provider, staged by a manifest-observable condition (make run-cloud-provider-virtio-net-polled-provider-default).
  3. Real-polled-driver provider-nic-bound (done 2026-06-02) — re-pointed cap::provider_nic_bind_proof::report so the marker fires only after the real polled provider completes a TX+RX over the live function, removing the kernel-side dispatch-slot proxy as the source. The literal system.cue fold remains the open remainder (make run-cloud-provider-nic-bound-real-polled-driver).
  4. Polled teardown / stale-authority (done 2026-06-02) — ported the S.11.2 hostile-smoke discipline (DMA/MMIO/IRQ stale-authority rejection, release/reset/driver-death teardown, no host-physical export) to the real polled production provider.
  5. Network-reachable-datapath scope decision (done 2026-06-02) — Option A: the milestone’s “reachable network stack” bar means raw-frame TX/RX reachability over the live NIC, because the billable make cloudboot-test gate checks no L4 socket round-trip. Slices 1-4 + slice 6 close that bar. L4 sockets (smoltcp + cap/network.rs socket caps off cfg(qemu) virtio.rs) are a separate future track (networking-proposal Phase C). Decision doc: network-reachable-datapath-scope-decision. 5b. [x] Legacy/transitional virtio 0.9 bind (decomposed 2026-06-02) — the real GCE NIC is a legacy/transitional virtio 0.9 device (PIO config BAR, INTx, no MMIO BAR, no MSI-X); the modern-only production polled provider returned no candidate on real GCE. Both decomposition slices landed 2026-06-02, so the local-proof acceptance is closed; the later billable slice-6 re-run also passed. cloud-prod-virtio-net-legacy-transitional-bind-local-proof.
    • 5b.1 [x] Legacy PIO select (done 2026-06-02) — kernel-brokered legacy PIO config access (pci::LegacyIoBar / pci::io_bar, scoped to the claimed I/O BAR, no ambient port authority) + legacy candidate selection with no MSI-X precondition (make run-cloud-provider-virtio-net-legacy-select, virtio-net-pci,disable-modern=on,vectors=0).
    • 5b.2 [x] Legacy datapath bind (done 2026-06-02) — single-PFN contiguous virtqueue materialization (frame::alloc_contiguous, reusing the modern ring helpers), legacy PIO notify, 10-byte legacy net header, polled TX (ARP) + RX over the legacy device with no MSI-X route (make run-cloud-provider-nic-bound-legacy). Sources exactly one provider-nic-bound from report_real_completion_legacy. 5c. [x] Legacy GCE-viable RX stimulus (done 2026-06-02) — the landed legacy proof’s RX stimulus was QEMU-SLIRP-only (spoofed ARP to 10.0.2.2); replaced by a broadcast DHCP DISCOVER from the device’s real MAC (legacy config 0x14), an accept-any inbound frame completion model, and a wall-clock (monotonic_ns) RX budget with an iteration-ceiling backstop. Marker carries rx_stimulus=dhcp-discover-broadcast, eth_src=device-mac, -srcmac.<12hex> (make run-cloud-provider-nic-bound-legacy). 5d. [x] Legacy large-queue-size (landed 2026-06-02) — live GCE legacy virtio-net advertises a 4096-entry virtqueue, exceeding the proof’s defensive MAX_LEGACY_QUEUE_SIZE = 1024. Raised to the virtio spec max 32768 (power-of-two enforced; non-power-of-two / over-bound / zero reject cleanly; alloc_contiguous fails closed without panic). QEMU caps queue size at 1024 and locks tx_queue_size at 256 for the non-vhost SLIRP legacy device, so the largest local shape is rx_queue_size=1024 (8-page RX single-PFN vring); the full 4096-entry materialization is a real-GCE attestation (make run-cloud-provider-nic-bound-legacy-large-queue).
  6. cloud-gcp-virtio-net-nic-driver (reopen) — DONE 2026-06-02 (run 1780412056-e1cb, e2-small, europe-west3-a, source commit 1fb65683): the real GCE boot bound the live legacy virtio 0.9 NIC (00:04.0, 1af4:1000) through the kernel-brokered legacy polled path and passed --require-provider-nic-proof. The full 4096-entry vring materialized on real hardware for the first time (rx_vring_pages=28 contiguous), the real GCE device MAC was read (src_mac=42:01:0a:c8:00:12), a broadcast DHCP DISCOVER was transmitted, and a real device->host RX DMA completed within the TSC-governed wall-clock budget (rx_used_len=532 ethertype=0x0800). Closes the GCE Polling Path track and retires the cloud-gcp-virtio-net-nic-driver blocker. The billable run was authorized on 2026-05-27 and recorded at commit 2aaeaa53; durable evidence is summarized in the completed task entry below. Dispatched as cloud-prod-gce-billable-boot-real-polled-nic-bound. To re-run the billable bind: build the cloudboot image from the legacy manifest system-cloud-provider-virtio-net-legacy-datapath.cue (not the modern system-cloud-provider-nic-bound-real-polled-driver.cue; the literal system.cue stages no provider), confirm make run-cloud-provider-nic-bound-legacy green on the build commit, then tools/cloudboot/run-test.sh --require-provider-nic-proof.

Real-Filesystem Track (2026-06-02)

The real-filesystem direction is decided in Real-Filesystem Decision: a role-split, not one on-disk format. capOS-managed state stays capnp-native (CAPOSWF1/CAPOSST1, evolved not replaced; crash-consistency already proven by make run-storage-writable-recovery); host-populated/interop images gain read-only FAT32 via the fatfs no_std crate; a single host capnp image tool retires the per-format tools/mkstorage-*.py byte-offset hazard. ext4-read is deferred behind an explicit trigger (“must read a disk capOS did not format”); FAT write is rejected (no crash-consistency story).

Landed: read-only FAT32 over virtio-blk (kernel/src/cap/fat_fs.rs, vendored vendor/fatfs-no_std/, make run-storage-fat-read, storage_fat_read feature on the existing read_only_fs_root source; provenance docs/devices/fat32.md), and read-only FAT32 over the graduated NVMe read arm (the Nvme BlockSource arm + deferred FatMount, cloud_fat_read_over_nvme_proof, make run-cloud-provider-fat-read-over-nvme). See docs/tasks/done/2026-06-02/ and done/2026-06-03/.

Open (next): the real-FS slice chain continues with FAT-over-NVMe follow-ups and timestamps/provenance on CAPOSST1/CAPOSRO1 where those layouts expose time metadata. FAT32 now surfaces valid host-authored directory-entry timestamps over both virtio-blk and NVMe through schema-stable File.stat values, with proof logs labeling the source as FAT metadata rather than trusted wall-clock custody. The capnp-native storage smokes and installable-system seeded variants now use the Rust host capnp image tool as the maintained fixture path; the retired Python capnp-layout fixture scripts are no longer referenced by the local proofs. The FAT image path stays on real mkfs.fat / mcopy tooling. ext4-read stays deferred behind its explicit trigger.

Phase C / L4 Track Opened (relocation, post raw-frame GCE proof) (2026-06-02; refreshed 2026-06-07)

The L4 socket reachability track — relocating the virtio-net driver and smoltcp into userspace processes (networking-proposal Phase C), sequenced after the cloud milestone per the network-reachable-datapath scope decision (Option A) — is designed in Phase C Userspace NIC Driver Relocation. It is no longer waiting on a new security ruling: the selected-write common-config and DMA-address export pieces landed through the bounded Phase C slices, reusing the accepted notify-doorbell discipline and the landed bounce/IOVA-export DMA isolation posture. The lower-layer blocker for Web UI on a GCE instance is production L4 plus live IPv4 configuration. The full boot-resource UI bundle is separate parallel work: it is ready and should close before claiming a useful public Web UI, but it is not the raw NIC/L4 blocker.

Current task chain:

  • cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof is Phase C slice 6 (DONE 2026-06-03). It removed the last coupled raw-frame Nic.receive self-stimulus.
  • cloud-prod-userspace-network-stack-smoltcp-local-proof is Phase C slice 7c-ii(b) (DONE 2026-06-07). It locally proves the selected serve-from-userspace architecture: the non-qemu cloudboot manifest starts a userspace smoltcp network-stack service, the service spawns an application client with only Console plus a served TcpListenAuthority, and the client completes one hostfwd TCP request/response through a served TcpListener and TcpSocket. The armed path now receives socket authority from the userspace smoltcp service for this proof rather than extending the legacy kernel cap/network.rs / virtio_stub.rs socket owner. The selected design is recorded in the Phase C proposal’s 7c-ii Mechanism and Decomposition section.
  • cloud-prod-legacy-kernel-network-socket-path-retirement is done. Non-qemu production manifests now reject legacy kernel network_manager / tcp_listen_authority grants, so the armed socket route stays behind the userspace network-stack service; remaining kernel socket grants are qemu-only fixtures.
  • cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal is done. It removes the kernel smoltcp dependency, retires the qemu-only kernel TCP/UDP runtime behind fail-closed socket entry points, and leaves the remaining virtio-net code as lower-layer QEMU fixture evidence rather than production cloud socket ownership.
  • cloud-prod-network-stack-dhcp-ipv4-config-local-proof is done. It follows the served-socket proof and locally proves DHCP/IPv4 lease acquisition, default-route installation, ARP/neighbor resolution, and userspace-served NetworkManager.getConfig status needed by a GCE-hosted listener.
  • Network Usability and Post-smoltcp decomposes the follow-on usability lanes: operator status tooling, DHCPv4 renewal/rebind/expiry/status beyond the first config proof, system DnsResolver, POSIX getaddrinfo, ping/ping6 diagnostics, socket readiness/cancel/backpressure, packet trace authority, and transport policy/status. These are not first public Web UI blockers except for the already-listed DHCP/IPv4 config proof.
  • remote-session-self-served-full-ui-bundle is done and provides the reviewed fixed-name boot-resource operator bundle for follow-on Web UI proofs.
  • cloud-prod-remote-session-web-ui-l4-local-proof now consumes the done userspace L4 and DHCP/IPv4 config proofs; it proves remote-session-web-ui locally on the non-qemu cloudboot socket path.
  • cloud-gce-legacy-virtio-webui-serving-local-proof is done (2026-06-11 04:26 UTC), proved by make run-cloud-gce-legacy-virtio-webui-serving. It closes the local legacy-datapath serving gap: a persistent kernel-brokered legacy virtio 0.9 polled runtime (cap::virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel feature cloud_gce_legacy_virtio_webui_serving_proof) backs the same typed Nic cap the modern path serves, and the Phase C userspace network stack plus remote-session-web-ui serve the fixed UI bundle to a host HTTP peer over the GCE NIC shape (disable-modern=on, no MSI-X), byte-verified against the committed bundle pin with a single cloudboot-evidence: legacy-virtio-webui-serving marker. PIO/vring ownership stays kernel-side; no host-physical, IOVA, queue, or port-I/O authority crosses the cap boundary. This closes only the LOCAL serving story – it does not claim private GCE reachability.
  • cloud-gce-private-self-hosted-webui-proof is on hold (2026-06-09). Its local prerequisites are done, and the legacy-datapath Web UI serving story is now locally proven (2026-06-11 04:26 UTC, above), but it still shares the missing firewall IAM / default-deny ingress blocker recorded on cloud-gce-private-icmp-echo-proof: the cloudtest credential cannot create firewall rules, so a private probe cannot reach the instance. It keeps the current no-public-IP cloudboot posture and requires a private probe that crosses the live GCE NIC under an explicit billable-run authorization.
  • cloud-gce-public-webui-ingress-tls-policy-design is done and records the selected ingress, TLS/certificate, firewall/source, browser session, and teardown policy for public exposure work.
  • cloud-gce-public-self-hosted-webui-ingress-tls is blocked on the private proof; public operator access is a separate exposure slice that implements the recorded ingress/TLS policy.
  • cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal is the done Phase C exit cleanup after userspace L4 was proven. It is not the first GCE Web UI proof, and it does not claim private GCE reachability, public ingress, or TLS.

Networking diagnostics and stack-completeness follow-ups:

  • cloud-prod-icmp-echo-reply-local-proof is done (2026-06-08). It consumes the done userspace L4 and DHCP/IPv4 config proofs, acquires a local DHCP lease, proves a same-subnet ARP plus ICMP Echo Request / Echo Reply exchange that preserves identifier, sequence, and payload, and rejects malformed or oversized requests with a bounded per-poll budget. This is diagnostics, not Web UI readiness.
  • cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof is done (2026-06-08), proved by make run-cloud-prod-icmp-echo-reply-real-nic-datapath. The done local responder proof above runs smoltcp over an in-process QueuePhyDevice: it injects the inbound Echo Request in-process and uses the real Nic cap only for the DHCP lease and ARP probe, so no inbound ICMP traverses Nic.receivePoll/Nic.transmit. The live GCE NIC is legacy virtio 0.9 (no userspace driver authority), so an inbound Echo Reply over the real NIC needs a kernel-owned responder on the legacy datapath. This task built that responder (cap::virtio_net_legacy_datapath_proof::run_icmp_echo_reply_real_nic_datapath) and locally proved it: a host peer over a QEMU socket netdev (not SLIRP, which drops inbound host->guest ICMP Echo) drives DHCP, ARP, multiple malformed Echo Requests (rejected; icmp_malformed_drops>=1), then a valid one, and the kernel answers an RFC 792 Echo Reply over the real RX/TX vrings, emitting cloudboot-evidence: icmp-echo-reply-real-nic-datapath <token> with rx_inbound_provenance=real-nic-rx-vring / in_process_queuephydevice=absent.
  • cloud-gce-private-icmp-echo-proof is blocked (2026-06-09) on GCP firewall IAM. Its harness, GCE-importable image (make capos-gce-private-icmp-echo-cloudboot-image), and probe orchestration (tools/cloudboot/run-test.sh --require-private-icmp-proof) are implemented and pre-spend-validated locally, and a real billable run (1780962265-4a2e) proved the GCE datapath: capOS DHCP-leased the exact GCE-assigned IP 10.200.0.38 over the live legacy virtio 0.9 NIC and emitted cloudboot-evidence: icmp-echo-reply-real-nic-datapath-ready, with the probe pinging that IP during capOS’s responder window. The pings showed 100% loss because GCE default-denies ingress and the cloudtest service-account credential lacks compute.firewalls.create/.delete/.list, so no temporary ICMP rule could be created; all resources tore down cleanly. Unblock by granting those firewall permissions to the cloudtest credential or pre-provisioning a persistent allow-ICMP rule in the cloudtest VPC network, then re-running make cloudboot-gce-private-icmp-echo-test. It proves private same-VPC ping over the live NIC with no public ICMP exposure and should not become a public HTTPS Web UI closeout condition unless a later ingress policy explicitly chooses ICMP health checks.

IPv6 Support Lane, Non-Blocking For First Public Web UI

The current Web UI cloud path is deliberately IPv4-first: Phase C userspace L4, DHCP/IPv4 configuration, ARP, private GCE reachability, and reviewed public HTTPS ingress remain the required blockers for the first public proof. IPv6 is not a reason to hold that path. It is a separate network-stack capability lane because the old qemu-only runtime remains IPv4-only and the legacy kernel-owned non-qemu socket fallback is retired; the Phase C userspace service path now carries the explicit address-family ABI. Private GCE IPv6 reachability and public IPv6 ingress policy remain unproven. Local ICMPv6 Echo Reply, GCE-style DHCPv6 configuration, and IPv6 TCP listener/connect behavior now have bounded local proofs.

The task chain is:

  • cloud-prod-ipv6-architecture-status-grounding is done (2026-06-03). It recorded the explicit current-state audit and the non-blocking decision, then unblocked the address-ABI task.
  • cloud-prod-network-address-abi-ipv6 is done (2026-06-03, the lane’s entry point). The socket/interface address ABI now represents IPv4 and IPv6 explicitly through IpAddressFamily and a documented address-length contract: getConfig reports the family plus an ipv6Supported flag, and the IPv4-only stack rejects IPv6 with a distinct ipv6Unsupported class and malformed lengths with malformedAddress, source- compatible for existing 4-byte IPv4 callers. Proof make run-cloud-prod-network-address-abi-ipv6.
  • cloud-prod-ipv6-link-local-nd-local-proof is done (2026-06-08). It enables the local smoltcp IPv6 feature set, installs a link-local address, verifies all-nodes plus solicited-node multicast joins, and proves a bounded Neighbor Solicitation / Neighbor Advertisement exchange plus cached-peer UDP egress locally. Proof make run-cloud-prod-ipv6-link-local-nd.
  • cloud-prod-ipv6-ra-slaac-local-proof is done (2026-06-08). It proves Router Solicitation, Router Advertisement acceptance, SLAAC address configuration, default-route installation, invalid-RA rejection, and prefix/default-route expiry locally. Proof make run-cloud-prod-ipv6-ra-slaac.
  • cloud-prod-ipv6-dhcpv6-gce-config-local-proof is done (2026-06-08). It proves a local GCE-shaped DHCPv6 Solicit / Advertise / Request / Reply exchange, installs the assigned /128, keeps default-route provenance tied to Router Advertisement, and rejects wrong source, wrong port, transaction-id, identifier, oversized-option, lease-lifetime/timer, and timeout cases. Proof make run-cloud-prod-ipv6-dhcpv6-gce-config.
  • cloud-prod-icmpv6-echo-reply-local-proof is done (2026-06-08). It proves bounded local ICMPv6 Echo Request / Echo Reply handling through the Phase C userspace smoltcp substrate, including identifier, sequence, payload preservation and checksum, type/code, address-family, and oversized-input rejection. It is diagnostics and stack completeness, not Web UI readiness.
  • network-ping6-diagnostics-tool-local-proof is done (2026-06-08). It proves a bounded local ping6-style diagnostic over the smoltcp ICMP socket path, including link-local scope reporting, configured global address status, malformed-reply drop, timeout/unreachable classification, one bounded retry after the neighbor-discovery timer, payload bounds, and one-outstanding-request enforcement. It remains diagnostics only and does not change the IPv4-first Web UI critical path or authorize public IPv6 ingress.
  • cloud-prod-ipv6-tcp-l4-local-proof is done (2026-06-08). It proves TCP listener and connect behavior through the production socket contract with IPv6 endpoints. Proof make run-cloud-prod-ipv6-tcp-l4.
  • cloud-prod-ipv6-real-nic-datapath-local-proof is ready. The done IPv6 proofs above run smoltcp over an in-process HarnessPhyDevice peer (markers self-declare metadata_only=true / public_ingress=not-attempted) and use the real Nic cap only for MAC/link status; the real-NIC TX/RX datapath exists today only for IPv4 (cloud-prod-network-stack-dhcp-ipv4-config-smoke). This task builds the IPv6 DHCPv6/RA + probe datapath over the real bound NIC and proves it locally, emitting cloudboot-evidence: ipv6-real-nic-datapath <token>.
  • cloud-gce-private-ipv6-reachability-proof is on-hold on missing GCP IAM access. The real-NIC IPv6 datapath proof above is now done, but its live-GCE acceptance fundamentally requires a dual-stack subnet (so the GCE NIC receives an IPv6 assignment at all) plus an IPv6 ingress firewall rule for the same-VPC probe. The cloudtest service-account credential lacks compute.networks.create / compute.subnetworks.* (the only existing cloudtest subnet is IPv4-only) and compute.firewalls.create/.delete/.list, so neither can be provisioned. Unblock by granting those permissions, or by pre-provisioning a dual-stack subnet plus an IPv6 ingress rule in the cloudtest VPC scoped to the probe. See the on-hold record for the consolidated blocker analysis and the parked codex/cloud-gce-private-ipv6-reachability-proof harness checkpoint.
  • cloud-gce-public-ipv6-ingress-tls-policy-update is blocked on the private IPv6 proof, then updates the selected public Web UI ingress/TLS policy for DNS/AAAA, IPv6 firewall, TLS coverage, and teardown before any public IPv6 exposure.

Non-blocking GCE gVNIC portability lane:

  • cloud-gce-gvnic-protocol-grounding-device-map is done. It landed the GCE gVNIC provenance map from the Google Cloud gVNIC docs and the Google/Linux GVE driver documentation: PCI identity (0x1ae0:0x0042), BAR/admin-queue/MSI-X wire subset, GQI/DQO formats, QPL/RDA addressing, and the planned DDF (DeviceMmio/DMAPool/DMABuffer/Interrupt) authority mapping. No capOS gVNIC driver or QEMU model exists yet.
  • cloud-gce-gvnic-image-launch-inventory-proof is done. It requests GVNIC image/instance launch posture, reads the GCE image/instance policy back, proves serial PCI inventory for the 1ae0:0042 function with BAR and MSI-X metadata, and records that no gVNIC driver bind was claimed. The live run used a private no-public-IP/no-service-account VM and completed teardown.
  • cloud-gce-gvnic-adminq-register-proof is done. It builds a proof-only cloudboot image, maps the live GCE gVNIC BAR0 through DeviceMmio, allocates manager-owned bounce-buffer DMA pages for the admin queue and descriptor, issues one DESCRIBE_DEVICE command, releases the admin queue, and checks stale DeviceMmio/DMAPool/DMABuffer handles. The live private GVNIC run completed teardown and recorded no userspace host-physical/IOVA export and no provider NIC bind.
  • cloud-gce-gvnic-raw-frame-tx-rx-proof is done. It builds a proof-only cloudboot image, configures one GQI/QPL TX queue and one RX queue over the live GCE gVNIC, sends one DHCP DISCOVER raw Ethernet frame from the device MAC, receives one inbound IPv4 frame, destroys queues, unregisters QPLs, deconfigures resources, releases/resets the admin queue, and records no provider Nic bind claim.
  • cloud-gce-gvnic-nic-cap-adaptation-proof is done. It adapts the proven GQI/QPL queue path behind the existing typed Nic semantics and emits gvnic-nic-cap-adaptation evidence with inline-frame TX/RX, MAC/link metadata, hidden queue addresses, no host-physical or IOVA export, and no provider-nic-bound claim. It remains a portability/future-machine-family lane; the first public Web UI proof can stay on the already-proven GCE virtio-net path.

Certificates / TLS Backlog

Bounded implementation slice chain for the certificates/TLS track. It decomposes Certificates and TLS into dispatchable slices and is owned by the Certificates / TLS track in docs/tasks/README.md. The dispatchable records live under docs/tasks/; this file is the long-form decomposition and sequencing rationale.

Grounding

  • Certificates and TLS – the schema surface and Phase 1-9 ordering. Phases 1-2 are the near-term target; Phase 1 is Certificate / CertificateChain / TrustStore / CertVerifier over a RAM-only webpki-roots store and a rustls-webpki verifier. The Phase 2 client-only local proof now completes a TLS 1.3 handshake over a userspace-served TcpSocket cap with embedded-tls; the server/config service surface remains future Phase 2 work.
  • Cryptography and Key Management – partial implementation. The minimal SymmetricKey, PrivateKey, and PublicKey ABI, RAM-only XChaCha20+HMAC/P-256 key cores, RAM-only KeyVault handle custody, and development-only software KeySource bootstrap exist for local proofs. There is still no persistence or production custody source, so production/public TLS and ACME remain blocked on a reviewed source that can mint key handles without exposing raw private-key material.
  • Time and Clock AuthorityWallClock Phase 1 landed (88cf4b5d): a read cap with wallTime and a ClockProvenance label, but the fixed-boot-base source reports Untrusted. Cert-validity (notBefore/notAfter) and OIDC exp/iat compare against it. Host-tested verify logic passes an explicit atEpochSeconds and needs no live clock; security-grade validity against an adversarial clock wants the trusted-provenance upgrade (WallClock Phase 2). Recorded as a sequencing dependency on the live consumer slices, not on the host verifier slice.
  • Phase C Userspace NIC Driver Relocation – the userspace TcpSocket cap the TLS stack wraps arrives via Phase C slice-7 (cloud-prod-userspace-network-stack-smoltcp-local-proof). The TLS stack is a userspace consumer of that cap and must not move into the kernel.

Sequencing Rationale

The proposal’s suggested shape (library -> handshake -> cert caps -> consumer) is reordered to land the lowest-risk real logic first, grounded in what exists:

  • The verifier path (TrustStore + CertVerifier over webpki-roots) needs no socket and no private key – it is pure no_std + alloc host-testable logic. It lands before the handshake.
  • A TLS client handshake needs a TcpSocket cap but no private key.
  • A TLS server (the Web UI consumer) needs a KeyVault-issued PrivateKey handle and a server cert source, so it remains the most-blocked terminal slice.

Slice Chain

#TaskProposal phaseStatusDepends on
1cloud-tls-vendor-rustls-webpki-roots-no-std-provenancePhase 1 depsdone
2cloud-tls-cert-truststore-certverifier-phase1-host-proofPhase 1doneslice 1
3cloud-tls-client-handshake-over-tcpsocket-local-proofPhase 2 (client)doneslice 2 + Phase C slice-7 socket cap
4cloud-tls-self-hosted-webui-terminated-endpointPhase 2 (server) consumerblockedslice 3 + key-cap surface + provider-terminated GCE Web UI proof
K0crypto-key-custody-tls-acme-decompositionkey-management precursordone
K1crypto-privatekey-publickey-ram-signing-local-proofkey Phase 1 subsetdoneslice 2
K2crypto-keyvault-ram-privatekey-custody-local-proofkey Phase 2 subsetdoneK1
K3crypto-development-keysource-tls-acme-bootstrap-local-proofkey source local bootstrapdoneK2
5cloud-tls-acme-account-order-local-proofPhase 3 (ACME core)doneslice 3 + K3
6cloud-tls-acme-http01-challenge-solver-local-proofPhase 3 (http-01)doneslice 5 + Web UI L4 path
7cloud-tls-acme-renewal-certstore-rotation-local-proofPhase 3 (renewal)blockedslices 4-6
8cloud-gce-public-webui-letsencrypt-direct-termination-proofpublic GCE successorblockedprovider proof + slice 7 + public DNS/authorization
  1. Vendor the Phase-1 verifier crates. rustls-webpki + webpki-roots as static-pinned no_std+alloc snapshots with VENDORED_FROM.md provenance, recorded under docs/trusted-build-inputs.md, proved to build for the bare-metal target. Slice 3 later selected embedded-tls for the local client proof’s no_std TLS state machine; the broader server/config service stack remains future work.
  2. Certificate / TrustStore / CertVerifier (Phase 1). Schema additions plus host-tested verify logic over rustls-webpki seeded by webpki-roots, with chain verification proved against committed good/bad vectors and explicit atEpochSeconds. No running cap service, no socket, no key – the lowest-risk real cert logic.
  3. Client TLS handshake over TcpSocket (Phase 2, client-only). Done 2026-06-08. A userspace process completes one TLS 1.3 client handshake over the Phase C userspace TcpSocket cap, validating the peer chain with the slice-2 verifier, with an observable local QEMU proof. The no_std determination selected a vendored embedded-tls 0.19.0 client state machine for this local proof rather than full rustls.
  4. capOS-terminated Web UI endpoint (terminal consumer). Serves the Web UI over capOS-held TLS as a direct-termination successor after the first GCE public Web UI proof closes through provider-terminated HTTPS. Deeply blocked: needs a KeyVault-issued PrivateKey cap and a server cert source (ACME / provisioned).
  5. Minimal key-custody decomposition. Done. It decomposes the missing PrivateKey / KeyVault / KeySource subset into the three implementation records below, keeping production hardware/cloud custody out of the local TLS/ACME bootstrap.
  6. PrivateKey / PublicKey RAM signing proof. Done 2026-06-04. Adds the minimal asymmetric-key ABI and host-tested RAM signing core: sign/public/info, public verify/export/info, purpose metadata, and no raw private-key export.
  7. RAM KeyVault custody. Done 2026-06-05. Adds handle-based key generation/open/list/destroy and a local QEMU proof for TLS and ACME account key handles, still RAM-only and not production custody.
  8. Development-only KeySource bootstrap. Done 2026-06-05. Grants local proofs a development software key source that mints key handles without putting raw private keys in manifests, images, logs, task records, or evidence, and is rejected for production/public profiles.
  9. ACME account/order local proof. Done 2026-06-08. capos-tls now has a no_std+alloc ACME account/new-order/CSR-finalize/certificate-retrieval core, with ES256 JWS signing through AcmeAccount PrivateKey caps and CSR signing through a separate TLS-purpose PrivateKey cap. Challenge validation stays fake or pre-authorized here; the proof does not call Let’s Encrypt staging or production.
  10. Scoped http-01 solver. Done 2026-06-09. capos-tls adds a bounded, token-scoped Http01ChallengeSolver and the RFC 8555 http-01 authorization flow (pending order, authorization fetch, key-authorization derivation via the RFC 7638 account-key thumbprint, challenge response, out-of-band validation, and cleanup). remote-session-web-ui serves only /.well-known/acme-challenge/<token> for currently-published tokens through that same solver; retired, unknown, sub-path, and traversal tokens fail closed (404). The host ACME http-01 test proves the protocol and cleanup; the Web UI L4 QEMU proof fetches the challenge through the served origin. It grants no generic route, static-file, DNS, or Web UI authority and adds no public CA call.
  11. CertificateStore.watch renewal and rotation. Proves local renewal with short-lived test certificates, storing the fresh chain under a stable handle and rotating the Web UI TLS server without restart.
  12. Public GCE Let’s Encrypt direct-termination proof. A separately reviewed successor after the provider-managed first public proof. It requires a public DNS name controlled for the run, explicit billable/public-ingress authorization, and explicit authorization before any Let’s Encrypt production call; staging remains the default external CA target.

Let’s Encrypt / ACME Public TLS Decomposition

Let’s Encrypt support is implementable for the public TLS milestone only as the capability-native, capOS-terminated successor path. It is not the already selected closeout path for the first public GCE Web UI proof. That first proof continues to terminate HTTPS at the GCP external load balancer with a provider-managed certificate, no capOS private-key custody, and no raw public HTTP closeout.

The missing prerequisites are represented as named task records:

Local proofs and public CA/cloud proofs stay distinct. The ACME account/order, challenge, and renewal slices use a local RFC 8555-compatible directory and local QEMU/cloudboot paths. A public GCE/Let’s Encrypt run requires a separately authorized harness mode, a controlled public DNS name, public-ingress teardown evidence, and no private key material in manifests, images, logs, task records, or evidence directories.

Next Gap

Slices 1 and 2 landed on 2026-06-03: rustls-webpki and webpki-roots are vendored as static-pinned no_std+alloc snapshots, and capos-tls contains the Phase 1 Certificate / TrustStore / CertVerifier host verifier proof over those crates. K1 landed on 2026-06-04: capos-tls also contains the minimal RAM-only P-256 PrivateKey / PublicKey signing core. K2 landed on 2026-06-05: RAM-only KeyVault generation/open/list/destroy handle custody for those keys. K3 landed on 2026-06-05: local development software KeySource bootstrap now mints TLS and ACME account key handles without raw private-key material in manifests or evidence and rejects production/public profiles. Capability-infrastructure key-cap reconciliation landed on 2026-06-06: the minimal RAM-only SymmetricKey ABI and local AEAD/MAC proof now exist. Slice 3 landed on 2026-06-08: the local QEMU proof now completes one TLS 1.3 client handshake over a userspace-served TcpSocket cap and validates the peer chain with capos-tls. ACME slice 5 landed on 2026-06-08: capos-tls now proves account registration, order creation, CSR finalize, and returned-chain parsing against a local RFC 8555-style directory using purpose-scoped key caps. ACME slice 6 (proposal item 10) landed on 2026-06-09: the scoped http-01 solver now serves bounded /.well-known/acme-challenge/<token> responses through remote-session-web-ui, with the http-01 authorization/validation/cleanup flow proven host-side and the served route proven in the Web UI L4 QEMU proof. The next ACME gap is renewal and certificate-store rotation (slice 11). The next server-side TLS behavior gap remains the Web UI consumer, still blocked on reviewed server key custody and a certificate source. The behavior chain then advances slice-by-slice – each kernel/lib-first with a local proof – until the Web UI consumer slice can add a separately reviewed direct-termination successor after cloud-gce-public-self-hosted-webui-ingress-tls closes with provider-terminated HTTPS. The key-custody local-proof precursor is now complete for PrivateKey / PublicKey, RAM KeyVault, and development KeySource; production custody remains future.

Installable System Backlog

Detailed decomposition of Installable System: an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.

docs/tasks/README.md links here. Installable System became the selected milestone after the Device Driver Foundation closeout and is now closed for the bounded local/QEMU installable-system contract. The behavior track below landed through item 8, and item 9 reconciled the proposal/body wording to the landed install, provision, update, and rollback contracts. This milestone does not pull public L4 ingress, AWS/Azure live support, direct-remapping production hardware, userspace smoltcp/L4 readiness, secure boot/signing, or production release authority into selected scope.

Landed Foundations (What This Builds On)

These contracts exist today and the track decomposes against them, not against the proposal’s projected shapes. Present tense is landed behavior.

Building blockLanded contractSource
BlockDevicereadBlocks/writeBlocks/info/flush over a real cfg(qemu) virtio-blk device; blockDevice grant sourcekernel/src/cap/block_device.rs; proof make run-virtio-blk
Read-only filesystemCAPOSRO1 fixed superblock at LBA 0; Directory.list/open/sub + File.read/stat; mutating methods fail closed; readOnlyFsRoot grant sourcekernel/src/cap/readonly_fs.rs; proof make run-storage-fs
Persistent content-addressed StoreCAPOSST1 superblock at LBA 0; put/get/has/delete keyed by SHA-256 hash; superblock rewrite is the durable commit point; survives reboot; persistentStore grant sourcekernel/src/cap/persistent_store.rs; reboot proof make run-storage-persist
Writable filesystemCAPOSWF1 superblock at LBA 256; full Directory mutation set + File.write/truncate/sync/close; fail-closed single-writer policy; writableFsRoot grant sourcekernel/src/cap/writable_fs.rs; reboot proof make run-storage-writable
Co-located storage imageOne disk co-locates CAPOSST1 Store (LBA 0) and CAPOSWF1 filesystem (LBA 256) so both survive reboot togethertools/mkstore-image --writable
Read-only packaged-image source fixtureQEMU-gated read-only Directory/File over the booted CD-ROM ISO 9660 /boot/bins/ tree (boot_iso ATAPI reader); installable_image_source grant source; physically scoped to the ATAPI medium, cannot reach the writable target disk; not a general post-bootstrap filesystem servicekernel/src/cap/installable_image.rs; proof make run-installable-image-source
Default bootable disk imageSingle hybrid BIOS+UEFI raw image with one GPT ESP (FAT32) carrying Limine, kernel, a name-only manifest.bin, and Limine module payloads under /boot/bins/; make image, run-disk, run-disk-bios; GCP/AWS provider packaging.tools/mkdiskimage.sh, tools/package-cloud-image.sh; proof make run-limine-disk-boot-modules
NamespaceRAM-backed resolve/bind/list/sub name-to-hash bindings; not persistent; namespace grant sourcekernel/src/cap/namespace.rs
Boot manifest / initBaseline boot loads the manifest module. Default raw disk and cloudboot images resolve ordinary service binaries from checked Limine modules; default ISO build/run/smoke paths resolve ordinary service binaries from the name-only ISO /boot/bins/ tree through boot_iso; the reserved init selector still uses the kernel-embedded init ELF. run_init parses SystemManifest, builds init’s bootstrap caps, and enters initConfig.init. The installable data-region path additionally reads and validates system/config/overlay.bin when the data region mounts and the base manifest declares matching extension pointskernel/src/main.rs, kernel/src/boot.rs, proof make run-installable-overlay; make run-smoke

Divergences From The Proposal (Structural Reconcile Closed)

The proposal was written before the storage prerequisites landed and projects shapes that differ from the landed contracts. The initial reconcile task recorded the landed storage contracts and placement decisions before the behavior track ran; the behavior track then landed through item 8 below. The structural docs reconcile task updated the proposal’s structure and body wording to the landed install, provision, update, and rollback contracts without broadening selected scope.

  • On-disk layout. The proposal projected three partitions (boot / system / data) on the installed disk. The base make image raw boot image still produces a single hybrid BIOS+UEFI image with one GPT ESP, and the co-located CAPOSST1 Store + CAPOSWF1 writable filesystem remains available as a separate data-region image for the focused storage and early data-region smokes. The landed installable-system disk path no longer stops there: task 5 (installable-bootable-disk-system-data-regions) folds that co-located data-region image into the bootable disk as GPT partition 2 at the fixed data-region LBA, so make run-installable-disk boots from one disk carrying both the ESP and the persistent data region. Tasks 2-4 describe the separate auto-mounted data-disk model they originally built on; item 5 is the landed integrated single-disk packaging.
  • Persistent naming. The proposal stores the overlay “under a well-known system Namespace root (system/config/<generation>)”. The landed Namespace is RAM-only and does not survive reboot. Persistent naming and the active/known-good pointers are therefore grounded in the landed writable filesystem (CAPOSWF1 paths and small marker files) plus the content-addressed persistent Store (CAPOSST1) for immutable generation objects. No new persistent-Namespace kernel cap is assumed.
  • Generations and epochs. No landed Store, Namespace, or SystemManifest schema carries a system-generation/epoch field today (other caps such as AccountRecord and the DDF revocation generations do, but not the installable-system path). The monotonic epoch + content hash the proposal relies on for stale-write rejection and rollback are carried inside the overlay’s own capnp object and the writable-filesystem marker files, not by extending Store or Namespace.
  • Composition machinery. Landed in track item 3 (2026-05-26): the SystemConfigOverlay capnp object plus SystemManifest.extensionPoints, and init’s read/decode/validate/compose with base-pins-win / overlay-adds-within- declared-extension-points / no-new-authority precedence and fail-closed base-floor fallback (make run-installable-overlay). Generation/rollback selection of which overlay is active landed in task 4, and the install/provision/update/rollback flows landed in tasks 6-8.

Ordered Track

Each item names its acceptance and the landed APIs it builds on. Dispatchable records live under docs/tasks/ with the same ids.

  1. installable-system-proposal-reconcile (docs). Done 2026-05-26. Reconciled installable-system-proposal.md to the landed contracts above: single hybrid ESP boot image (no three projected partitions), persistent config region grounded in the writable filesystem + content-addressed persistent Store, RAM-only Namespace (naming/pointers are writable-filesystem files), and no system-generation field on the Store/Namespace/SystemManifest path. Recorded the data/system region placement decision. The structural proposal/body wording update is closed by item 9. Builds on: all five storage/boot prerequisites.

  2. installable-data-region-boot-mount (behavior). Done 2026-05-26 02:02 UTC. Wires the persistent data region (the co-located CAPOSST1 Store + CAPOSWF1 writable filesystem) into the boot path: under the installable_data_region kernel feature, run_init best-effort calls cap::grant_data_region, which mounts the auto-attached data disk, scopes a writable Directory to the system/config subtree (writable_fs::mount_config_root), and grants init that Directory (data-config) plus the persistent Store (data-store) under well-known CapSet names – granted together or not at all. It fails closed wholesale to the base manifest (caps unchanged, “no data region; base floor” diagnostic) when the disk is absent, has a malformed superblock, or is missing the system/config root; the disk caps stay out of the manifest so the fail-closed boots do not abort on a mandatory cap. No new kernel cap type or schema change. Proof: make run-installable-data-region boots the same ISO three times – seeded disk (init prints the resolved system/config contents), no disk (base floor), and zeroed-superblock disk (base floor). Builds on: writable_fs.rs, persistent_store.rs, the co-located image tool (--seed-config), and run_init.

  3. installable-config-overlay-schema-and-merge (behavior, schema). Done 2026-05-26 02:55 UTC. Added the persistent SystemConfigOverlay capnp object (overlay version, monotonic epoch, SHA-256 content hash, additional services, network/runtime settings, account-store location) and the SystemManifest.extensionPoints declared extension points (ManifestExtensionPoints: additional-services allowance/count, allowed service caps, minOverlayEpoch, settings allowances – closed by default). Init now reads system/config/overlay.bin from the granted config region, decodes it (SystemConfigOverlay::from_capnp_bytes re-validates the overlay version and content hash), and composes it over the base plan (compose_onto): base pins win (no base-service-name or init collision), overlay services name only base-shipped binaries and request only allowedServiceCaps (no new authority classes), epoch >= minOverlayEpoch, count <= maxAdditionalServices, and carried settings require their allowance. Any schema-invalid, version-mismatched, content-hash-mismatched, stale-epoch, extension-point-violating, or missing overlay is rejected whole; init boots the base floor and surfaces [init] overlay rejected: <reason>. Host encoder tools/mkmanifest mkoverlay bin emits the overlay bytes (filling the canonical hash); tools/mkstore-image --writable --seed-overlay seeds system/config/overlay.bin. Proof: make run-installable-overlay boots the same featured ISO three times – valid overlay (the overlay-extra service runs), base-pin collision (rejected, base floor preserved), corrupt overlay (rejected, base floor). Did NOT add generations/rollback (task 4). Builds on: task 2, capos-config manifest validation, schema/capos.capnp SystemManifest.

  4. installable-system-generation-rollback (behavior). Done 2026-05-26 03:41 UTC. Userspace-only over the already-granted persistent Store + writable system/config Directory; no schema or kernel change. Represents system-config generations as content-addressed Store objects keyed by SHA-256 (immutable, deduped), tracks the known-good active pointer and a staged/attempting candidate pointer as monotonic-pointer-epoch marker files (gen-active/gen-candidate) in the writable config region, records a boot attempt durably before applying a candidate, and auto-falls-back to the known-good generation when a candidate is left unconfirmed (a boot that does not reach the health checkpoint) – the brick-proofing guarantee. Also promotes a confirmed candidate, rolls config back to a retained prior generation (monotonic pointer advance pointing at older content), and rejects a stale/replayed (lower-or-equal-epoch) pointer. init/src/main.rs run_generation_rollback_checks, gated by a base service named generation-proof, exercises all of this end-to-end against the real durable primitives with observable [gen] ... assertions. Did NOT add the installer (task 6) or update flow (task 8). Proof: make run-installable-generation boots a --seed-config disk twice – boot 1 exercises the full mechanism in one boot and durably leaves an unconfirmed attempting candidate; boot 2 re-reads the committed markers from a fresh mount and proves across-reboot auto-fallback to the known-good generation. Builds on: task 3 and the persistent Store/writable filesystem durability.

  5. installable-bootable-disk-system-data-regions (behavior). Done 2026-05-26 04:31 UTC. One integrated bootable disk now carries the boot ESP (GPT partition 1) and the co-located CAPOSST1 Store + CAPOSWF1 writable data region (GPT partition 2), and boots through the landed task 2-4 path reading the data region from the same disk it booted from – not a separate smoke-only drive. tools/mkdiskimage.sh gained --data-image / --data-offset-bytes (it folds the tools/mkstore-image --writable image into a second GPT partition) and derives the ESP size from --esp-sectors (the integrated disk uses the same 128 MiB ESP as the raw disk-image targets so a debug kernel fits). The kernel installable_disk feature (implies installable_data_region) adds a fixed data-region base LBA (cap::data_region_base_lba = 264192) applied at the single persistent_store/writable_fs read_range/write_range choke points; the kernel trusts that fixed tool/kernel layout contract rather than parsing the GPT, exactly as the superblock LBAs already are. Proof: make run-installable-disk builds one disk (boot ESP + seeded data region carrying a valid config overlay) and boots it as a single virtio-blk device; the gate is data region: mounted from the boot disk plus [overlay-extra] started via overlay – a service only the data region supplies – not a clean boot alone. Did NOT add the installer (task 6). Builds on: mkdiskimage.sh, tools/mkstore-image --writable, task 4.

5a. ddf-multi-virtio-blk-device-support (behavior, DDF milestone). Lift the single-virtio-blk limit (per-device driver instance, DMA pool key, interrupt route, PCI claim) and add a target-disk BlockDevice/Store grant source scoping a cap to a specific device. Owned by the Device Driver Foundation milestone (docs/backlog/hardware-boot-storage.md “Reusable Block-Device Path”); the install flow depends_on it. Builds on: the landed device-agnostic transport seam and per-queue-keyed DMA ledger.

5b. installable-userspace-image-source (behavior, DONE 2026-05-26 08:15 UTC). Expose a userspace-readable read-only packaged-image source so a userspace installer can read the packaged boot/system bytes. Chosen shape: a QEMU-gated read-only Directory/File cap over the existing boot_iso ISO 9660 reader (kernel/src/iso/), not the boot-package payload alternative and not a general post-bootstrap filesystem service. The installable_image_source grant source (KernelCapSource @45) mounts the booted CD-ROM ISO 9660 /boot/bins/ tree and serves Directory.list/open

  • File.read/stat; every mutating method fails closed. It is physically scoped to the ATAPI CD-ROM medium and cannot reach or mutate the writable virtio-blk target disk (blockDeviceTarget/writableFsRoot) – that write authority belongs to the install flow (task 6). Offsets/lengths are validated against the file extent before any device access, reusing the driver’s in-bounds checks; a past-EOF read clamps to empty and an absent name is rejected. Broader package browsing remains userspace-service work rather than an expansion of this fixture. Proof: make run-installable-image-source (kernel cap module kernel/src/cap/installable_image.rs; consumer demo demos/installable-image-source/; manifest system-installable-image-source.cue; harness tools/qemu-installable-image-source-smoke.sh). Builds on: task 5, the boot_iso ISO 9660 reader.
  1. installable-system-install-flow (behavior). Done 2026-05-26 10:12 UTC. The capos-system-install userspace service (demos/installable-system-install/) installs a bootable capOS onto a blank target disk using only two granted caps: the read-only installable_image_source Directory over the booted CD-ROM /boot/bins/ and the target-scoped block_device_target BlockDevice selected by manifest PCI identity, never the boot disk. It copies the packaged bootable boot-region head (BOOTHEAD.BIN: protective MBR + primary GPT + the FAT ESP with Limine + release kernel + base manifest) to LBA 0, writes the backup GPT (BOOTGPT.BIN) at the LBA read from the primary GPT header (Limine validates it), and initializes an empty data region (DATAIMG.BIN: empty CAPOSST1 Store + CAPOSWF1 filesystem with just the system/config directory) at the fixed cap::data_region_base_lba. It validates every sector range before writing and verifies the read-back. The empty data region is the install floor; the operator’s first non-empty config generation is provisioning (task 7), not install. Proof: make run-installable-install – pass 1 installs into the manifest-selected virtio-blk target disk; pass 2 boots that disk standalone (no CD-ROM) and reaches the base service with its data region mounted (data region: mounted + [init] data-region mounted: system/config entries=0 + [console-paths] Console paths ok.), not a clean boot alone. The build packages the boot region split into head + backup GPT (tools/split-boot-region.py) so the installer reads only the populated ~15 MiB over the slow ATAPI PIO path rather than the whole FAT32 ESP. Did NOT add provisioning (task 7) or update/rollback (task 8). Builds on: task 5, the BlockDevice sector path, content-addressed Store, and the precursors 5a ddf-multi-virtio-blk-device-support and 5b installable-userspace-image-source.

  2. installable-system-provision-flow (behavior). Done 2026-05-26 11:09 UTC. The capos-system-provision userspace service (demos/installable-system-provision/) runs as PID 1 over an installed system’s persistent data region and performs the proposal’s “Provision” flow, holding only three caps: a Console, the writable filesystem root (writable_fs_root, navigated to system/config), and the content-addressed persistent Store (persistent_store). On a disk whose system/config carries no active generation yet (the empty install floor task 6 leaves), it writes the operator’s first non-empty SystemConfigOverlay generation (epoch 1: an operator AccountRecord stored as a content-addressed Store record named from the overlay’s accountStoreLocation, a hostname, a log level, and one additional service), commits the generation object to the Store, writes system/config/overlay.bin (the shape init’s apply_config_overlay consumes, proven by task 3), and advances the gen-active pointer. It dispatches on durable state: a second boot of the same disk re-reads the gen-active pointer, resolves the generation object and operator account from the Store, and verifies the provisioned account/settings are the active durable config that survived the reboot. Did NOT add the update/rollback flow (task 8); reuses the overlay object and the existing AccountRecord schema with no schema change. Proof: make run-installable-provision boots the same --empty-config disk twice – pass 1 provisions and commits, pass 2 verifies the active generation + operator account + settings survived; a clean boot alone is not the gate. Builds on: task 6 (the empty install floor), task 3 (the overlay object and merge), task 4 (the generation/gen-active representation), and the writable-filesystem + persistent-Store durability.

  3. installable-system-update-rollback-flow (behavior). Done 2026-05-26 11:35 UTC. The capos-system-update userspace service (demos/installable-system-update/) performs the proposal’s “Update” flow on top of the landed generation/rollback mechanism (task 4), userspace-only over the same three caps provision holds (Console, writable_fs_root navigated to system/config, persistent Store); no schema or kernel change. It writes a new SystemConfigOverlay generation into the content-addressed Store as a new root hash (old generation objects remain; the shared operator AccountRecord dedups), stages it as an attempting gen-candidate pointer without advancing the known-good gen-active pointer, and on the next boot commits by advancing active only when the candidate reaches its health checkpoint – otherwise the boot-attempt-vs-confirmed auto-fallback keeps the prior known-good. The overlay re-validation against the new base reuses the production SystemConfigOverlay::compose_onto against a base plan whose extension points revoked the overlay’s authority, so an update whose new base no longer admits the overlay falls back to the base floor with a surfaced error rather than applying. The data region (operator account + active config) is carried across every transition. It dispatches on durable state (update-phase marker) so commit-on-success and auto-fallback are both proven across a REAL reboot, not one process. Proof: make run-installable-update boots the same --empty-config disk THREE times – boot 1 provisions known-good gen1, rejects an overlay against a revoked-cap new base (kept base floor), and stages a healthy candidate gen2; boot 2 commits gen2 across the reboot and stages a failing candidate gen3; boot 3 auto-falls-back from gen3 across the reboot to known-good gen2 – distinct per-generation content hashes and a stable account hash on every line, the staged/commit/fallback/ marker-survival/data-region-carried assertions, not a clean boot alone. Builds on: task 6 (the install floor), task 7 (the provision/overlay shape), task 4 (the generation/gen-active/gen-candidate representation), and task 3 (overlay compose/validation).

  4. installable-system-structural-doc-reconcile (docs-status). Done 2026-06-07 18:20 UTC through commit 12b8334a (committed 2026-06-07 18:19 UTC). Reconciled Installable System structural and body wording to the landed local/QEMU data-region, overlay, generation, install, provision, and update/rollback contracts. Preserved the RAM-only Namespace caveat and kept secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, userspace smoltcp/L4 readiness, and full durable account policy out of the closed Installable System scope.

Design Grounding

Cloud Image Import And Serial-Console Boot

Operator notes for importing the locally-boot-proven hybrid BIOS+UEFI disk image into GCP and AWS and reaching a serial-console boot. This is packaging and documentation only: tools/package-cloud-image.sh operates on a local artifact, adds no provider credentials, and performs no live cloud calls. Cloud NIC and storage driver readiness remain separate, blocked tracks (docs/backlog/hardware-boot-storage.md “Cloud Device Tracks”); the first cloud milestone is an imported-image serial-console boot, not a driver claim.

Local Artifacts

make image builds target/capos-image.raw (default 256 MiB, GPT, 128 MiB hybrid ESP + Limine MBR) and make run-disk / make run-disk-bios prove it boots under OVMF (UEFI) and SeaBIOS (legacy BIOS). Only run the import steps below once those local boot proofs pass; provider import only makes sense for a known-good image.

make package-cloud-image (or package-gcp-image / package-aws-image) repackages that artifact into target/cloud-image/:

ProviderOutputShape
GCPdisk.raw.tar.gzdisk.raw grown to a whole multiple of 1 GiB, GPT backup relocated, inside a gzip tar --format=oldgnu archive
AWScapos-aws.rawRAW (exact image size)
AWScapos-aws.vhdfixed VHD (conectix footer, disk-type fixed)
AWScapos-aws.vmdkstream-optimized VMDK

The helper self-verifies each shape (gzip + oldgnu tar member for GCP; the VHD fixed-disk footer and VMDK create-type for AWS) and fails if a conversion is wrong. It differs from make capos-cloudboot-image, which builds a from-scratch 10-GiB GCE disk for the tools/cloudboot/ end-to-end harness; the packaging helper repackages the small, already-boot-proven make image artifact instead.

GCP Custom-Image Import

GCP custom-image import requires a single file named exactly disk.raw, sized to a whole multiple of 1 GiB, in a gzip tar --format=oldgnu archive – exactly the disk.raw.tar.gz the helper emits.

make package-gcp-image
gsutil cp target/cloud-image/disk.raw.tar.gz gs://<your-bucket>/capos-disk.tar.gz
gcloud compute images create capos-hybrid \
  --project=<your-project> \
  --source-uri=gs://<your-bucket>/capos-disk.tar.gz \
  --guest-os-features=UEFI_COMPATIBLE

UEFI_COMPATIBLE lets the image boot through the GCE UEFI path (/EFI/BOOT/BOOTX64.EFI); the same image still carries the Limine MBR for the legacy boot path. After creating an instance from the image, read the boot landmark on the serial console:

gcloud compute instances create capos-test \
  --project=<your-project> --image=capos-hybrid
gcloud compute instances get-serial-port-output capos-test \
  --project=<your-project> | grep 'capos kernel starting'

The reference end-to-end GCE serial-console-boot flow (build, upload, import, boot, evidence capture, teardown) is the tools/cloudboot/ harness; see tools/cloudboot/README.md.

AWS VM Import

AWS VM Import/Export accepts RAW, fixed VHD, and stream-optimized VMDK. Upload one shape to S3 and import it as a snapshot, then register an AMI:

make package-aws-image
aws s3 cp target/cloud-image/capos-aws.vhd s3://<your-bucket>/capos-aws.vhd
aws ec2 import-snapshot --disk-container \
  Format=VHD,UserBucket="{S3Bucket=<your-bucket>,S3Key=capos-aws.vhd}"
# after the snapshot import task completes, register an AMI from the snapshot:
aws ec2 register-image --name capos-hybrid \
  --architecture x86_64 --root-device-name /dev/xvda \
  --boot-mode uefi \
  --block-device-mappings \
  '[{"DeviceName":"/dev/xvda","Ebs":{"SnapshotId":"<snap-id>"}}]'

Boot-mode notes

The hybrid image boots either firmware path, so the AWS boot mode is a deployment choice, not an image rebuild:

  • --boot-mode uefi – Nitro-based instance types boot the ESP /EFI/BOOT/BOOTX64.EFI (Limine UEFI). Recommended on modern Nitro instances.
  • --boot-mode legacy-bios – older/legacy instance types boot the Limine MBR path. Use this only if the target instance type does not support UEFI.

uefi-preferred is also valid and lets the instance type decide. RAW (Format=RAW, file capos-aws.raw) and stream-optimized VMDK (Format=VMDK, file capos-aws.vmdk) import the same way; choose the container your upload path prefers. AWS rounds the EBS volume size up to whole GiB on import, so the RAW shape is not pre-rounded.

Scope Boundary

These notes cover import + serial-console boot only. They do not enable cloud NIC or storage drivers, do not automate live cloud runs, and add no new trusted build inputs beyond format conversion of the already-pinned-Limine image. The provider-NIC/storage and cloud usable-instance tracks remain blocked in docs/backlog/hardware-boot-storage.md.

Local Users, Storage, And Policy Backlog

Design and task decomposition for manifest-seeded and disk-backed local user management. This work belongs to the User Identity, Sessions, And Policy track and depends on capability-native storage reaching at least a RAM-backed Store/Namespace proof before durable account mutation is meaningful.

Grounding

This decomposition is grounded in the current capability, identity, manifest, storage, and authority-broker documents:

  • docs/capability-model.md
  • docs/architecture/manifest-startup.md
  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/proposals/userspace-authority-broker-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/oidc-and-oauth2-proposal.md
  • docs/proposals/cryptography-and-key-management-proposal.md
  • docs/security/trust-boundaries.md
  • docs/roadmap.md
  • docs/tasks/README.md

Relevant prior-art research files:

  • docs/research/eros-capros-coyotos.md
  • docs/research/genode.md
  • docs/research/plan9-inferno.md
  • docs/research/sel4.md
  • docs/research/zircon.md

Design Position

user remains a human-facing policy term, not a kernel subject. The kernel should not learn uid, role, group, tenant, or external-claim semantics. Account records, roles, attributes, labels, profiles, and federation claims decide which capabilities a trusted broker may mint or delegate; they are never independent authorization tokens.

Terms

The identity vocabulary should be precise enough that later schemas do not accidentally recreate Unix users.

  • Principal: the stable identity key used by auth, policy, audit, and ownership metadata. A principal can represent a human, service, guest, anonymous caller, deployment, or pseudonymous external subject.
  • User: a user-facing category for a principal/session that represents a human or human-adjacent actor. It is not a kernel object, not a UID, and not an authority source. In design text, prefer principal, session, or account when one of those is meant.
  • Account: a durable local record for a principal. It binds credential references, status, roles, attributes, storage roots, quotas, and default policy/resource profile names. Some principals have no account: anonymous callers, some guests, and some one-shot external sessions.
  • Profile: a named policy template selected by account data, manifest seed data, external admission rules, or service configuration. A profile contains no authority by itself. It selects bundle fragments, quotas, ABAC defaults, labels, and approval eligibility that the broker may use when minting actual capabilities. Use policy profile or resource profile when the narrower meaning is intended; use plain profile only for prose that intentionally covers both.
  • Policy profile: the authorization template: roles, ABAC defaults, allowed bundle fragments, approval paths, label defaults, and external admission constraints.
  • Resource profile: the quota and default-resource template: storage, memory, CPU share, process/thread/cap limits, IPC limits, log volume, network posture, and launcher posture.
  • Session: a live authenticated, guest, anonymous, or external context. It has freshness, expiry, source, auth strength, audit identity, and a selected policy profile plus resource profile. A session receives capabilities; an account does not run.
  • Session liveness cell: mutable trusted session-manager state behind the immutable process SessionContext. It records whether the session is live, logged_out, revoked, expired, or recovery_only, plus session and policy epochs used by renewal and grant decisions.
  • Role: an RBAC label attached to accounts or sessions. It is used by a broker to decide eligibility for bundle fragments or leased grants. It is not authority after the corresponding cap is absent.
  • Workload: a process or supervised subtree launched with a concrete CapSet. It may carry session/account metadata for audit and policy, but it runs with capabilities, not as a user.

There are three account and admission sources:

  • Manifest seed accounts: immutable or append-only bootstrap records in the boot package. These create first local operators, recovery identities, service identities, emergency guest policy, and initial policy bundles.
  • Local account store: mutable disk-backed account, credential, role, attribute, quota, and resource-profile records. This is the normal source for durable local accounts after storage is available.
  • External identity admission and bindings: OIDC, passkey, cloud, deployment, or certificate-backed principals mapped to system policy profiles or existing local accounts. External claims are inputs to ABAC and account binding; they do not grant local authority by themselves.

Manifest seed data should be sufficient to boot, recover, unlock storage, and create or repair the local account store. It should not become a permanent mutable account database. Disk state should be authoritative for ordinary accounts after the account store is initialized, with explicit versioning, rollback detection, and recovery import/export.

Account Model

The first durable data model should be small and cap-shaped:

struct AccountRecord {
  recordId @0 :Data;
  principalId @1 :Data;
  kind @2 :PrincipalKind;
  displayName @3 :Text;
  status @4 :AccountStatus;
  credentialRefs @5 :List(Data);
  roles @6 :List(Text);
  attributes @7 :List(Attribute);
  resourceProfile @8 :ProfileRef;
  policyProfile @9 :ProfileRef;
  homeRoot @10 :StorageRootRef;
  createdAtMs @11 :UInt64;
  updatedAtMs @12 :UInt64;
  schemaVersion @13 :UInt32;
  storeEpoch @14 :UInt64;
  recordVersion @15 :UInt64;
  policyEpoch @16 :UInt64;
  previousHash @17 :Data;
  contentHash @18 :Data;
}

struct ProfileRef {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
}

struct StorageRootRef {
  storageServiceId @0 :Data;
  rootObjectId @1 :Data;
  rootKind @2 :StorageRootKind;
  schemaVersion @3 :UInt32;
  rootVersion @4 :Data;
}

enum StorageRootKind {
  namespace @0;
}

enum AccountStatus {
  active @0;
  disabled @1;
  locked @2;
  recoveryOnly @3;
}

struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}

homeRoot is a persistent reference that the account/storage broker resolves into a live Namespace capability at session-bundle time. It is not itself a capability, not a path, and not a raw Directory. The first implementation should use capability-native Namespace as the account home source of truth; Directory is a compatibility projection returned by a filesystem or POSIX adapter when a workload needs file-like APIs. storageServiceId names the trusted storage service instance, rootObjectId names the stored namespace root within that service, rootKind keeps the record extensible while v1 only accepts namespace, and schemaVersion lets future storage-root encodings fail closed.

External identities should bind to accounts through explicit records:

struct ExternalIdentityBinding {
  bindingId @0 :Data;
  provider @1 :Text;        # oidc issuer, cloud provider, cert authority
  subjectHash @2 :Data;     # hash(provider kind, issuer, tenant, subject)
  principalId @3 :Data;     # local or pseudonymous principal
  tenant @4 :Text;
  acceptedClaims @5 :List(Text);
  expiresAtMs @6 :UInt64;
  policyProfile @7 :ProfileRef;
  resourceProfile @8 :ProfileRef;
  schemaVersion @9 :UInt32;
  storeEpoch @10 :UInt64;
  recordVersion @11 :UInt64;
  policyEpoch @12 :UInt64;
  previousHash @13 :Data;
  contentHash @14 :Data;
}

Claims such as OIDC groups, acr, amr, tenant IDs, device posture, source network, and token age are ABAC inputs. They must be normalized before use and discarded or refreshed when stale.

Gate 0 schema-plan decisions are recorded in docs/proposals/user-identity-and-policy-proposal.md: durable account records belong in a separate account-store schema/service slice, while UserSession keeps only session/profile summaries and opaque broker result handles. Durable joins use fixed opaque binary IDs rather than display names. Disk-backed records require schema versions, monotonic store and record versions, policy epochs, previous hashes, content hashes, and compare-and-set mutation preconditions. Recovery import from manifest seed data is additive and conservative: preserve validated IDs, disable stale bindings, avoid automatic authority widening, and emit audit records or stay in bounded emergency mode.

Default Session Resources

The default resource bundle for a session backed by a local account should be useful but narrow:

  • terminal: the foreground TerminalSession for this login.
  • session: read-only UserSession or SessionContext for audit identity, auth freshness, and display.
  • home: read-write Namespace or Directory scoped to the user’s home root.
  • config: read-write user config namespace, separated from application data.
  • cache: bounded user cache namespace with eviction policy.
  • tmp: bounded per-session temporary namespace deleted at logout or expiry.
  • logs: read-only view of this user’s own session logs plus a write-only application log sink.
  • launcher: restricted launcher for approved applications and demos.
  • approval: client for requesting broker-reviewed grants.
  • credentials: self-service credential update interface that never exposes verifier material.
  • keyring: scoped secret unwrap/use interface for this user’s data classes, not raw global key export.
  • status: read-only system status with sensitive device and security state redacted unless a role grants more.

No entity should receive implicitly unbounded consumption of limited system resources. Every default bundle needs an associated ResourceProfile covering at least memory, CPU share, storage bytes, process/thread/cap counts, endpoint queue state, in-flight calls, network posture, and log volume. Ring submissions remain fixed-bound by ring depth and dispatch budget instead of a profile quota. This backlog can name the requirement, but the general resource-accounting model should be a separate design proposal because it applies to users, services, guests, anonymous callers, drivers, storage, network stacks, and test workloads.

Default guest resources should be explicitly weaker: terminal, session, ephemeral tmp/home, restricted launcher, self-contained logs, tight memory and CPU quotas, and low process/thread/cap limits. Guests should not receive durable home storage, persistent credentials, network listeners, service management, or administrative approval paths unless policy names that exception.

Anonymous remote sessions should receive almost nothing: a login/account- creation path, optional read-only documentation/help caps, the minimum auxiliary state needed for the protocol, tight memory quota, low CPU share, short expiry, and no default shell, home, launcher, network listener, durable namespace, or broad service cap. Authentication or explicit account creation is the normal path from anonymous to durable authority.

External sessions should be admitted only by explicit configuration. The configuration either maps the external subject to an existing local account, or permits auto-creation of a pseudonymous/tenant-scoped account with a named policy profile and resource profile. A federated login may receive a durable namespace only when an ExternalIdentityBinding or auto-creation rule maps it to a local principal and the provider assertion is fresh enough for that profile.

Service accounts should receive no terminal and no interactive bundle. Their default resources are measured-binary launch authority, service-specific state namespace, log writer, bounded network or IPC caps, and supervisor-approved credential/keyring usage.

Roles

Roles are bundle selectors and approval eligibility, not authority by themselves. The first role set should be conservative:

  • guest: interactive temporary session with no durable storage.
  • local-user: normal local account; owns a home/config/cache profile.
  • developer: may launch development tools, read own build logs, and request scoped test network/client caps.
  • storage-admin: may inspect and repair selected storage services and quota records, but cannot read user homes or unwrap user keys by default.
  • net-operator: may request leased network-stack and listener management caps for named services.
  • service-operator: may restart or inspect named services through init-owned supervisor caps.
  • security-auditor: may read selected audit/security logs but not user private content.
  • account-admin: may create, disable, lock, and bind accounts; cannot read credential verifier material or user homes.
  • policy-admin: may update role, ABAC, and label policy after fresh strong authentication; cannot directly mint end-resource caps.
  • recovery-operator: manifest-seeded break-glass identity with local-console and storage-recovery constraints.
  • system-updater: may update trusted boot packages, policy schema, and service packages through measured update workflows.
  • service-account: non-interactive role profile constrained by measured binary, supervisor, and service name.

External groups should not be imported as roles automatically. A binding rule may map a provider group to a local role only for a named tenant/provider, with expiry, audit, and conflict handling.

Permission Rules

Initial rules should be expressed in terms of cap bundles and wrappers:

  • No session receives raw ProcessSpawner, raw FrameAllocator, broad DeviceManager, or unrestricted StoreAdmin by default.
  • home grants are owner-scoped. Sharing returns attenuated sub-namespace or file capabilities through a broker and records the grant in audit state.
  • config writes are allowed for user-owned preferences. Security-relevant changes such as credential policy, role bindings, and external identity bindings require broker approval and fresh authentication.
  • Credential services expose verify, enroll, rotate, disable, and recovery operations. They never return password hashes, PHC verifier blobs, private passkey material, or raw MFA secrets to ordinary sessions.
  • Keyring caps expose use or unwrap operations scoped to a data class and session. Exportable key material requires a separate explicit backup grant.
  • Storage-admin repair caps should operate on volume metadata, namespace integrity, quota ledgers, and snapshots. They should not imply decrypt/read authority for user content.
  • Network listener authority is opt-in. Normal users may receive client network caps by profile; listeners require net-operator, service policy, or an application-specific grant.
  • Service management is named and leased. service-operator grants must name the service or service group and should not include arbitrary spawn authority.
  • External identity sessions are denied local administrative roles unless a local binding explicitly allows that provider, tenant, principal, role, auth-strength, and expiry.
  • Disabled or locked accounts may authenticate only to recovery flows that are explicitly allowed by account state.
  • Role changes, external binding changes, policy changes, and recovery actions emit audit events with principal, session, source, previous value, new value, policy version, and approval grant.

RBAC And ABAC Split

RBAC should answer coarse questions:

  • Which default bundle profile does this session receive?
  • Which approval requests is this session eligible to make?
  • Which service, storage, or account-management roles can appear in a grant?

ABAC should narrow or deny based on context:

  • auth strength and authentication age,
  • local console vs remote terminal vs browser companion,
  • external provider, tenant, normalized claims, and token freshness,
  • session age, account state, recovery mode, and boot mode,
  • requested capability interface, method class, target object owner, target sensitivity/integrity label, quota impact, and lease duration,
  • service package measurement and supervisor identity for service accounts.

The broker should return capabilities, wrapper caps, leases, or denials. A plain PolicyDecision.allowed = true is not authority and must not be usable outside the broker/minting path.

Username-Aware Local Password Login

The current shell-led login command is username-aware for the local password path as of 2026-04-30 02:18 UTC: it prompts username> before hidden password>, sends an account/principal selector plus proof/source metadata to SessionManager.login, and lets SessionManager choose the account-owned credential reference before minting a session. This is still a bootstrap implementation over one console verifier; disk-backed credential records and multi-verifier account storage remain future work.

Status 2026-05-01 08:47 UTC: default password-authenticated local operator sessions mint with expiresAtMs = u64::MAX; the shell renders that as expires_at_ms=never. Manifests that set a non-default operatorMs still exercise wall-clock expiry for focused stale-session proofs.

The target console UX is:

  1. login prints generic login text and prompts username>.
  2. The shell reads the account name with visible echo and bounded line length.
  3. The shell prompts hidden password> only after a submitted username.
  4. All denials print the same authentication denied. text, regardless of whether the account name is missing, disabled, locked, recovery-only, profile-incompatible, or the password proof is wrong.
  5. Setup remains explicit. A fresh-image setup path must either create the first local operator-kind account name or clearly state which volatile compatibility account owns the credential.

The implementation should change the request shape before adding user-visible multi-account behavior:

  • SessionManager.login should carry method, an account/principal selector, proof bytes, and source metadata. For the password path, the selector is a normalized local account name or opaque account ID; proof bytes remain the submitted password until a challenge/response verifier exists.
  • SessionManager verifies the bootstrap console password only after the selected manifest/default account owns credential reference console-password. Future CredentialStore record APIs should preserve the same account-owned reference rule without exposing whether the selector, account, credential reference, or verifier failed.
  • SessionManager uses the account store as the source of principal ID, display name, principal kind, policy profile, resource profile, account status, and credential references when seed account data exists. The no-store fallback accepts the normalized manifest operator seed account name when one exists, and retains operator only as the bare compatibility default when no seed account exists.
  • Default password-authenticated local operator sessions should not use fixed wall-clock expiry as their normal lifecycle. They should end through explicit logout, terminal/connection/process-tree close, or administrator revocation; configured hard maxima remain opt-in policy for proofs or deployments that require them.
  • AuthorityBroker.shellBundle continues to derive the shell bundle from the minted session’s policy and resource profiles; it must not trust the typed username as authority after session minting.

Audit and redaction rules are part of the contract:

  • Failed pre-auth attempts record only a terminal-local event ID, source class, generic password-denied or password-unavailable reason, auth method, and volatile flag. They leave principal, account, profile, and session fields blank so account enumeration is not possible through logs.
  • Successful login records stable principal/session/profile metadata from the minted session, not from raw username text. The password proof, verifier, credential reference secret, and full terminal line never appear in audit, kernel logs, QEMU transcripts, or panic text.
  • Wrong username and wrong password should have indistinguishable terminal text and audit shape except for terminal-local event IDs and timing/backoff.

Migration from the existing seeded operator password is explicit:

  • kernelParams.consolePasswordVerifierPhc maps to a manifest seed account named operator with a stable credential reference such as console-password when no richer seed account owns that verifier.
  • If the manifest already declares a seed account with that credential reference, the verifier belongs to that account and no synthetic account is created.
  • The shell accepts operator as the username for migrated manifests. A wrong or unknown username follows the same denial path as a wrong password.
  • Setup-created credentials remain volatile until disk-backed account storage lands; the prompt and audit record must keep saying so.
  • Documentation and smoke transcripts should stop treating a bare password as sufficient identity once the username-aware flow lands.

Required proof coverage for the first implementation slice:

  • make run-login and make run-smoke prompt for username> before hidden password>, accept operator plus the existing demo password, and reject a wrong username and wrong password with identical terminal denial text.
  • make run-login-setup covers first-credential setup and then username-aware login for the resulting account or the default migration account.
  • make run-local-users proves manifest-backed operator account lookup, resource/profile inheritance, and account-status denial without exposing account existence in failed audit records.
  • Host tests cover manifest migration from consolePasswordVerifierPhc to the default seed account, duplicate credential-reference rejection, and normalized account-name lookup.

Ordered Backlog

Gate 0: Grounding And Schema Plan

  • Update the identity proposal with this manifest/disk/external account model once the current Telnet milestone no longer owns serial focus.
  • Update identity docs to use principal, account, session, and profile consistently, reserving user for human-facing prose.
  • Publish the terminology in user-facing mdBook pages, not only this backlog. At minimum update docs/overview.md, docs/capability-model.md, docs/proposals/user-identity-and-policy-proposal.md, and docs/proposals/oidc-and-oauth2-proposal.md, and docs/security/trust-boundaries.md so readers encounter the same terms from normal documentation entry points.
  • Decide whether account records live in the existing user-identity schema slice or a separate account-store schema slice.
  • Define stable IDs for local principals, external bindings, resource profiles, storage root references, and policy versions.
  • Define rollback and version checks for local account-store records.
  • Add design notes for how recovery imports a damaged or missing local account store from manifest seed data.
  • Write the cross-cutting limited-resource and quota proposal before treating any guest, anonymous, local-account, service, or external profile as complete.

Gate 1: Manifest Seed Accounts

  • Extend boot/init config with manifest seed accounts, service accounts, resource profiles, and initial role bindings.
  • Validate that seed account names, principal IDs, roles, resource profiles, and credential references are unique and resolvable.
  • Reject manifests that grant ordinary users privileged kernel caps directly instead of broker-mediated policy inputs.
  • Add host tests for duplicate principals, missing resource profiles, invalid bootstrap roles, and service-account/binary mismatches.
  • Add a QEMU smoke that boots a manifest-seeded local operator and proves the session receives only the expected default bundle.

Gate 2: AccountStore And ResourceProfile Services

  • Add AccountStoreReader and AccountStoreManager userspace interfaces for lookup, create, disable, lock, role binding, external binding, and profile updates while keeping read and mutation authority separate.
  • Add ResourceProfileReader and ResourceProfileManager userspace interfaces, keeping mutation authority separate from session reads.
  • Implement a RAM-backed prototype for account records and resource profiles before durable storage.
  • Add broker integration that assembles default local-account, guest, anonymous, external, and service-account bundles from account/profile records. - [x] Add the config-side default bundle planner over account/profile records for a follow-up AuthorityBrokerCap wiring slice. - [x] Wire the bootstrap AuthorityBrokerCap shell-bundle path to the config-side planner for manifest-backed local/operator sessions. - [x] Add manifest-backed guest identity/planner wiring for shell bundles and QEMU proof coverage without preserving a bootstrap guest fallback. Guest sessions now require an explicit manifest seed, guest shell bundles receive no default service endpoints, and guest launchers are empty unless a resource-profile launcher posture names a narrow proof binary. - [x] Add a local-users QEMU proof that the initial anonymous shell bundle is minimal before password login. - [x] Wire SessionManager.sshPublicKey through RamAccountStore so SSH-minted sessions inherit account-status enforcement. SSH denial causes are exposed as stable auth= audit codes (ssh-account-missing, ssh-account-disabled, ssh-account-locked, ssh-account-recovery-only, ssh-account-lookup-failed, plus the existing key/signature/ profile codes); failed records keep principal/profile blank by policy. End-to-end QEMU proof of the account-status denial paths waits for AccountStoreManagerCap as a kernel cap source (Gate 2 follow-up below). - [x] Migrate local password login planning into schema and proof work: add a username-aware SessionManager.login selector, move password verification to account-owned credential references, preserve anti-enumeration audit/terminal behavior, and keep the existing single seeded operator password as an explicit operator account migration path. Implemented 2026-04-30 02:18 UTC as prioritized ad-hoc work; durable multi-verifier credential storage remains future Gate 2 account-store work. - [ ] Add AccountStoreManagerCap and ResourceProfileManagerCap as kernel cap sources so a focused QEMU demo can disable an account and prove SessionManager.sshPublicKey rejects with auth=ssh-account-disabled. This is also the prerequisite for external-binding admission tests below.
  • Complete mutable session lifecycle methods before treating short session expiry as production shell UX. The first live/logged_out liveness cell and UserSession.logout path is implemented for SessionManager-minted sessions, including explicit remote DTO gateway logout and owned-session connection-close propagation. Remaining work includes owner-shell exit, terminal close, administrator revocation, renewal/recovery, full audit reason separation, and in-flight endpoint result cancellation after logout. SessionManager.renew should extend or rotate a session only after account status, auth freshness, policy/resource profile epochs, requested duration, and revocation state pass. Renewal must mint fresh grant leases or wrappers when policy needs a new decision and must not silently revive stale ordinary grants.
  • Add host tests proving account-admin cannot read homes, credential verifier material, or key material through account-management caps.

Gate 3: Disk-Backed Local Account Store

  • Store account records, credential references, resource profiles, role bindings, external bindings, and policy-version metadata in capability-native Store/Namespace records.
  • Add atomic update or compare-and-set semantics for account mutations.
  • Add monotonic version/epoch checks to reject stale or replayed account records after reboot.
  • Add local snapshot/export records for recovery and rollback inspection.
  • Add QEMU reboot proof that a created local account, role binding, disabled state, and home namespace survive restart.

Gate 4: Default Resource Bundles

  • Implement bundle construction for guest, local-user, developer, service-account, and anonymous profiles.
  • Allocate per-account home, config, cache, and per-session tmp namespaces through storage caps instead of synthetic path strings.
  • Add quota checks for home bytes, temp bytes, processes, threads, caps, memory, CPU share, endpoint queue state, in-flight calls, and log volume.
  • Add QEMU proof that two local accounts receive different home/config namespaces for the same application binary.
  • Add QEMU proof that guest and anonymous sessions cannot persist data or request durable home caps.

Gate 5: RBAC Runtime

  • Implement a RoleDirectory backed by account-store role bindings.
  • Map roles to named bundle fragments and approval eligibility.
  • Add policy tests for account-admin, policy-admin, storage-admin, net-operator, service-operator, security-auditor, and recovery-operator.
  • Add deny tests showing roles alone do not authorize capability calls after the relevant cap is absent or revoked.
  • Add audit records for role grant, role removal, and role-derived bundle issuance.

Gate 6: ABAC Runtime

  • Define the first PolicyRequest context fields for auth freshness, source, external provider/tenant, object owner, object label, requested interface, method class, quota impact, and lease duration.
  • Prototype the PolicyEngine boundary with a small in-repo evaluator or Cedar-backed host-side prototype hidden behind the same interface.
  • Add ABAC tests for fresh-auth requirements, remote-vs-local denial, provider/tenant scoping, maintenance windows, service measurement, and storage label constraints.
  • Ensure PolicyDecision cannot be used directly by callers; only a broker may turn it into capabilities or leases.
  • Add QEMU proof that a stale authenticated session can keep ordinary home access only through policy-explicit recovery/renewal state and cannot obtain a privileged leased cap until renewal or reauth mints fresh grant leases.

Gate 7: External Users

  • Implement external identity binding records keyed by provider and subject hash, with tenant and expiry.
  • Normalize OIDC/passkey/certificate/cloud claims before they enter policy requests.
  • Add explicit external admission configuration. It must either bind an external subject to an existing account or permit auto-creation with a named policy profile and resource profile.
  • Add an external pseudonymous account profile with bounded temp storage, bounded durable storage only when configured, and no local administrative roles.
  • Add explicit local-account binding flow for external users that need durable local home storage.
  • Add tests rejecting stale tokens, wrong tenants, unmapped provider groups, disabled bindings, and external attempts to assume local admin roles without a binding rule.
  • Add default-deny admission tests for absent external admission config, auto-creation disabled, and unknown policy/resource profile names.

Gate 8: MAC/MIC And Labels

  • Attach confidentiality and integrity labels to account profiles, session profiles, namespaces, logs, secrets, and service accounts.
  • Implement wrapper caps for read-like, write-like, control-like, and transfer-like method classes where labels affect the grant.
  • Add tests for no-read-up, no-write-down, integrity write/control, and trusted-subject exceptions.
  • Decide whether any label/hold-edge metadata must become kernel-visible for mandatory transfer rules, or whether broker and wrapper enforcement is sufficient for the first implementation.

Gate 9: POSIX Profile Adapter

  • Add POSIX profile metadata for uid/gid/user name/group name/home path as compatibility data derived from account records.
  • Ensure setuid, chmod, and ownership metadata cannot grant caps outside the compatibility filesystem service.
  • Add tests proving POSIX metadata changes do not widen cap bundles.

Verification Gates

  • Host tests for manifest validation, account-store mutation policy, role mapping, ABAC request construction, external binding normalization, and audit emission.
  • QEMU smokes for manifest-seeded operator login, two-account namespace separation, guest/anonymous persistence denial, disk-backed account survival across reboot, external pseudonymous login, and stale-session privileged-grant denial.
  • Documentation updates to docs/security/trust-boundaries.md, docs/proposals/user-identity-and-policy-proposal.md, and storage docs before any implementation is treated as selected milestone work.

Shared-Service Demo Backlog

Detailed decompositions and design notes for chat, adventure, and federated service demos. docs/tasks/README.md links here but should not inline these subtasks.

Design Notes

Multi-process userspace applications exercise the resident-server plus shell-spawned-client pattern on top of the completed boot-to-shell, Endpoint, ProcessSpawner, and session-bound invocation substrate. Chat has migrated to service-scoped caller-session identity, and Aurelian ordinary player state is also keyed by live caller-session metadata. The focused text adventure manifest uses session-bound service grants for player, NPC, Adventure, and chat paths. Reuse is extracted after the second service lands, not speculatively. Federation is blocked on future network-transparency proof work.

The authoritative migration gates for removing caller-selected shared-service identity now live in docs/backlog/session-bound-invocation-context.md under Gate 4. The older service-object migration backlog is historical background only unless the selected milestone changes again.

The first slice keeps chat and adventure usable as ordinary spawned commands over generic Endpoint grants plus explicit StdIO for terminal I/O. The shared demos/capos-chat crate owns typed request/response DTOs for the prototype bridge, while the top-level shell/ crate owns generic process commands such as spawn, blocking run, wait, and grant parsing. The StdIO clients are a smoke harness and compatibility path, not the target capOS-native command boundary; native interactive apps should later expose command surfaces as described in docs/proposals/interactive-command-surface-proposal.md.

Room-scoped MUD speech (say, tell) maps naturally onto chat channels, so adventure should consume the chat service rather than reimplementing pub/sub. Keep the Adventure schema for world state and verbs that are not speech; route say/tell/NPC dialog through Chat subscriptions scoped to room channels.

Chat Follow-Ups

Completed context:

  • MVP Chat endpoint interface and event variants. The original receiver-selector identity MVP has since been replaced by service-scoped caller-session keys for normal chat membership.
  • Public chat lobby stays #lobby; adventure room speech uses hierarchical #room/<world>/<room> channels, with the demo world under #room/demo/<room>.
  • Client poll() is used for MVP event delivery; foreground client drains queued events before prompting again.
  • demos/chat-server/ scaffold with capos-rt entry and bounded per-channel history ring.
  • join/leave/send/who plus fan-out for the legacy endpoint-metadata MVP.
  • demos/chat-client/ over explicit capos-rt StdIO plus chat endpoint client.
  • Chat stays out of native shell builtins and runs as a spawned command with omitted-badge stdio: client @stdio and chat: client @chat grants.
  • make run-chat smoke: shell-spawned client sends a line through the resident service, resident bot observes it and replies, foreground client prints the reply.

Remaining:

  • Migrate chat from legacy endpoint receiver-metadata identity to service-scoped caller-session keys. ChatRoot possession authorizes join attempts; membership, channels, sends, leaves, and polls key off the live opaque caller-session reference instead of a caller-selected selector.
  • Add per-principal state keyed by UserSession.info, admin-only verbs, typed denial results, and redacted audit records per join/leave/send.
  • Defer a distinct Subscription interface until federation or native command surfaces need a separate event authority object.

Client-server interface sync audit (2026-05-03 19:02 UTC):

  • Walk the recent chat-server interface and behavior changes against the demo chat clients to confirm no drift. Audit covered six commits (5dc0e8ca /exit banner alignment, e7d0e00d [history] text label plus EventKind::History, 45384fa0 server-assigned member-N sender labels in place of the caller-supplied join handle, 7bb90528 idempotent re-join, dc7ece49 membership keyed by caller session, and f5eab276 EndpointUserData rename) and confirmed demos/chat-client/, demos/capos-chat/, demos/chat-bot/, and demos/chat-observer/ already track each one. The chat-client banner lists /exit and accepts both /exit and /quit; ChatEventKind decodes all five schema variants and renders the server’s [history] <text> prefix verbatim while the chat-observer reports kind=history vs kind=live separately; event.sender is always taken from the server response so server-assigned member labels are shown without modification; re-join goes through leave_wait followed by join_wait with no stale “already joined” assertion; chat-server session keying and the neutral endpoint user-data name are entirely server-side. Verified live with make run-chat (exit 0): the smoke transcript shows [chat] /join <channel>, /leave, /who, /exit, or plain text, [chat] #lobby <member-2> hello from shell, and [chat] #lobby <member-1> [chat-bot] echo-bot heard you. — i.e. the /exit banner, server-assigned member-N labels, and pass-through of message text all behave as expected. The [history] label path is not exercised by make run-chat (chat smoke joins fresh) but is covered by make run-adventure via assert_adventure_npc_chat_history_actor in tools/qemu-shell-smoke.sh.
  • Latent demos/chat-bot/ self-echo filter resolved in 7b9c5993 by skipping ChatEventKind::History events so replayed [history] [chat-bot] ... messages no longer slip past the prefix-only check.

Adventure Follow-Ups

Completed context:

  • MVP Adventure interface for non-speech verbs and room views.
  • Legacy receiver-selector layout distinguishes player from NPC authority on both a future PlayerSession and room chat channels. MVP manifests reserve low selectors for shell players (chat=1, adventure=2) and service/NPC authority at 100+.
  • Rooms map to chat channels as #room/<world>/<room-id>, with the demo world under demo.
  • demos/adventure-server/ scaffold with a small room graph, typed world verbs, live caller-session keyed player state, and chat channel metadata.
  • demos/adventure-client/ as a spawned command over explicit StdIO, adventure, and chat endpoint grants.
  • Adventure StdIO parser remains prototype-scoped and should later be replaced with a command surface exposing nested paths such as go, take, drop, inventory, say, and chat join.
  • make run-adventure smoke: scripted player moves rooms, completes one state-changing world action, and exits cleanly through the shell-spawned client.
  • NPC-as-process fleet: one process per NPC, each holding manifest-issued legacy player/adventure receiver metadata plus chat endpoint authority for room dialog.
  • At least two concrete NPCs ship with liveness asserted in the adventure smoke.
  • NPC process exit surfaces as ProcessHandle completion on the server side.

Game-depth follow-up:

  • Decompose the Aurelian Frontier proposal through docs/backlog/aurelian-frontier.md rather than expanding this shared service harness backlog with content, combat, economy, and multiplayer details.

Session-bound identity follow-up:

  • Finish adventure NPC and service-authority cleanup for the focused shared-service proof now that ordinary player state is keyed by live caller-session metadata. NPC service authority is broker/manifest-issued rather than caller-chosen, and the focused adventure manifest uses the already session-keyed chat service through ordinary chat authority.

Shared Harness Extraction

Completed context:

  • Extract duplicated legacy endpoint receive/release/return loop used by chat and adventure resident services into demos/service-common/.
  • Defer a shared bounded event queue until chat history/inbox and adventure/NPC event needs converge. Current evidence: only chat-server has bounded history/inbox queues; adventure room state and NPC polling do not expose a matching queue abstraction.
  • Extract bot/NPC client scaffolding shared by chat bot and adventure NPC processes.
  • Extract shared chat actor polling loop used by chat-bot, wanderer, and shopkeeper while keeping each actor’s cap validation, join/greeting, reply text, and exit logging local.
  • Extract shared chat actor bootstrap for required console and chat caps plus the single-owner ring client, while preserving actor-specific failure text and behavior setup.

Federated Chat Milestone

Blocked on future network transparency.

Extend chat across hosts after a separate proof shows cap transport crossing machines. This integration test exercises networking, TLS, OIDC, key-management, and audit proposals together.

  • Define cross-host addressing (@user@host, #room@host) and record it in schema.
  • First cross-VM channel smoke: two QEMU instances, one message delivered across TLS.
  • Federated audit: per-host records plus signed cross-host event trail.

Paperclips Terminal Demo Backlog

This backlog tracks future expansion of the clean-room Paperclips terminal demo described in Paperclips Terminal Demo. It is not the current selected milestone.

The clean-room mechanics baseline is recorded in docs/research/paperclips-clean-room-functional-spec.md. Use that note as the planning source for gameplay behavior. Do not copy source-game implementation identifiers, text, assets, generated tables, exact balance, CSS, or code when expanding content.

Current runnable status: the focused Paperclips manifest now boots an authoritative Paperclips server and a terminal client. The server owns generated content, resources, GameState, proof-command gating, unlock checks, and game-rule mutation. The server owns regular timer cadence and exposes the current command list and unlocked projects as structured data so server-mode terminal clients render plain help and projects from server state. Server-mode terminal clients also render plain status from the server’s PaperclipsStatusSnapshot, while status --json remains proof-only and server-gated. Follow-up work should move unlocked command facets behind server-issued capabilities so later terminal and web clients do not reimplement rules.

Design grounding for the client/server, structured command-list, structured plain-status, and structured project-list slices: docs/demos/paperclips.md, docs/research/paperclips-clean-room-functional-spec.md, docs/architecture/ipc-endpoints.md, docs/architecture/capability-ring.md, docs/proposals/session-bound-invocation-context-proposal.md, and docs/proposals/system-info-proposal.md. No other research note applies directly because this slice uses the existing endpoint/ring transport and does not introduce a new external OS/runtime protocol.

Current Baseline

Implemented:

  • Clean-room terminal implementation inspired by the public Paperclips premise without copying original game code or assets.
  • make run-paperclips boots a focused manifest, launches Paperclips server services plus a terminal client through the shell, grants explicit StdIO plus a PaperclipsGame endpoint to the terminal client, grants Timer to the server, drives the first production loop, proves project chains and proof gating, and exits cleanly.
  • In the focused manifest, game state is local to the Paperclips server process and disappears when that server exits. Direct standalone launches without a game endpoint retain the older in-process fallback.
  • The default system.cue manifest still advertises the standalone fallback launch with run "paperclips" with { stdio: client @stdio, timer: @timer } because it does not start the Paperclips server. The structured command-list, status-snapshot, and project-list methods only change server-mode client rendering, so no MOTD/default-manifest text change is needed for this slice.
  • The pure rules layer lives in demos/paperclips-content and is host-testable separately from the terminal adapter.
  • Paperclips content is authored in CUE, converted through pinned mkmanifest cue-to-capnp into the Paperclips-specific Cap’n Proto schema, checked in as generated Rust bytes, then deserialized through typed Paperclips schema bindings at runtime.
  • Core game balance and content live in CUE: initial state, purchases, projects, unlock effects, production rates, millisecond intervals, currency formatting, price limits, trust thresholds, and phase transition values.
  • Manual make produces one clip only; counted manual make requests are rejected.
  • Automation advances from the Timer capability in real time, while run <ms> is reserved for focused proof launches with an explicit proof_accelerator cap.
  • Opening business loop has dynamic demand plus CUE-owned raw-material bundle pricing, slower market updates, purchase pressure, and generated content freshness checks.
  • Business-phase explicit sales are time-aware: successful sales start a CUE-owned cooldown, repeated immediate sell <n> commands are refused without mutating state, and Timer/proof time advancement clears the cooldown.
  • Focused QEMU transcript now demonstrates manual work and explicit sales funding Autoclipper License, one repeatable economic choice, one wire purchase, and completion of precision-rollers with a visible autoclipper-count effect.
  • Focused QEMU transcript now also demonstrates representative Stage 1 refusal output: an early locked buy autoclipper, an insufficient-funds buy wire 1000, pending manual work, bulk manual rejection, and locked project survey-drones, plus a high-price sell 1 demand refusal and a sell 2 requested-count sale capped by one available clip, plus a no-wire manual production refusal after automation drains the available wire.
  • Focused QEMU transcript now proves a Stage 2 project chain after repeatable marketing investment and scaled business-phase production: autoclipper-license, precision-rollers, design-search, forecast-engine, and survey-drones, ending at == autonomous phase ==.
  • Stage 3 autonomous rules now use CUE-owned millisecond intervals for drone local-matter conversion, factory wire consumption, probe cosmic matter conversion, and probe replication caps. Host tests cover the resource caps, scaling projects, cosmic replication, completion gating, and validation for the new rule fields.
  • Focused QEMU transcript now continues after == autonomous phase == to complete material-harvesters and foundry-lines, run milliseconds, and assert visible drone/factory counts plus local-matter conversion and additional clip production.
  • Focused QEMU transcript now closes the representative late-game proof: after the autonomous/factory proof it completes mesh-coordination, transitions through seed-probes into == cosmic phase ==, asserts visible probe replication plus cosmic-matter conversion and clip production, then leaves final-conversion locked. make run-paperclips is a representative transcript, not an exhaustive playthrough.
  • Public player launches no longer expose fast-forward. run <ms> is hidden from normal help output and refused unless the launched process receives the explicit proof_accelerator capability used by the focused QEMU proof manifest. The shell rejects attempts to mint that authority by renaming an ordinary @timer grant.
  • Player-facing project ids, labels, title text, completion text, and Strategy resource wording have been renamed away from distinctive source-game terms.
  • Active schema, CUE content, Rust rules, generated-content guardrails, and focused smoke assertions use clean-room Strategy internals rather than source-game resource identifiers.
  • Purchase parsing treats omitted counts as one and rejects explicit zero counts without mutating game state.

Known limits:

  • There is no save/load path; process exit discards game state.
  • The focused QEMU proof stops at the cosmic production milestone. It does not prove a compact full win; host coverage checks that the final conversion cost exceeds a generous one-hour normal-play creativity upper bound.

Clean-Room Gameplay Stages

Stage 1, opening business loop:

  • Add host rules tests for manual production pacing, wire depletion, explicit sales, price-sensitive demand, marketing/demand investment, and automation intervals.
  • Extend the focused transcript so existing early progression shows manual work, one economic decision, one automation purchase, and one project unlock without copying external balance values.
  • Keep representative refusal output legible in the focused QEMU transcript for missing funds, pending manual work, bulk manual production, locked purchases, and locked projects.
  • Add focused transcript cases for missing wire and demand/sale refusals. The QEMU proof now asserts No wire available. and No demand at current price. without changing game balance.
  • Add any remaining Stage 1 sale-limit cases that need end-to-end transcript proof. The QEMU proof now asserts the unique sell 2 window starts with one available clip, ends with zero available clips, and increments Sold from 1 to 2.

Stage 2, data-driven project chain:

  • Expand original CUE project content around generic effects: production multiplier/resource grant, demand policy change, compute-resource generation, strategy resource unlock, capacity grant, and stage transition. Generated CUE content now covers production/resource grants, public-demand grants, operations grants, Strategy unlock/resource grants, processor/memory capacity grants, and stage transitions. Direct trust grants remain unsupported because available trust is recomputed from clip milestones minus spent trust.
  • Replace per-project Rust effect variants with one generic CUE-backed loader/evaluator. Project completion now applies generic production and resource grants, public-demand grants, compute and strategy-resource grants, design/Strategy unlock flags, capacity grants through processors/memory, and one-step stage transitions without matching gameplay on project ids or effect kinds. Direct trust grants remain unsupported because available trust is recomputed from clip-count milestones and trust spent; adding a trust field would be misleading without changing that invariant. Generated-content tests now verify the checked-in CUE payload exercises every currently supported generic category that fits the model.
  • Add current-model validation for project graph bounds before adding more content. Host tests now reject invalid/empty ids, duplicate ids, too many projects for the completion bitset, zero-cost projects, no-op or zero grant effects, out-of-stage effects, stage-transition regressions, and missing transition paths from business to autonomous, cosmic, and complete.
  • Add explicit prerequisite and cyclic unlock-chain validation after the project schema grows named prerequisite/unlock edges. Project now carries data-only named prerequisites from CUE through the typed schema; generated content records the intended business, autonomous, cosmic, and completion unlock chain; runtime availability requires completed prerequisites in addition to stage and cost gates; and host validation rejects missing, malformed, self-referential, duplicate, and cyclic prerequisite edges.
  • Add focused smoke coverage for at least one project unlock chain after repeatable demand investment, including a phase transition out of the business phase.

Stage 3, autonomous and completion mechanics:

  • Model later-stage autonomous production with independently authored labels and bounded rules for resource conversion, factory/drone-style scaling, exploration or replication capacity, and completion progress.
  • Add host tests for stage transition predicates, autonomous production cadence, trust/capacity limits, and the completion condition.
  • Keep QEMU coverage representative rather than exhaustive: prove one transition and one timer-driven later-stage action, then rely on host rules tests for full playthrough cases. The transcript now covers one autonomous timer-driven conversion, one cosmic transition/probe interval, and locked completion-stage availability without scripting every late-game purchase.
  • Split proof acceleration from player gameplay. Normal interactive Paperclips sessions should advance only from the granted Timer capability; run <ms> should either be removed from the player-visible command set or gated behind an explicit harness-only authority that normal shell users cannot mint accidentally. help and docs should stop presenting fast-forward as a regular player command. Implemented by requiring the proof_accelerator cap for terminal fast-forward and by proving the normal launch refusal before the accelerated QEMU proof path.
  • Rebalance the completion path after fast-forward is no longer public. A normal player should not be able to reach == complete phase == within one real-time hour. Keep the smoke proof representative by stopping at selected milestones or by using a clearly test-only acceleration path, rather than shrinking late-game matter and project costs until a full win fits in the QEMU transcript. Implemented by increasing seed-probes cosmic matter/wire scale, raising final-conversion clip and creativity costs, adding host coverage for the one-hour creativity bound, and changing the focused QEMU proof to stop after cosmic probe replication/production with final-conversion still locked.
  • Add coverage for the gameplay/test-mode split: host rules should still test bounded millisecond advancement directly, but the terminal adapter should prove that normal player input cannot invoke fast-forward. If a harness-only accelerator remains, the focused QEMU proof must demonstrate that it is tied to an explicit proof capability or proof manifest, not ambient player authority. The focused QEMU proof first launches Paperclips with StdIO plus the normal PaperclipsGame endpoint, asserts run <ms> is refused, asserts a forged proof_accelerator: @timer grant is rejected, then relaunches against the proof server endpoint with proof_accelerator for the accelerated transcript.
  • Make business-phase sales time-aware. Repeated immediate sell <n> commands should not bypass demand cadence; model a replenishing demand budget, outstanding orders, or a sell cooldown backed by Timer advancement, and keep host/QEMU coverage for sale-limit refusals. Implemented with a CUE-owned sale cooldown and QEMU coverage for the immediate repeat-sale refusal.
  • Tighten clean-room naming. Replace player-facing names and text that mirror the source game’s title or distinctive project labels with independently authored names while preserving the generic paperclip maximizer premise.

Stage 4, persistence and assertions:

  • Add a compact status --json or equivalent machine-readable command only if future smoke tests need stronger assertions than the human transcript. status --json now emits one deterministic compact JSON object with numeric game-state fields, and the focused QEMU proof asserts a late-game machine-readable status line without dropping the human transcript checks.

Blocked on platform persistence:

  • Add save/load or restart-resume behavior after capOS has a durable user storage path appropriate for spawned demos.
  • Keep saved state scoped to this child process or an explicitly designed storage capability; do not introduce ambient filesystem or service state.

Schema-Aware Content Migration

Completed:

  • Defined schema/paperclips-content.capnp as a bounded data-only schema for initial state, rules, purchases, trust milestones, projects, costs, and project effects. It contains no live capOS capabilities or interface objects.
  • Kept demos/paperclips-content/content/paperclips.cue as the authoring source, now matching the PaperclipsContent schema root directly.
  • Converted generated content with pinned tools through mkmanifest cue-to-capnp, then rendered checked-in aligned Rust bytes from the schema-validated binary.
  • Updated paperclips-content runtime loading to deserialize the typed Paperclips Cap’n Proto message instead of capos_config::CueValue.
  • Wired the freshness check into make generated-code-check through generated-paperclips-content-check.

Remaining guardrails:

  • Keep generated content as schema-validated binary data; do not add runtime CUE parsing to the demo.
  • Keep the focused QEMU transcript representative: one launch, one production loop, one automation purchase, one early project unlock/effect, and clean exit. Cover larger rule validation with host tests.
  • Continue using the Rust validator for semantic bounds that Cap’n Proto cannot encode directly, such as project count, id shape, graph reachability, and nonzero costs/effects.

Client/Server Architecture Backlog

Goal: migrate Paperclips from one terminal process into an authoritative server plus thin clients. The terminal client should render output, parse player commands, and invoke server capabilities; it should not own GameState, timer advancement, proof acceleration, unlock checks, or game-rule mutation.

Staged tasks:

  • Define the first coarse Paperclips server/client schema. The initial PaperclipsGame endpoint covers initial text, command text, command results, proof-only explicit time advancement, and the first structured command-list and plain-status queries. Regular automation is driven by the server’s own timer capability. Session creation, broader structured state/events, project/purchase listings, and capability transfer points remain future protocol work.
  • Add the first Paperclips server process. It owns GameState, generated content, proof-command gating, unlock checks, and mutation rules while preserving current clean-room mechanics and host rules coverage.
  • Convert the terminal Paperclips process into a client when a game endpoint is present. The client keeps stdio, blank-command repeat, and transcript handling, then routes commands to the server. It still accepts server-rendered text in this first slice.
  • Move regular game timer cadence into the Paperclips server. Server-mode terminal clients still receive a timer grant so they can poll and display server-generated status messages while the player is idle at the prompt.
  • Add a structured command-list protocol method. The Paperclips server reports the commands available for the current state/session, including proof-only and later-stage commands only when they are actually available, and server-mode terminal clients render help from those command specs. Command execution remains the existing text request path in this slice.
  • Add a structured status snapshot protocol method. PaperclipsGame.status returns PaperclipsStatusSnapshot with the fields needed to render the existing plain status transcript, and server-mode terminal clients format that snapshot locally instead of relying on server-formatted status text. status --json remains proof-only instrumentation decided by server-side authority and is not exposed through normal structured status.
  • Add a structured project-list protocol method. PaperclipsGame.projects returns unlocked project entries with id, label, description, rendered cost, and status markers so server-mode terminal clients render plain projects locally from server-provided state. project <id> execution remains the existing text request path and still mutates server-owned game state.
  • Split command parsing and presentation more cleanly. The terminal client can now render help from structured server command specs, plain status from structured server snapshots, and plain projects from structured server project lists, but it should eventually parse player command syntax and render broader structured server state/events instead of sending raw command strings and displaying server-formatted command results.
  • Model unlocks as server-issued facets or command capabilities. Early stages may expose coarse play, project, purchase, and proof facets; later stages should narrow toward facet-per-command authorities once capability transfer ergonomics are ready.
  • Keep proof acceleration explicit. The server, not the client, should decide whether proof-only commands such as millisecond advancement and machine-readable status are available for a session.
  • Update make run-paperclips to prove the first split in QEMU: shell launches server services plus terminal clients, normal and proof sessions use different server endpoints, proof-only commands remain gated by server authority, and the existing representative transcript still exits cleanly.
  • Extend make run-paperclips after command facets land: prove the client cannot mutate state locally beyond server-granted facets and that unlock/facet changes are visible.
  • Add the later web shell/client path after the server protocol is stable. A browser-facing client or gateway should share the same Paperclips game capabilities as the terminal client instead of reimplementing game logic.

Deferred:

  • Add durable save/load only after capOS has a durable user storage capability appropriate for spawned demos.
  • Split every gameplay command into a distinct transferred capability only when the platform has ergonomic capability transfer and revocation patterns for short-lived command facets.

Run Targets, Init Mandate, And Default-Run Integration

This backlog captures three intertwined make-target and manifest-policy requirements raised against the current Makefile and system-*.cue set. They share manifests, harness scripts, and review surface, so they should land as one mainline track rather than scattered fixes.

Policy Statements

  1. make run-* targets only start QEMU. Any scripted input driving, transcript assertion, timeout-based pass/fail, log greps, or harness script wrapping must live outside the run-* recipe – either in a sibling test-* target or in a host harness invoked by the user directly.
  2. init usage is MANDATORY in every boot manifest. The boot init binary must be init (the capos-init ELF). Service or demo binaries such as capos-shell, credential-store, terminal-session, network-client, revocable-read, memoryobject-shared-parent, and per-demo entrypoints must be declared as services and launched by init, never as the top-level init binary.
  3. make run stays the default user-facing target demonstrating a sane, safe, full-featured (as of the current state) capOS instance. When a milestone introduces a user-visible common service or binary, it must be integrated into make run – either auto-started or advertised through MOTD instructions describing how the operator reaches it – as part of the milestone’s doc-update gate.

Current State

run-* recipes that contain test logic

Snapshot from Makefile at branch base. All targets in this list embed input drivers, asserts, or harness invocations and therefore violate policy 1:

  • run-smoke, run-uefi, run-net
  • run-spawn, run-shell, run-restricted-shell-launcher
  • run-chat, run-adventure, run-terminal
  • run-credential, run-login, run-login-setup, run-local-users
  • run-tcp-listen-authority
  • run-revocable-read, run-memoryobject-shared
  • run-ssh-host-key, run-ssh-authorized-key, run-ssh-public-key-session, run-ssh-public-key-auth, run-ssh-feature-policy
  • run-ringtap-failing-call
  • run-measure

(run-network-client, run-telnet(-vm), and run-ssh-gateway-terminal-host(-vm) were on this list but are now exit-2 retirement stubs with no test logic, retired with the kernel socket owner.)

Compliant run-* recipes (QEMU-only):

  • run – interactive, manifest-driven, terminal on stdio.
  • run-display – interactive variant with QEMU display.

Manifests violating the init mandate

Init binary is something other than init:

  • system-smoke.cue – init binary capos-shell
  • system-shell.cue – init binary capos-shell
  • system-login.cue – init binary capos-shell
  • system-login-setup.cue – init binary capos-shell
  • system-local-users.cue – init binary capos-shell
  • system-credential.cue – init binary credential-store
  • system-terminal.cue – init binary terminal-session
  • system-revocable-read.cue – init binary revocable-read
  • system-memoryobject-shared.cue – init binary memoryobject-shared-parent

Manifests already compliant: system.cue, system-adventure.cue, system-chat.cue, system-spawn.cue, system-measure.cue, system-restricted-shell-launcher.cue, system-tcp-listen-authority.cue, all remaining system-ssh-*.cue (system-telnet.cue, system-network-client.cue, and system-ssh-gateway-terminal-host.cue are removed with the kernel socket owner).

Default-run feature integration gap

make run boots system.cue, which already wires the anonymous shell, the login flow with the seeded password verifier in MOTD, the chat/adventure demos, chat/adventure spawn instructions, the host-local remote-session CapSet gateway, and (as of 2026-05-14 09:07 UTC) the self-served remote-session-web-ui service. The Telnet research demo is retired (the focused make run-telnet / system-telnet.cue path and its gateway demo are removed with the kernel socket owner). Milestones still absent from the default path or its MOTD are local-user setup, terminal-session focused proofs, SSH gateway terminal host, and any future SSH shell milestone.

The default make run recipe now attaches virtio-net with host-local remote CapSet forwarding to guest port 2327 and host-local web UI forwarding to guest port 8080. Both use the same ?=-overridable host port with fallback-to-free-port behavior implemented in tools/qemu-run-hostfwd.py. Other network-backed milestones, such as the SSH gateway terminal host and future SSH shell, still require their own safe default forwarding or an explicit deferral before they can be called integrated into make run.

Open Gates

Gate A: Naming and contract

  • Decide the rename split. Pick one of the two consistent options and apply it uniformly: - Strict: run and run-display are the only run-* entrypoints; every other current run-* recipe (including run-uefi, run-net, run-measure) becomes test-* regardless of whether its body is reduced to a plain QEMU start, because the policy is enforced by name, not by current contents. - Permissive: any QEMU-only recipe against the default manifest with documented firmware/device flag variations may keep a run-* name, with test-* reserved for recipes that script input or assert output. Pick this only if the policy text in CLAUDE.md/REVIEW.md can spell out the boundary unambiguously so reviewers do not have to relitigate the split per target.
  • Document the chosen policy in CLAUDE.md “Build and Test” section and REVIEW.md so future targets are added under the right prefix without case-by-case judgement.

Gate B: Init mandate enforcement

  • For every non-compliant manifest above, restructure so the init binary is init and the previous top-level binary becomes a service. Preserve the focused-proof intent: the service receives the same scoped caps it had as init, init holds only the bootstrap authority needed to spawn and supervise it, and the smoke/proof transcript continues to assert the same boundary properties.
  • Add a manifest-loader validation rule (or mkmanifest check) that rejects any manifest whose initConfig.init.binary is not init. The rule should also reject the field being missing. Update host tests to cover the negative case.
  • Update every doc that currently describes shell-led or service-led manifests as having the service as init. A 2026-04-28 12:48 UTC docs pass reconciled the current default system.cue path as standalone-init-owned while preserving focused shell-led smoke descriptions where system-smoke.cue and system-shell.cue still boot capos-shell directly. Gate B remains open until the focused manifests themselves are migrated or documented as explicit exceptions, the loader/manifest validation rule lands, and a final re-grep confirms no stale default-boot wording remains.

Gate C: Test split

  • Move every scripted input driver, transcript assertion, timeout wrapper, harness invocation, and log grep currently embedded in a run-* recipe into a new test-* recipe. The run-* side, where retained, becomes a one-line qemu-system-x86_64 ... $(QEMU_COMMON) $$serial_args invocation against the same ISO.
  • Keep tools/qemu-*-smoke.sh, tools/qemu-*-harness.sh, and the ringtap viewer assertion out of run-* recipes. They are acceptable inside test-* recipes or as standalone host scripts.
  • Update CI hooks, developer docs, and docs/tasks/README.md checkpoints that reference make run-<x> for verification to call make test-<x> instead. Audit the migrated review-finding task records, the REVIEW_FINDINGS.md tombstone history, and the recent changelog updates so historical entries stay accurate while new gates use the renamed targets.

Gate D: Default-run feature integration

  • Define an integration checklist that every milestone’s doc-update step must satisfy before close: either auto-start the new user-visible service from system.cue with safe defaults, or extend the MOTD with a clear, copy-pasteable instruction block describing how to reach the feature from the default boot.
  • Backfill the integration for already-shipped milestones whose user-visible services are still absent from make run: local-user setup, terminal-session, and the SSH gateway terminal host slice. The Telnet gateway remains a focused research fixture under make run-telnet and is deliberately absent from the default operator path. For each remaining milestone, either wire the service into system.cue (preserving the default-safe posture) or add a MOTD section with the exact command. Network-backed milestones must also record the QEMU device and forwarding posture. SSH gateway terminal-host integration remains deferred until its production/non-loopback gates pass or a separate host-local development forwarding rule is reviewed. A MOTD-only addition is not sufficient for a network-backed milestone.
  • Add the integration checklist to the “Stage Implementation Workflow” section of CLAUDE.md so future milestones cannot land without it.

Interaction With Paused SSH Shell Gateway Milestone

docs/tasks/README.md currently pauses the SSH Shell Gateway behind Service Object Identity Migration. When SSH work resumes, it will still have a visible goal of make run-ssh-shell and additional make run-ssh-* proofs. Without an explicit checkpoint, that milestone can land more non-compliant run-* recipes (scripted host harnesses, transcript asserts, network-only smokes) before this backlog is applied.

  • Before the SSH Shell Gateway milestone closes, add Gate A’s naming decision and Gate C’s test split as a milestone-level prerequisite: the user-visible target name (run-ssh-shell vs test-ssh-shell) and the location of any host harness must conform to the chosen rename split, and make run integration must be addressed under Gate D rather than left as a separate run-ssh-* recipe. Record the decision in the SSH milestone checkpoint or block its closeout.

Sequencing

Gate A is purely policy and naming and unblocks the others. Gate B (init mandate) and Gate C (test split) can proceed in parallel on separate branches per affected manifest area, because they touch different files: B rewrites system-*.cue and may add services to init/src/main.rs, while C touches Makefile and the tools/qemu-* harnesses. Gate D follows once the test split lands so MOTD updates land alongside system.cue changes without competing with make run’s recipe.

Out Of Scope

  • Renaming or relocating tools/qemu-*-smoke.sh and tools/qemu-*-harness.sh scripts. They stay where they are; only their callers change.
  • Producing a new test runner that aggregates all test-* targets. That is a separate CI ergonomics task.
  • Reworking the focused-proof transcript content. The intent is to preserve current proof coverage, not extend it.

Aurelian Frontier Backlog

Detailed decomposition for growing the current deterministic mission slice into the Aurelian Frontier game described in docs/proposals/aurelian-frontier-proposal.md.

This track is low priority and currently dormant (deprioritized in docs/roadmap.md under “Game/demo plans … are deprioritized”). It is a forward decomposition reservoir, not a landed-history log: completed-phase milestone chronology lives in docs/roadmap.md (the dated Aurelian Phase 9-12 entries) and in git history. This file keeps the forward-looking plan, the unstarted gates/themes, and one-line orientation for why current shapes exist. Promote it into docs/tasks/state.toml and root task records only when the selected visible outcome changes to a game-depth milestone.

Current Baseline

The deterministic Aurelian expedition slice is landed: a shell-spawned adventure-client with explicit StdIO/Adventure/Chat grants drives a session-keyed adventure-server that owns room, inventory, combat, writ, evidence, and effect state. Typed Adventure methods cover look, movement, inventory, inspect, use, status, combat, authority verbs, delegation, order, seal, leave, and the market/repair verbs. The expedition mission proves ward-writ, route evidence, ward-wraith combat, delegation, effects, eagle-standard recovery, witness-certified custody, evacuation, gate sealing, downed-state refusal, and leave cleanup. adventure-content owns the pure deterministic combat/zone/profile foundation and bounded construction-job state. Inventory/status splits into Items, Writs, Relics, Marks, Evidence.

Phase-level done/not-done state is encoded in the checkbox lists below; the dated landed milestones for each phase are in docs/roadmap.md and git.

Known limitations (still open):

  • Most state-transition/failure text still lives in Rust handlers. Authored item, spell, and use text has moved into generated content for the named slices; broader text migration is open.
  • NPCs that matter to world state are mostly server text, not separate actors holding scoped game authority. Aurelian chat-only boot NPCs share init’s system session under session-bound chat membership, so the smoke proof treats them as one session-keyed chat member (all greetings visible, Centurion Varro the single deterministic polling reply actor). Distinct concurrent NPC chat memberships need distinct spawned session contexts.
  • Combat profiles are generated and proven for the current mobs, but broad weapon parsing, durable alert groups, pending interruption state, generalized stealth openings beyond the imp-scout route slice, and broader authority-combat verbs remain open.
  • Rank, faction, debrief, market, party, and item-transfer logic are bounded proof slices, not durable profile/ledger subsystems. PvP consent and two-client multiplayer proofs are not present.
  • Construction jobs are bounded to one service-owned field-repair proof. They do not yet persist durable stock ledgers, replenish from outposts, update output/currency inventories, advance job time, persist crash-recovery state, or expose a general crafting API.

Implementation Posture

The kernel capability model remains the authority boundary. Game code should not be trusted because it is written in Rust or Lua; it should be trusted only to the extent that it holds narrow caps and correctly uses typed capOS interfaces. A useful game demo should eventually show both Rust and Lua code using the capability model properly.

Rust remains the right implementation language for bounded state, no-std userspace services, typed Cap’n Proto calls, deterministic QEMU proofs, and resource validation.

Do not let Rust become the long-term content authoring language. Larger room graphs, mission beats, item descriptions, dialogue hints, aliases, shop catalogs, and debrief text should move into a bounded data-driven mission format before the Aurelian content grows materially.

Keep this split:

  • The kernel owns authority enforcement through capabilities, while Rust services own simulation rules, combat resolution, object limits, schema encoding, and failure behavior.
  • Mission content owns room/site data, visible descriptions, actor dialogue, aliases, lead text, deterministic encounter placement, and debrief records.
  • Lua can later own deterministic scenario glue and NPC behavior when the capos-lua runner exists: mission beats, state-machine dialogue, debrief variants, quest-board text, and scripted reactions that still call typed capOS/game interfaces through granted caps.
  • Runtime loading may stay compile-time embedded at first, but the content must pass the same validator used by host tests and QEMU smoke setup.

Candidate content formats:

  • CUE plus mkmanifest cue-to-capnp: preferred for new schema-rooted data messages now that host-side CUE evaluation can feed a caller-specified Cap’n Proto struct through the pinned capnp convert path.
  • RON: compact Rust-native authoring, but adds another format and tooling convention.
  • TOML: familiar for simple data, weaker for graph validation and nested mission rules.

Prefer CUE if the implementation can reuse existing host-side validation and generate a bounded Rust data blob. Avoid runtime parsing in the game service until there is a concrete reason.

New Aurelian content migrations should use the cue-to-capnp flow when the data has, or needs, a stable schema boundary:

  1. Define a bounded Cap’n Proto root struct for the content slice rather than extending SystemManifest or encoding ad hoc JSON.

  2. Author the source as CUE in package mode, with the same id/text/list bounds documented here and with build-time variation supplied through --tag or CAPOS_CUE_TAGS only when the generated output is intentionally tagged.

  3. Convert with the pinned tools:

    make cue-ensure capnp-ensure
    CAPOS_CUE="$(make -s cue-path)" \
    CAPOS_CAPNP="$(make -s capnp-path)" \
    cargo run --manifest-path tools/mkmanifest/Cargo.toml --target "$(rustc -vV | awk '/^host:/ {print $2}')" -- \
        cue-to-capnp --package adventure_content --import-path schema \
        demos/adventure-content/content/prototype.cue schema/adventure-content.capnp \
        AdventureContent target/generated-adventure-content.bin
    
  4. Feed the converted data into the existing host validator/generator or a reviewed no-std decode path, then check in only deterministic generated artifacts required by the current build.

  5. Keep live capOS authority out of content files. Writs, grants, NPC roles, and future service references may be represented as ids or policy records, but actual capability transfer stays in runtime IPC and service logic.

The existing tools/adventure-content-gen JSON-to-Rust path may remain for already implemented slices. When a new content family needs a schema or a larger migration touches generator boundaries, prefer moving that family to cue-to-capnp instead of growing bespoke JSON parsing.

Near-Phase Gates

The first game-depth milestone must produce a player-visible improvement. A branch that only moves the existing hardcoded room data into a generated blob is technical prep, not completion of the near phase. The first complete near-phase slice must keep the current Aurelian expedition mechanically stable while also making the path discoverable through canonical ids, aliases, lead text, and specific failure messages.

Legacy endpoint badges are not part of the Aurelian authority model. New Aurelian phases must keep player, party, NPC, and chat participation keyed by session-bound invocation context or by future broker-granted service facets, not by manifest-assigned or user-selected receiver selectors. The focused run-adventure gate rejects system-adventure.cue if badge: fields are reintroduced.

Input and content bounds for the near phase:

  • command lines accepted through the current StdIO adapter: 256 bytes;
  • typed object ids, actor ids, mob ids, writ ids, directions, spell names, and skill names: 64 bytes, ASCII alphanumeric plus _ and -;
  • chat say text and future free-form command text: 256 bytes after trimming, with no semantic parsing beyond the declared text field;
  • generated content ids and aliases: same 64-byte id rule unless a reviewed schema/runtime change raises it;
  • room/site titles: 80 bytes; descriptions: 320 bytes; lead and failure hint lines: 160 bytes; actor dialogue and debrief lines: 320 bytes;
  • content lists must use the explicit per-player, per-site, and per-room caps in this file, not unbounded vectors.

If generated mission content is checked in, every branch that changes content or the generator must provide a freshness check equivalent to make generated-code-check; stale generated Rust blobs are a review finding.

Authority-RPG Direction

The next design target is a compact expedition RPG where rare authority is RPG power fantasy, not paperwork. The core loop is:

accept mission
choose writs / companions / relics
enter dangerous site
discover authority conflicts
fight / negotiate / delegate / revoke
extract with loot, survivors, evidence, or consequences
upgrade rank, base, companions, and future authority

Design rules for subsequent backlog slices:

  • Writs are loot: gear, skill tree, access key, social status, and sometimes curse. A good writ changes what the player can do, carries inspectable issuer/scope/expiry/delegation/revocation rules, and may have bounded affixes or drawbacks under the mission seed.
  • Classes are authority archetypes: Warden, Marshal, Archivist, Custodian, Factor, and Heretic/Renegade. Differences come from legal, social, and supernatural verbs, not generic damage numbers.
  • Delegation is buildcraft. Companion loyalty, ambition, competence, reputation, fear, and doctrine should affect how delegated authority behaves under pressure.
  • Combat attacks authority as well as HP. Forgers, null-priests, bandit captains, corrupt magistrates, spies, oathbreakers, and wraiths should threaten writs, custody, witnesses, route grants, and legal control.
  • Denial should reward with leads: a missing witness, hidden jurisdiction, forged seal, rival claim, corrupt actor, unsafe state, rank gate, or alternate route.
  • Progression unlocks reach: new jurisdictions, deputy appointment, remote revocation, relic custody capacity, hostile negotiation, disputed shrine access, and operating without a local witness in constrained cases.
  • Base modules unlock verbs. Archive, Temple vault, Barracks, Court, Market hall, Signal tower, and Sanctuary should affect future expeditions through explicit actions, not passive percentage bonuses.
  • Controlled randomness covers mission complications, route hazards, faction demands, companion behavior, relic side effects, enemy authority tricks, optional objectives, and loot/writ modifiers. The legal model remains deterministic and auditable under a seed.
  • Multiplayer stays scoped to cooperative expedition pressure first. Defer MMO scale, open economies, broad construction seasons, LLM-critical NPCs, federation, and worldlines until the compact expedition loop is excellent.

The pure combat-targeting foundation, generated combat profiles, server integration, and the bounded authority-challenge / writ-affix / delegation / Archive-reach proofs have landed (see the Phase 8/9 checkboxes). Remaining forward sequence:

  1. Extend the first authority-attacking enemy behavior beyond the bounded forged route/custody claim into broader authority-bearing enemy variants.
  2. Generalize writ-affix and delegation-buildcraft proofs beyond the single bounded ward-writ/Livia cases into more writs and companions.
  3. Extend base/rank reach unlocks beyond the bounded Archive evidence unlock without starting a general construction/base-management system.
  4. Extend construction jobs only after a visible gameplay need appears: durable stock ledgers, job-time advancement, artifact custody outputs, and facility slot capacity remain future work.
  5. Keep proofs deterministic: pure Rust tests for new rules and one adventure-scenario-test path per new cross-service behavior; keep the shell transcript to representative parser coverage.

Phase 1: Player-Visible Mission Substrate

Visible outcome: a first-time player can complete the current Aurelian expedition without reading source or memorizing hidden ids, and the read-only mission content comes from a validated generated blob instead of hardcoded room tables and scattered text. The mission path and existing QEMU transcript outcomes stay stable, but look, status, inspection, and failures become clearer.

  • Define a bounded AdventureContent model for sites, exits, visible items, actors, mobs, aliases, objectives, leads, and scripted proof-path metadata.
  • Add host validation for content graph integrity: unique ids, valid exits, valid aliases, referenced actor/item/mob ids, bounded text length, and deterministic ordering.
  • Generate or embed a compact static Rust representation for userspace; keep runtime parsing out of the no_std service unless explicitly justified.
  • Add a generated-content freshness check and wire it into the relevant branch verification so checked-in content blobs cannot drift from source mission data.
  • Move current square, tavern, garden, cellar, map, coin, key, scout-marker, and ward-wraith descriptors into content data.
  • Keep all state-changing behavior in Rust handlers; content may select text and ids but must not bypass authority checks.
  • Extend AdventureRoomView or status text so look presents objective, visible interactables, actors, active mobs, exits, and one lead line.
  • Add canonical-id display for objects, actors, mobs, writs, and exits.
  • Add alias resolution for common casing and titles, with responses that name the resolved canonical id.
  • Add near-miss suggestions for known ids, starting with common failures such as ward -> ward-writ, wraith -> ward-wraith, and livia casing.
  • Improve invalid order results so they name plausible next actions when player knowledge allows it.
  • Split status text into survival state, mission state, held/delegated authority, evidence/effects, and lead.
  • Add host tests for rejecting malformed content graphs.
  • Keep make run-adventure transcript stable after the migration and add assertions for at least one canonical-id suggestion and one improved actor-task hint.

Implementation notes:

  • Start with read-only content fields. Do not introduce a general scripting engine for mission logic in this phase.
  • Keep object ids ASCII, stable, and bounded by the near-phase limits above unless a reviewed schema/runtime change raises those limits.
  • Lua scripting belongs after the data model exists. Do not use Lua to bypass the content validator or make transcript-critical behavior depend on an unbounded script.

Phase 1b: Deterministic Scenario Scripting

Visible outcome: once capos-lua can run scripts with exact grants, selected scenario and NPC behaviors can move from Rust match branches into deterministic Lua scripts without changing the authority boundary.

  • Use docs/proposals/lua-scripting-proposal.md as the scripting design source.
  • Expose only narrow game host APIs to scripts, such as read current mission state, choose a dialogue branch, emit a debrief line, or request a typed game action through a granted object cap.
  • Keep mission authority, inventory mutation, relic custody, combat damage, and cap transfer in kernel-enforced capability calls and Rust service handlers.
  • Add deterministic script fixture tests for NPC state machines and scenario beats.
  • Add QEMU transcript coverage showing one Lua-scripted NPC or scenario reaction using a granted cap and one denied ungranted path.
  • Keep Rust and Lua examples side by side so the demo proves capability discipline is language-independent.

Cut scope:

  • No dynamic native Lua modules, no broad ProcessSpawner, no raw CapIds in scripts, and no script-owned authority beyond the runner’s CapSet.

Phase 1c: Non-Deterministic NPC Brains

Visible outcome: non-transcript-critical NPC flavor can later use the language-model/agent proposals without weakening deterministic proofs.

  • Use docs/proposals/llm-and-agent-proposal.md for any LLM-backed NPC implementation.
  • Keep LLM NPCs behind narrow caps and treat model outputs as suggestions or dialogue data, not authority.
  • Restrict LLM use to ambient tavern chatter, optional hints, flavor summaries, or player-facing explanation when exact transcript output is not part of the proof.
  • Keep main mission success paths, combat outcomes, custody decisions, policy denials, and QEMU smoke assertions deterministic.

Deferred from Phase 1:

  • Dynamic completions belong with the future CommandSession interface and should not duplicate full parser logic in the StdIO adapter.

Phase 2: Aurelian Expedition Map

Visible outcome: the playable mission uses the proposed frontier expedition locations rather than the four-room prototype.

  • Replace prototype content with a small Site graph: fort_aurelian, gate_yard, ashen_road, signal_tower, and under_vault.
  • Model site metadata: region, threat level, exits, visible items, actors, active wards, and optional required route authority.
  • Implement the first mission objective: recover eagle-standard from the ruined signal tower.
  • Add complications: unstable tower gate, wounded legionary behind a ward, guild scout route information, and temple witness custody requirements.
  • Provide at least two acceptable good outcomes, such as recovered standard plus sealed gate, or recovered standard plus survivor evacuation.
  • Update make run-adventure to drive the new mission path with stable assertions.

Cut scope:

  • Do not add random mission variants in this phase.
  • Do not split mission state into a new service until the single-server model blocks explicit authority or proof coverage.

Phase 3: Authority Inventory And Relic Custody

Visible outcome: player-facing inventory makes authority, evidence, and relic custody visible without implying every entry is a pick-up item.

  • Split inventory/status output into Items, Writs, Relics, Marks, and Evidence.
  • Keep take and drop for physical items only.
  • Keep request, accept, delegate, and revoke for authorities.
  • Add relic custody state for eagle-standard, including a failure path when the player lacks temple or rank authority.
  • Add temple-seal or equivalent witness-certified custody proof.
  • Ensure relic failures distinguish missing location, missing authority, unsafe state, and witness refusal.
  • Add QEMU assertions for relic custody denial, successful custody, and audit/evidence status output. Complex custody coverage runs in the capOS adventure-scenario-test userspace process through real Adventure cap calls; the shell-driven adventure-client transcript remains representative interactive client coverage.

Phase 4: Persistent Profile And Ledger Substrate

Visible outcome: player profile data and mission evidence have bounded save/load semantics, while ordinary client launches remain fresh unless the player explicitly resumes an expedition.

  • Define bounded Cap’n Proto records for AdventureProfile, AdventureExpeditionCheckpoint, and AdventureLedgerRecord, including schema version, content hash or release id, profile id, record/checkpoint version, size limits, and migration policy.
  • Add host tests for save-record encode/decode, first schema-version acceptance, unknown-content rejection, over-limit rejection, stale-version rejection, and wrong-profile rejection.
  • Add the AdventureProfileService summary substrate for bounded create/load/save, local non-reward settings and progression updates, and validation of rank marks, warrior stars, wizard circles, faction standing, cosmetics, contributor badges, title choices, and settings.
  • Connect AdventureProfileService reward and title mutations to ledger-backed authorization once AdventureLedger exists, so rank marks, faction standing, cosmetics, contributor badges, and title choices are applied from auditable mission facts rather than direct summary edits.
  • Add AdventureLedger as append-only mission evidence: debrief records, relic custody, forbidden-rite use, witness certifications, reward mints, market/trade receipts, and revocations.
  • Add AdventureExpeditionService for active expedition checkpoints: current site, objective state, player state, party state, mob state, pending events, and turn ordering.
  • Add AdventureSaveStore as the only persistence adapter used by the profile, ledger, and expedition services. It may target RAM, local disk-backed Store/Namespace, or a future CloudGameStore, but gameplay services should not call provider-specific APIs directly.
  • Prove the local baseline first: save and reload a profile, append and replay one ledger record, and explicitly checkpoint/resume one expedition through RAM-backed or disk-backed store semantics.
  • Keep run adventure-client fresh by default. Add an explicit resume command or profile option before loading active expedition state.
  • Add proof coverage for one rejected stale checkpoint write and one rejected wrong-profile load.

Cut scope:

  • Do not make the kernel persist process memory or the live capability graph.
  • Do not merge divergent combat checkpoints automatically; reject stale writes and require the player or service to pick a checkpoint.
  • Do not require GCP to pass the local QEMU proof path.

Phase 5: Cloud Persistence Bridge

Visible outcome: the same profile, ledger, and expedition records can be stored through an optional cloud-backed capability without changing game service logic.

  • Define CloudGameStore as a narrow bridge with save/load/append operations matching the local AdventureSaveStore semantics.
  • Keep the GCP bridge outside the game authority boundary: the bridge stores records, but AdventureProfileService, AdventureLedger, and AdventureExpeditionService decide which mutations are valid.
  • Use Firestore Native mode only for mutable profile/index documents and transactional compare-and-set style updates.
  • Use Cloud Storage for versioned snapshots and larger evidence blobs, with object versioning and lifecycle policy so old snapshots do not accumulate without bounds.
  • Use Cloud Run or an equivalent narrow service endpoint for the bridge and Secret Manager for bridge-side service credentials. Do not expose those credentials inside ordinary game clients.
  • Add local fake-cloud tests that enforce the same stale-write, wrong-profile, append-only-ledger, and size-bound behavior before using real GCP services.
  • Add an operational note for project, region, IAM service account, retention, backup/export, and cost controls before any real deployment.

Operational note:

  • The first real deployment must use a dedicated Google Cloud project per game-world environment, or an equivalently isolated folder/project split for development, staging, and production. Record the project id, numeric project number, billing account owner, support contact, and break-glass owner in the deployment runbook before enabling writes.
  • Choose one primary region for the Cloud Run bridge, Firestore database, Cloud Storage buckets, Secret Manager secrets, and Cloud KMS keys unless a reviewed multi-region design exists. The runbook must name the region and the data-residency reason; cross-region replication is a separate design decision because it affects latency, cost, and recovery semantics.
  • Cloud Run is the only provider-facing bridge endpoint in this phase. Ordinary capOS game clients see only the CloudGameStore capability and never receive Firestore document names, bucket names, OAuth tokens, service account keys, Secret Manager secret names, or broad network/provider authority.
  • The bridge must not be public. Launch requires authenticated invocation, no allUsers or disabled-invoker-IAM setting, an explicit Cloud Run ingress mode, and a named invoker identity for the capOS bridge path. Public HTTPS exposure or unauthenticated browser calls would bypass the CloudGameStore capability boundary even if provider credentials remain hidden.
  • The bridge runs as a dedicated service account. Isolate Firestore by database or project boundary, then enforce adventure collection/document path allowlists in bridge code before issuing provider calls; do not rely on Firestore security rules or collection-scoped IAM for server-side access. Grant only the database-level Firestore role needed by the isolated database, Cloud Storage object access for the configured adventure buckets, Secret Manager secret access for named bridge secrets, and KMS encrypt/decrypt authority for the configured game-world key. Do not grant project owner/editor, wildcard bucket admin, or user-browser OAuth authority to the bridge.
  • Firestore Native mode holds mutable profile/index documents and version/CAS records only. Every mutable write must read the current document version and commit inside a transaction or equivalent preconditioned update; stale writes fail closed and preserve the current document.
  • Cloud Storage holds immutable or versioned records: expedition snapshots, larger evidence blobs, exports, and content-addressed objects. Buckets must enable object versioning before production writes and must have a lifecycle policy bounding noncurrent versions and abandoned exports. Versioning is recovery, not immutability: create-only evidence and content-addressed writes must use generation-match preconditions, and audit evidence that must resist replacement or deletion needs an explicit retention policy or hold gate before launch.
  • Retention policy belongs in the runbook before launch: profile/index documents keep only the current mutable summary plus required audit references; ledger/evidence objects retain enough noncurrent versions for recovery and audit; debug exports and test objects have a short TTL. Legal hold, public world audit, or contributor-reward evidence retention needs separate approval before becoming indefinite.
  • Backup/export is explicit. Schedule Firestore exports and Cloud Storage inventory or backup jobs to a separate restricted bucket, record restore drills, and verify restore through CloudGameStore validation rather than accepting provider bytes as authoritative.
  • Cost controls are launch gates: configure budgets and alerts for Cloud Run requests/egress, Firestore reads/writes/storage, Cloud Storage live and noncurrent object bytes, KMS operations, and Secret Manager access. Add lifecycle rules before enabling object versioning so stale snapshots do not grow without bounds.
  • Provider credentials stay bridge-side. Prefer service account identity and Secret Manager references over static keys. If a static credential is unavoidable for a development bridge, record its rotation owner, expiry, allowed environment, and revocation procedure; never put it in manifests, game save records, browser JavaScript, or QEMU transcripts.

Cut scope:

  • No direct Firestore/Cloud Storage calls from adventure-client or adventure-server.

Sequencing note:

  • Cross-device multiplayer through GCP is on the roadmap, but it must wait until local multiplayer authority, session-bound invocation context, and stale-write rejection are already correct behind AdventureSaveStore and CloudGameStore. The cut is sequencing, not a permanent scope exclusion.

Phase 6: User-Owned Browser Save Vault

Visible outcome: private player data can be exported and imported as signed, encrypted save capsules through a browser using user-granted Google Drive or Firebase authority, without making those blobs authoritative for shared world state.

  • Define UserSaveCapsule with schema version, capsule version, profile id, device id, content hash, migration policy, record kind/version, previous capsule hash, plaintext hash, ciphertext, AEAD algorithm, signature algorithm, signer public key id, signature, and timestamp.
  • Define the save-vault key-boundary policy model for local capOS-host key material, GCP game-world Cloud KMS authority, and browser transport authority.
  • Use storage-domain encryption keys: local capOS-host key material for local storage and GCP Cloud KMS envelope encryption for GCP-backed data, with a per-world or per-shard KMS KEK wrapping service-owned DEKs. The browser transports ciphertext and provider handles; it must not receive DEKs, SymmetricKey caps, KeySource caps, KMS decrypt/unwrap grants, or provider-independent plaintext authority.
  • Prefer Google Drive appDataFolder with the narrow drive.appdata scope for personal backup files that the user should not edit directly.
  • Allow Firebase/Firestore user documents only as a transport/cache for encrypted capsules. Firestore/Firebase rules can bind access to the authenticated user through an explicit {request.auth.uid} path template, but cannot validate encrypted game semantics.
  • Add KMS/IAM design notes for the GCP path: one key ring/key per game-world instance or shard, narrow decrypt authority for the game-world service, key rotation policy, and revocation behavior for retired worlds.
  • Add restore validation in AdventureSaveStore: signature, content hash, schema version, profile id, previous hash, monotonic version, size bounds, and wrong-profile rejection.
  • Add rollback policy: importing an older private checkpoint may restore an explicit local expedition snapshot, but it must not erase append-only ledger facts, contributor rewards, market receipts, or public multiplayer outcomes.
  • Add host tests for tampered ciphertext, wrong signing key, wrong profile, stale version, unknown content hash, oversized capsule, and replayed old capsule.
  • Add a web-terminal or browser-companion fixture path with fake Drive and fake Firebase adapters before using real Google APIs.

Cut scope:

  • No authoritative public world state from user-owned blobs.
  • No direct provider SDKs inside adventure-server.
  • No mandatory Google account for local QEMU adventure proof.
  • No silent cloud sync; export/import or sync must be visible user action or profile setting.
  • No browser-held game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority.

Phase 7: Actors As Capability-Bounded Processes

Visible outcome: important NPCs have process identity and only the capabilities their role needs.

  • Keep adventure-server as the authority owner until direct NPC mutation needs are explicit.
  • Add actor content and chat behavior for Centurion Varro, Magister Livia, Acolyte Iunia, Maro the Guild Scout, Wounded Legionary, and Gate Echo.
  • Give chat-only NPC processes only console and the narrowest available chat authority. The focused manifest uses selector-free chat grants; user-selectable or manifest-assigned receiver selectors must not be part of this proof.
  • For any NPC that can affect world state, add a separate scoped broker-granted AdventureNpc facet or equivalent session-bound service authority. Do not use receiver-selector compatibility grants as NPC mutation authority.
  • Route NPC offers and refusals through player-visible commands and chat events rather than hidden server side effects.
  • Add focused smoke assertions proving each resident chat-only NPC process launches and contributes visible room chat history under session-bound chat membership.
  • Add distinct service sessions, chat participant ids, or a scoped AdventureNpc facet before requiring every boot-launched NPC process to act as an independently polling chat participant.

Current shape: system-adventure.cue launches the six named actor processes with only console plus selector-free chat grants. Because boot-launched actors inherit init’s system session, chat membership intentionally collapses to the service-scoped caller-session key, and make run-adventure proves each named actor published visible room history with Centurion Varro as the single deterministic polling reply. Independent per-NPC chat participants and direct world-mutation NPC authority remain the open [ ] items above.

Phase 8: Tactical Combat And Mob State

Visible outcome: combat remains deterministic and bounded, but offers more than repeating attack.

  • Add a bounded mob model with hp, armor, ward, attack, morale, traits, intent, and threat level.
  • Keep ward-wraith; add at least two of imp-scout, ash-ghoul, gate-hound, and echo-centurion.
  • Implement command-level turns: player action, eligible ally action, hostile action, deterministic transcript.
  • Add visible intent when scout or wizard support makes it available.
  • Add retreat and at least one blocked-retreat failure.
  • Extend guard to protect an ally when one is present.
  • Add QEMU assertions for one intent line, one ally-related combat action, and one deterministic hostile response.

Cut scope:

  • No random combat outcomes until seeded mission variants land.
  • No hidden dice rolls that make QEMU transcript assertions fragile.

Follow-up combat architecture, grounded by Game Mechanics Prior Art. Most of this is landed (deterministic target-zone damage, fatigue, interrupt, recognition disclosure, stealth openings, alert-source generalization, construction-fed weapon/focus/cloak combat, sustained-magic fatigue refusal, and scenario coverage); the open forward items are:

  • Use Evil Islands as planning input for tactical fight shape (targeted body zones, damage-type/armor matchups, stealth openings, visibility-dependent recognition, fatigue/retreat pressure, cast interruption, equipment-derived effects). Not a clone target; Aurelian keeps command-level turns, capability-gated authority, deterministic smoke coverage, and service-owned outcomes.
  • Move mob combat definitions out of hard-coded adventure-server templates into validated generated content once the next combat slice needs more than the current generated profile fields (damage affinities, zone armor, alert groups, recognition thresholds, stealth-opening permissions, cast-interrupt vulnerability).
  • Extend adventure-content pure logic before server integration: CombatZone, DamageKind, CombatAttackProfile, MobCombatProfile, deterministic target-zone damage, fatigue cost, interrupt outcome, recognition level, and alert propagation helpers.
  • Extend CUE content and tools/adventure-content-gen beyond the current generated mob combat profiles when alert groups, stealth openings, or richer profile references land.
  • Add typed Adventure surface only where the existing text target cannot stay unambiguous (e.g. structured target/zone/weapon fields for the browser client); current explicit-zone parsing already covers the proof commands.
  • Update AdventureRoomView/status output for inspected vs rough mob intel.
  • Keep adventure-server as the authoritative combat state owner. Durable alert state, broader limb persistence, and pending multi-turn interruption remain future work.
  • Add targeted attacks with a small fixed zone set: head, hands, legs, core, with deterministic zone effects.
  • Add damage-type and mitigation metadata for weapons, spells, armor, ward state, and zone armor, with explicit result text.
  • Make enemy recognition depend on scout/wizard support, distance, direct inspection, and prior codex evidence.
  • Add height and route-position inputs to enemy recognition once room topology and browser-client world positioning expose those facts as structured state rather than server-local command context.
  • Add stealth-opening support for ambush/backstab-style advantages.
  • Add a bounded pull/alert behavior for the ward-wraith to gate-hound path.
  • Add a bounded imp-scout warning path.
  • Generalize alert-source resolution across ward-wraith alarm, imp-scout warning, and escaping-scout paths.
  • Add bounded failed-stealth gameplay integration for route-supported imp-scout attacks lacking scout-track evidence.
  • Add bounded noisy-movement gameplay for recovered relic movement.
  • Add broader noisy-movement integration beyond the current relic movement, ward-wraith, and imp-scout paths.
  • Tie combat output to equipment construction inputs from Phase 11c: weapon/shield/focus/cloak object type, material, facility quality, warrior stars, wizard circles, and remaining enchantment budget affect bounded damage, guard, fatigue, interruption, and resistances. Bounded slices for shield-wall cloak, bronze-gladius weapon, and ember-dart focus have landed; broader equipment handling, construction jobs, and durable runtime inventory semantics remain open.
  • Add a bounded sustained-magic fatigue refusal for the shield-bind path.
  • Generalize explicit fatigue and cast-interruption rules for heavy equipment, running, retreat, additional sustained magic, and monster fatigue, creating meaningful retreat/guard choices rather than hidden penalties, and without unfair infinite-fatigue monster behavior.
  • Add QEMU scenario coverage through adventure-scenario-test for inspected targeted attack, damage/armor explanation, stealth/scout opening, alert/pull response, cast interruption/fatigue refusal, and retreat/blocked-retreat.
  • Keep rewards mission-audited. Do not add enemy grinding as a rank, warrior-star, wizard-circle, or faction-standing source.

Phase 9: Skills, Spells, Ranks, And Reputation

Visible outcome: player competence affects available actions and future grants without becoming a grind.

  • Model player rank labels: tiro, signifer, centurion, and legate.
  • Keep warrior stars and wizard circles visible in status, but make them policy inputs for brokered authorities.
  • Add missing skills from the proposal as needed by the first mission: shield-wall, counter, rally, or narrowed equivalents.
  • Add missing spells as needed by the first mission: mend-wound and stabilize-gate before higher-circle spells.
  • Add explicit failure text when rank, stars, or circles block an action.
  • Add debrief outcomes that update rank marks, faction standing, and evidence records from auditable mission facts.
  • Add QEMU assertions for one rank/circle denial and one debrief reward.

Deferred:

  • dome-shield, demon-brand, and high-circle gate rewriting are later campaign scope unless a focused proof needs them. rally remains explicitly reserved for later centurion command authority.

Phase 10: Market And Logistics

Visible outcome: the shopkeeper becomes a small capability-shaped economy proof instead of flavor chat.

  • Add typed verbs for quote, buy, sell, trade, and repair before accepting them as implemented gameplay.
  • Define bounded market roles: quartermaster, guild scout, temple annex, and field engineer.
  • Implement one deterministic route purchase or favor exchange with Maro.
  • Implement one authority-gated refusal, such as focus equipment requiring wizard circle 1 or temple certification requiring clean custody.
  • Define trade/custody transfer as a service-mediated transaction protocol, not two save-file edits: reserve or escrow both sides, commit or release with idempotency keys, reject stale versions, record one ordered ledger receipt, and specify cancellation, retry, and crash-recovery behavior.
  • Ensure prices and blocked authority are named in failure text.
  • Add QEMU assertions for one quote, one successful exchange, and one rejected trade explaining the gate.

Planning input, grounded by Game Mechanics Prior Art: use external game-mechanics research as planning input only, not a clone target. Stardew Valley is useful for calendar pressure, seasonal resource tables, festivals, routine changes, quests, gifts, affection, and season-bound crops. EVE Online is useful for regional markets, market-eligible item classes, brokered buy/sell orders, immediate matching, and blueprint/material/facility manufacturing constraints. Evil Islands is useful for equipment construction and the targeted combat model. The capOS translation turns these stable mechanics into the capability-shaped tasks in Phases 11-12: seasonal cycles, regional settlements/outposts, service-owned order books, blueprint/artifact construction, targeted deterministic combat, token-budgeted agent NPCs, and a rich tilemap client.

Phase 11: Seeded Variation

Visible outcome: repeated runs vary content meaningfully across normal play, while the smoke transcript stays reproducible under a fixed seed.

Current shape: live adventure player state is keyed by endpoint caller-session scoped refs, and generated mission content carries fixed smoke seed/variant metadata printed in status and asserted by the scenario cap-call path. This is the deterministic seed-metadata foundation only; seeded gameplay variation, production per-run seeds, festivals, NPC routines, and full seasonal economy behavior remain open.

  • Add generated mission content fields for a fixed smoke seed label and selected variant metadata.
  • Add manifest or mission setup field for a fixed mission seed and a separate per-run seed for production play.
  • Print seed and selected variant metadata in transcript/debug mode.
  • Seed mob placement, optional hazards, shop inventory, rumor lines, loot cache locations, debrief complications, and ambient encounter timing.
  • Seed seasonal state for normal play: season, day, weather/hazard class, seasonal resources, festival/event hooks, and NPC routine variants. The deterministic smoke seed forces a stable generated calendar state, but normal-play seed selection remains open.
  • Keep season-sensitive resource tables bounded: crops, forage, fish, shop stock, route hazards, and outpost production all have explicit per-site caps and stable sorted output under a fixed seed.
  • Keep combat outcomes reproducible under a fixed seed; production play may add bounded variance per turn as long as the smoke seed reproduces the recorded transcript.
  • Add scenario assertion for seed and variant metadata through real Adventure cap calls.

Phase 11a: Calendar, Seasons, And Resource Cycles

Visible outcome: the frontier feels alive across repeated sessions without making proof transcripts nondeterministic.

  • Add an AdventureCalendar model with four 28-day seasons as the initial default, explicit day advancement rules, and debug output for the fixed smoke seed.
  • Attach bounded fixed-smoke seasonal availability primitives to generated content for crops, forage, fish, shop inventory, route hazards, and repair/material production. Multi-season resources must be declared explicitly.
  • Apply seasonal availability to gameplay systems. Bounded slices have landed: quartermaster field-rations quotes read the fixed-smoke seasonal shop-stock table; Adventure.status forecasts carried seasonal crops expiring and fish/forage degrading at the next season change; the season-transition ask path applies the next-season transition to actual player inventory (crops expire, fish/forage become -degraded tokens); and the field-rations buy path spends audited Aurelian standing and records per-expedition seasonal stock usage. Broader season advancement, economy, persistence, market orders, seeded normal-play calendars, and automatic world mutation remain open.
  • Add festival and military-event records that can temporarily expose actor-location, shop, witness, route, and rumor metadata. This is metadata/status only; actual gameplay mutation remains open.
  • Give named actors bounded routine variants by season, festival, mission beat, and local emergency, visible as structured actor presence/state. This is metadata/status selection only; it does not move actors or mutate authority.
  • Add simple quest/gift/affection hooks only after profile and ledger facts can record them. Daily interactions and gifts should affect actor standing through auditable records, not client-owned counters.
  • Add pure Rust unit tests for calendar rollover, season/day bounds, seasonal resource eligibility, multi-season exceptions, and stable fixed-seed ordering.
  • Add pure Rust unit tests for festival scheduling once festival records exist.

Phase 11b: Regional Settlements, Outposts, And Trade Routes

Visible outcome: Aurelian is one settlement in a wider frontier economy with multiple cities, outposts, production sites, and routes.

  • Model more than one settlement: fort_aurelian remains the proof settlement, while later content can add at least one civilian city, one temple-administered site, one guild waystation, and multiple resource outposts.
  • Define outpost roles such as mine, farm, timber camp, shrine, gate-yard, salvage yard, and repair yard. Each role produces bounded resources, consumes supplies, exposes route risks, and may require specific writs.
  • Add region and route metadata: distance, hazard, faction control, route authority, cargo limits, seasonal closure, and known-safe/unknown states.
  • Extend markets from actor-local deterministic handlers toward a service-owned regional market with market-eligible item classes, brokered buy orders, sell orders, price/time priority, immediate matching when price crosses, expiry, fees, and ordered ledger receipts.
  • Add the bounded generated-content order-book foundation for regional markets: market book id/location/settlement, buy/sell side, item id, price, quantity, expiry day/duration, fee, owner actor/faction/outpost, receipt ledger id, pure validation, and deterministic non-mutating price-cross matching.
  • Add the first bounded service-mediated transaction proof on top of the generated regional order books: reserve one crossed match, commit or release it with idempotency keys, reject stale versions, record ordered receipt facts, and keep the server as the owner of live transaction state.
  • Route real player, NPC, and outpost inventory/currency transfers through the Phase 10 service-mediated transaction protocol. The current proof is one regional market match: on fresh commit it debits player-local Aurelian chits once, decrements seller ash_farm field-ration stock once, accrues service-owned regional market fees once, credits service-owned ash_farm seller proceeds once, and delivers the committed quantity into the player inventory only when ordinary capacity can accept it. It does not yet move NPC stores, broader outpost inventories, durable currency/proceeds ledgers, profile ledger balances, or durable save records.
  • Add broader scenario coverage for crash-recovery state, receipt replay after restart, multi-client settlement, and player/NPC/outpost transfer effects. The current scenario path covers quote, reserve, idempotent retry, commit replay, stale-version rejection, no-cross partial release, explicit cancellation/release, fee withdrawal, bounded receipt-snapshot restore, and a bounded settlement side-effect snapshot-view replay. These are bounded recovery proofs, not durable persistence or a restart harness.

Phase 11c: Blueprint And Artifact Construction

Visible outcome: equipment and artifacts become authored constructions with traceable materials, skills, facilities, and enchantment limits.

  • Add blueprint records for craftable equipment, repair jobs, gate parts, relic containers, focus items, and lawful wards. Blueprints name required materials, facility class, skill/rank/circle gates, expected duration, cost, and output bounds.
  • Keep the first construction job proof service-mediated. The field-engineer gate repair job reserves materials at a generated facility, validates blueprint/facility/rank constraints, records ordered job facts, and either completes or releases the reservation. Currency escrow, job-time advancement, output inventory, and general crafting remain future work.
  • Add deterministic property-derivation primitives as a bounded result of base blueprint, material, facility quality, and paid cost. Full crafting job integration remains open.
  • Add artifact construction metadata for rare pieces whose authority matters: witness-sealed relic cases, warded cloaks, focus rings, route compasses, golem cores, and gate-stabilizer parts.
  • Add enchantment slot metadata and validation bounds. The constrained post-process gameplay remains open until construction jobs exist.
  • Add pure Rust unit tests for blueprint validation, material/property derivation, enchantment slot limits, facility/rank/circle gates, and missing or retired authority references.
  • Add service-side material reservation and stale construction job rejection for the bounded field-repair proof. The server owns per-session construction material stock, mutates holds/restores only for fresh outcomes, and keeps stale/version and idempotent replay behavior in the pure job-state model. Durable stock ledgers and broad crafting remain future work.

Phase 11d: Token-Budgeted Agent NPCs

Visible outcome: optional agent-controlled NPCs can feel reactive while staying bounded, auditable, and outside transcript-critical authority.

  • Use docs/proposals/llm-and-agent-proposal.md, docs/proposals/hosted-agent-swarm-proposal.md, docs/proposals/capos-repo-harness-engineering-proposal.md, and docs/research/hosted-agent-harnesses.md as grounding before any implementation.
  • Treat model output as dialogue or proposed action data. Mission-critical authority, custody, combat, market commits, rank rewards, and policy denials stay in deterministic services.
  • Add an NpcAgentBudget or equivalent service-owned quota: per actor, session, day, and model profile; input/output token limits; tool-call limits; cooldown; and exhaustion behavior.
  • Let NPCs spend quota on bounded chatter, optional hints, and outpost status summaries. Spending must be visible in logs/debug output for review.
  • Extend token-budgeted NPCs to personal routines, shop negotiation flavor, and festival reactions as fake-agent dialogue/proposed-action data only.
  • On quota exhaustion, fatigue, sleep schedule, or policy denial, the NPC should refuse in-world, for example: I'm tired. Going to sleep. The refusal must not be a hidden transport error.
  • Keep hosted-agent memory separate from authority. Long-lived NPC memory can record bounded facts and reflections, but only reviewed/compiled facts influence deterministic game services.
  • Add tests with a deterministic fake model that proves quota decrement, quota exhaustion refusal, no authority mutation from free text, and stable transcript output when agent NPCs are disabled.

Current shape: agent NPC budget metadata is disabled-by-default for Iunia, Livia, and Maro; a deterministic fake-model turn function drives bounded chatter/hints/outpost-summaries plus routine/shop/festival flavor, decrements quota, and refuses in-world for quota/fatigue/sleep/cooldown/policy blocks. Live LLM calls, hosted-agent service execution, durable memory service, autonomous NPC actions, and any transcript-critical model gameplay remain open.

Phase 12: Multiplayer, Parties, And Lawful PvP

Visible outcome: shared multiplayer authority works correctly across local multi-client play first, with cross-device play following on the same service-mediated boundaries. Players can party up, delegate scoped authority, assist each other, and engage in lawful PvP without leaking private inventory or allowing ambient harm.

  • Do not start this phase until Adventure and chat authority use session-bound caller identity, or future broker-granted service facets, rather than receiver-selector identity. The first bounded slices key local player labels from live caller-session metadata.
  • Add Expedition or equivalent shared state only when the single Adventure service cannot cleanly model party authority. The first bounded slice keeps deterministic party state inside Adventure because no cross-service coordinator is needed for local create/invite/accept/ leave/delegate/assist records.
  • Add party verbs: create, invite, accept, leave, and delegate. Implemented for service-created local labels (e.g. player-1) derived from live caller-session keys, not caller-selected badges or global session data.
  • Add assist <player> with <task> for deterministic cooperative action. Implemented for the first detect-ward assist record, requiring party membership plus delegated ward-writ; it records scoped service-owned state and grants no unrelated inventory authority.
  • Route first player-to-player physical-item transfer through a single-owner atomic mutation path inside Adventure. Implemented as transfer <item> to <player> for service-local player labels: both players must be in the same party, eagle-standard relic custody is refused, and source/target inventories mutate atomically.
  • Add currency escrow and broader two-party trade/custody transfer protocol only after the economy model and multi-client proof harness justify it; user-owned backup capsules must not be transfer authority.
  • Add a two-client QEMU proof with two service-created player objects, one shared party, one delegated writ, and one assist. The proof must use two distinct live caller-session keys for Adventure cap calls, not manifest receiver selectors or user-chosen identity. Still open: the focused Adventure manifest does not yet provide a reliable two-client launcher/session harness, so real cap-client assertions stay at one-client party surface coverage and complex transitions are covered in pure Rust.
  • Keep PvP opt-in: duel, spar, contest, or bounty authority must exist before harmful verbs can target another player.
  • Add denial text for unauthorized player harm that names the missing lawful conflict authority. attack <player-label> refuses known local player labels with text naming the missing duel/contested-yard authority. Duel/spar/contest/bounty authority remains future work.

Sequencing note:

  • The first slice is local multi-client (two clients on one capOS instance) because that is the cheapest deterministic proof. Cross-device multiplayer is on-roadmap and lands once the local authority model is correct and CloudGameStore carries shared expedition/ledger state.
  • Network-transparent multiplayer (full federation across capOS instances) stays separate from this phase and follows the broader networking work.

Future Phase: Parallel Universes And Worldline Federation

Visible outcome: separate capOS-hosted Aurelian worlds can expose alternate seeded worldlines and limited cross-world interaction without making remote instances trusted authorities for local inventory, relic custody, profile standing, or market settlement.

This is deliberately after local multiplayer, durable ledger/profile state, service-owned market/escrow, and basic networking. The near-term shape is not a single shared MMO world. It is a federation of sovereign worldlines, each with its own content release, worldline id, seed epoch, generated overlays, ledger head, market policy, and profile-import rules.

  • Add a WorldlineSeed model whose outputs are deterministic artifacts: generated regional overlays, seasonal economy tables, event schedules, market starts, outpost production, route hazards, optional encounters, loot caches, and bounded NPC routine variants.
  • Keep authored anchors static: factions, core law, major sites, named relics, capability interfaces, canonical proof missions, and security policy. Seeded generation may vary conditions around those anchors, but must not mint new authority classes or bypass service-owned validation.
  • Store provenance for every admitted generated artifact: content release id, worldline id, seed epoch, generator version, scope label, provenance hash, and bounded output size.
  • Add pure deterministic generator tests before gameplay integration: same seed produces the same artifacts, different seeds produce bounded variation, invalid generated references are rejected, and generated outputs remain sorted/stable for proof transcripts.
  • Add fixed-seed QEMU proof once the generator exists. The smoke path should still use pinned selections until generator coverage is strong enough.
  • Define WorldlineDirectory, WorldlineVisit, WorldlineExpedition, WorldlineTransfer, and WorldlineAudit service surfaces as local facade caps over remote protocol messages. Do not serialize raw cap slots, endpoint generations, global session ids, or local player labels as portable authority.
  • Start with echo-only federation: list a remote or second local worldline, inspect content/seed/ledger metadata, and view public state without mutation authority.
  • Add a denial proof for cross-world relic transfer before implementing successful transfer: eagle-standard transfer must fail until custody escrow, remote policy, dual-ledger receipts, content compatibility, and replay protection exist.
  • Later, add envoy visits and expedition bridges. Projected remote characters may observe, chat, or perform explicitly granted low-risk actions; spending home-world inventory or importing rewards requires a transfer/settlement receipt.
  • Treat cross-world markets and migration as receipt-verified claims. Remote order views, faction standing, rank, contributor rewards, and custody history require local policy gates before they affect local authority.

Feasibility note: this is feasible if it is built as capability federation plus deterministic worldline generation. It is not feasible as “trust another capOS instance’s save file” or “transfer local caps over the network.” Cross-world state changes need the same reserve/escrow, commit/release, stale-version rejection, idempotency, and ledger receipt discipline already planned for local markets and trades.

Phase 13: Contributor Quest Mechanics

Visible outcome: after the base Aurelian game has stable profiles, evidence, debriefs, and cosmetic rewards, the game can recognize real capOS development work through maintainer-witnessed outer-world quests.

  • Use docs/proposals/contributor-quest-mechanics-proposal.md as the design source for this phase.
  • Keep all rewards cosmetic, narrative, reputational, or bounded game-only perks unless a separate reviewed security design grants authority.
  • Use full GitHub issue and PR URLs, commit hashes, issuer identity, and timestamps in contribution evidence records.
  • Add manual quest and witness records before any read-only forge connector.
  • Add QEMU proof that witnessed contribution evidence mints a badge or decoration, while an unwitnessed claim does not.

Cut scope:

  • No automatic GitHub mutation, no token handling in the game client, no public leaderboard that pressures maintainers or security reviewers, and no reward that grants repository or OS authority.

Phase 14: Rich Browser Adventure Client

Visible outcome: after WebShellGateway, session-bound game authority, profiles, persistence, and the core game loop are stable, a browser-hosted adventure client presents the same game as a pixel-art interface with animated characters, location art, inventory panels, combat affordances, and chat/event feeds. The browser client should feel like a native game client, not a terminal skin.

  • Treat adventure-client as the text/QEMU proof client and compatibility adapter. Do not route the rich browser UI through StdIO command lines.
  • Implement the web shell or WebShellGateway side as a capability-call proxy for the authenticated session. The gateway holds the real capOS caps and invokes only the allowed adventure/chat methods for that web session.
  • Keep the browser authority opaque: browser JavaScript receives web-session handles and typed DTOs, never raw capOS CapIds, badge selectors, provider credentials, shell spawn authority, game-world keys, or broad network capability.
  • Prefer narrow game-session objects such as AdventurePlayer and ChatParticipant with methods for look, movement, inventory, status, combat actions, orders, delegation, chat send/history, and bounded event polling. A generic CommandSession may coexist for terminal-style front ends, but it is not the required ABI for a purpose-built game UI.
  • Return structured view state and events suitable for rendering: current site, exits, actors, mobs, visible items, held/delegated authority, evidence, effects, party state, combat state, animation/event cues, and chat history cursors.
  • Represent the world as a 2D tilemap data model for browser presentation: maps, tilesets, tile layers, object layers, collision/interaction zones, spawn points, actor paths, region/outpost markers, and event triggers. Tiled JSON is an acceptable authoring/export candidate if validation rejects oversized maps, missing tiles, unknown layer types, and invalid object references.
  • Evaluate PixiJS plus @pixi/tilemap for the first rich client because it gives a WebGL-oriented 2D renderer and rectangular tilemap path with a canvas fallback. This is a client rendering choice, not game authority.
  • Keep all semantic validation in game services. Browser-side disabled buttons, command palettes, targeting hints, and animations are presentation only; the server still rejects missing authority, invalid location, stale state, bad custody, unsafe combat, and oversized input.
  • Use explicit asset manifests for pixel art, sprite sheets, portraits, tiles, VFX, UI sounds, and animation ids. Asset lookup must not grant game authority, and missing or mismatched assets must fail as presentation errors rather than game-state mutations.
  • Add a headless browser harness that authenticates through the web gateway, opens the rich client, drives one deterministic mission slice using UI actions, verifies rendered state transitions/events, and checks logout or tab-close teardown.
  • Add browser rendering checks for tilemap layer order, actor placement, viewport/camera bounds, collision affordance display, event feed updates, and no browser-side mutation of authoritative adventure state.

Blocked by:

  • WebShellGateway authentication, origin/TLS policy, session teardown, and bounded browser transport.
  • Broker-granted Adventure/chat authority or gateway-owned live caller-session mapping so web sessions do not depend on caller-selected receiver identity.
  • Persistent profile/ledger/checkpoint semantics for save/resume UX.
  • Stable core gameplay phases through at least authority inventory, relic custody, actor roles, combat, and debrief rewards.

Cut scope:

  • Do not make browser-rendered state authoritative.
  • Do not let browser UI bypass game service methods or mutate save records directly.
  • Do not require the rich browser client for QEMU proof coverage; the text client remains the deterministic low-dependency proof path.

Service Split Gates

Keep one adventure-server until there is a concrete proof value in splitting state. Split services only at these gates:

  • Mission service: when multiple clients or NPCs need shared expedition state independent of private player profiles.
  • Profile service: when rank marks, cosmetics, contributor badges, or settings must persist beyond one process lifetime.
  • Audit/Witness service: when relic custody, forbidden rites, and debrief evidence need a separate authority boundary.
  • Save store: when profile, ledger, or expedition state needs a shared adapter over RAM, local disk, or cloud backing.
  • Cloud bridge: only after local save/load semantics and stale-write rejection are proved behind AdventureSaveStore.
  • User-owned save vault: when private profile/export data should sync through a user’s browser or Google account without granting provider credentials to game services.
  • Market/Trade service: when two-party exchange or shop inventory becomes more than a deterministic local handler.
  • Expedition service: when parties, assists, duels, or contested sites need shared state and explicit consent capabilities.

Every new service split must include manifest grants and QEMU assertions for both allowed behavior and at least one rejected overbroad action.

Verification Gates

For each phase that changes behavior:

  • make fmt-check
  • make generated-code-check when schema or generated bindings change.
  • Generated content freshness check when mission source data or content generation changes.
  • Relevant host tests for content validation or pure logic.
  • Prefer pure Rust unit tests for complex deterministic game logic: calendar/season rules, resource tables, blueprint validation, market matching, escrow state machines, route constraints, and agent quota accounting.
  • Use a real Rust test client process calling game caps for complex scenario tests that cross service boundaries: custody, construction, market transactions, party assists, and regional economy flows.
  • Keep the current command-client transcript focused on basic command and client functionality: parsing, rendering, representative success/failure calls, and stable QEMU smoke proof. Do not make it the only coverage for complex game state machines.
  • Save-record encode/decode and migration tests when profile, expedition, ledger, or cloud-bridge persistence changes.
  • User-save capsule tamper/replay/wrong-profile tests when browser-mediated backup or restore changes.
  • make run-adventure with deterministic transcript assertions for the new behavior.

For content-only changes:

  • Content validator tests must pass.
  • Generated content freshness check must pass when content blobs are checked in.
  • If the content family uses mkmanifest cue-to-capnp, rerun the conversion with pinned CAPOS_CUE and CAPOS_CAPNP, decode or validate the produced Cap’n Proto message, and include the freshness check in make generated-code-check.
  • make run-adventure must still prove the visible mission path.

Do not claim the full adventure proposal is implemented until the Aurelian mission, authority inventory, actor roles, relic custody, debrief, and deterministic proof path all land.

Full-Scope Review 2026-06-09

Findings ledger for the full-scope review cycle completed at 2026-06-09 19:01 UTC. Eight independent subsystem reviews covered the tree at commit 50e8eaba (2026-06-09) against the previous review base bb776326e (2026-05-23). Each open finding below is remediated through a task record under docs/tasks/ whose source points here; severities are carried into task priority. Documentation-status findings (stale status wording, landed-behavior drift) were remediated directly in commit 3ac860dc and are not re-listed.

Scopes Reviewed

  1. Storage on-disk formats and mount validation (kernel storage caps, tools/mkstore-image).
  2. Storage services and installable-system flow (init generation/rollback, storage-persist-service, NVMe-backed BlockDevice).
  3. Kernel core and x86_64 architecture (fault handlers, TLB shootdown, ELF spawn, percpu/SMP/paging/IOAPIC/ISO reader).
  4. Device Driver Foundation authority (MMIO bounds, DMA-buffer release invariants, device-manager proof gating).
  5. Remote-session Web UI and network-facing services.
  6. Schema, generated bindings, and System Manual.
  7. Userspace runtime and POSIX adapter (capos-rt, libcapos-posix).
  8. Fuzzing, host-test harnesses, tooling, and CI workflows.

Findings By Scope

1. Storage on-disk formats and mount validation

  • High — kernel/src/cap/persistent_store.rs:parse_disk_store, kernel/src/cap/writable_fs.rs:mount_volume: live extents are validated only against the data region, not against next_free_sector or each other. A crafted or torn image with a live extent in the bump-allocator free region mounts cleanly and is silently overwritten by the next put_blob / persist_file; compact_reclaim’s shadow-generation copy into the data-region tail clobbers such extents mid-copy; overlapping live extents are accepted.
  • Low — writable_fs.rs:mount_volume (also readonly_fs, persistent_store): duplicate sibling names are silently collapsed by BTreeMap insert instead of failing the mount.
  • Low — tools/mkstore-image:write_caposwf1_dir_node / write_caposwf1_file_node: name-length assertion uses WF_NODE_RECORD_BYTES - WF_NODE_OFF_NAME (104) instead of the kernel’s MAX_DISK_NAME_BYTES (88).
  • Medium — persistent_store.rs:DiskStoreCap::get_blob: returns disk bytes trusting the entry table without re-verifying content_hash(bytes) == key; init fetches generation objects by hash from this store, so a disk-level edit swaps active system-config content undetected.

2. Storage services and installable-system flow

  • Medium — demos/storage-persist-service/src/bin/server.rs:commit: overwrites the single payload region in place before the superblock write; a crash mid-payload-write destroys the previously committed snapshot and wedges startup. The doc comment overclaims torn-write safety; this service is the named production storage route.
  • Medium — init/src/main.rs:read_candidate_pointer / decide_boot_generation: a corrupt or truncated gen-candidate marker parses to Err and fails boot closed (the CREATE|TRUNCATE marker rewrite persists a durable size-0 window), contradicting the “a bad generation can never permanently brick the system” guarantee.
  • Medium — kernel/src/cap/block_device.rs:NVME_ARBITRARY_NAMESPACE_BLOCKS: hardcodes the 16 MiB QEMU fixture geometry (32768 blocks) on the always-built NVMe arm; BlockDevice.info and the filesystem/store BlockSource::info repeat it, so larger real namespaces are unreachable. kernel/src/nvme_storage_backend.rs “production” wording omits the bounded sync-io seam (64 ops/boot, wedges on CQ wrap).

3. Kernel core and x86_64 architecture

  • High — kernel/src/arch/x86_64/idt.rs:page_fault_handler / gp_fault_handler / invalid_opcode_handler: CPL3 faults halt the whole machine; the “no task abstraction yet” rationale is stale now that sched::exit_current_thread and process exit cleanup exist. Any userspace null deref is a full-system denial of service.
  • Medium — kernel/src/arch/x86_64/tlb.rs:kernel_tlb_shootdown_all: the remote ack uses a CR3-reload flush, which under CR4.PGE does not evict GLOBAL entries — the very kernel upper-half/MMIO mappings it exists for. Safe for the sole current caller (fresh non-present→present installs), but tlb.rs and mem/paging.rs advertise unmap/revoke reuse.
  • Medium — kernel/src/spawn.rs PT_LOAD mapping: PF_W|PF_X segments map PRESENT|USER|WRITABLE without NX; capos_lib::elf does not reject W+X.
  • Low (bundle) — percpu.rs:current_cpu_id unwrap_or(0) masquerades unknown LAPIC ids as the BSP; smp.rs AP_CPUS spin-mutex IF constraint undocumented; mem/paging.rs:map_kernel_physical_range partial failure leaks installed PTEs and the VA window; ioapic.rs:write_destination restores mask from the cached record, not hardware; mem/validate.rs legacy validate_user_buffer is dead code; kernel/src/iso/mod.rs ISO_BOOT_SOURCE mutex held across a full polled-PIO ELF transfer; capos-rt/src/panic.rs emergency console write can race a live SQ producer.

4. Device Driver Foundation authority

  • Medium — kernel/src/virtio_transport.rs:MmioRegion: volatile accessor bounds are debug_assert!-only and the kernel ships release, so the documented “range-checks before reaching device MMIO” contract is false in shipped builds; some regions claim the full BAR length while only a MAPPED_COMMON_CFG_LIMIT prefix is mapped.
  • Medium — kernel/src/device_manager/stub.rs: detach_dmabuffer_record_for_cap_release_with_reason: the pinned-enabled-vring refusal, RX-DMA quarantine, and autonomous-MSI-X/NVMe handoff blocks live in per-proof cfg islands, while the invariant — never free a frame the device may still master — is production DMA-lifetime behavior.

5. Remote-session Web UI

  • Medium — demos/remote-session-web-ui/src/main.rs:do_login: no login rate limiting and no accepted.peer_addr check; loopback-only is enforced solely by topology plus forgeable Host/Origin headers, weaker than the host bridge sibling. /api/probe/expire and /api/probe/stale-call proof seams ship unconditionally in the production-named binary.

6. Schema, generated bindings, and System Manual

  • Medium — schema/capos.capnp SymmetricKey..CertVerifier block uses leading-style doc comments; capnp attaches docs to the preceding declaration, so every comment shifts one method in the checked-in bindings and the System Manual ships misattributed descriptions. The manualc coverage gate is interface-level only, so it passes.

7. Userspace runtime and POSIX adapter

  • Medium — capos-rt/src/ring.rs:pack_copy_transfers: computes params_offset from the Vec’s as_ptr before into_boxed_slice may realloc, invalidating the computed alignment; currently saved only by undocumented allocator behavior, and the existing alignment test passes vacuously under the 16-aligned host allocator.
  • Medium (bundle) — libcapos-posix: dup/dup2/F_DUPFD snapshot pos per slot instead of sharing the open-file-description offset (src/fd.rs); poll/select ignore the timeout entirely so infinite timeout returns 0 and callers busy-spin (src/poll.rs); errno::clear() on shim entry violates C11 §7.5; F_SETFL accepts-and-ignores O_NONBLOCK then read blocks forever. No #[cfg(test)] host unit tests exist in the crate.

8. Fuzzing, harnesses, tooling, and CI

  • Medium — fuzz/fuzz_targets/manifest_capnp.rs fuzzes a 4096-word/16-deep envelope while production default_reader_options allows 64 Mi words and nesting 32; the ISO 9660 record/PVD parser, the CAPOSRO1/CAPOSST1/CAPOSWF1 mount parsers, the capos-tls DER validity walk (capos-tls/src/cert.rs:parse_validity), and storage-persist-service:deserialize_state/parse_superblock have no fuzz or host coverage.
  • Low (bundle) — CI Miri step soft-skips when the component is missing (.github/workflows/ci.yml); publish-crates.yml cargo publish --no-verify is uncommented; the sqe_validation fuzz PARK_BENCH arm is permanently reject-only without a measure-feature fuzz build; capos-wasm/src/wasi/fs.rs:install_preopen discards its try_reserve_exact result.

Spawned Task Records

All records carry source: docs/backlog/full-scope-review-2026-06-09.md:

  • review-storage-mount-extent-placement-validation (high)
  • review-storage-store-get-hash-verification
  • review-storage-persist-service-crash-safe-commit
  • review-installable-torn-candidate-fallback
  • review-storage-nvme-identify-geometry
  • review-kernel-user-fault-containment (high)
  • review-kernel-tlb-global-shootdown-ack
  • review-spawn-wx-segment-rejection
  • review-ddf-mmio-region-release-bounds
  • review-ddf-dmabuffer-detach-invariant-hoist
  • review-webui-inguest-login-hardening
  • review-schema-crypto-doc-attribution
  • review-fuzz-parser-coverage
  • review-capos-rt-transfer-pack-alignment
  • review-posix-fd-semantics
  • review-kernel-arch-hardening-lows (low bundle)
  • review-tooling-ci-lows (low bundle)

Proposal Index

This page classifies proposal documents by current role so readers do not confuse implemented behavior, active design direction, future architecture, and rejected alternatives.

The sidebar nests long proposal documents under this index so the public site opens as a current-system manual instead of an archive dump. Use this table as the first status checkpoint before opening a long proposal.

Current design authority lives in Current Design Authority. Proposal files are design history or active design records; when a proposal is implemented, future technical changes should update the stable current-design page first.

Lifecycle classes used below:

  • Implemented: shipped behavior; proposal is archival unless the status link or historical note is being corrected.
  • Accepted design: selected direction; implemented subsets need a stable current-design home.
  • Partially implemented: some behavior is in tree; future/planned text must remain explicit.
  • Active design: unimplemented or near-term design record still available for planning. Older rows that say “Future design” are active design records with no current implementation unless the row says otherwise.
  • Superseded or Rejected: retained historical rationale, not current direction.
Proposal or decisionStable current-design authorityDisposition
Session-Bound Invocation ContextSession Context and IPC and EndpointsImplemented proposal is archival.
Error HandlingError Handling and Capability RingImplemented proposal is archival.
System ConfigurationConfiguration and Manifest and Service StartupImplemented proposal is archival.
DMA Assurance ModelDMA IsolationAccepted design is grounded in the stable DMA design page.

Active or Near-Term

ProposalStatusPurpose
Service ArchitecturePartially implementedDefines authority-at-spawn, service composition, exported capabilities, and the init-owned service graph direction.
Schema RegistryFuture designActive design record for runtime schema reflection as the machine-readable twin of the System Manual; no implementation yet.
Session Archive & Gantt EffortFuture designActive design record for session recap and planning-timeline effort records; retained as workflow design, not system behavior.
Task State and Agent TelemetryPartially implementedFile-per-task ledger, selected-milestone state, lifecycle directories, and the tools/vibe-loop-capos-tasks adapter are implemented; generated checked-in views and tracker sync remain future.
Session-Bound Invocation ContextImplementedArchival record for replacing caller-selected endpoint identity and the superseded service-object migration with one immutable session context per process. Current design authority is Session Context.
Storage and NamingAccepted designDefines capability-native storage, namespaces, boot-package structure, and future persistence instead of a global filesystem.
Error HandlingImplementedArchival record for the implemented transport/capability-exception/schema-result split. Current design authority is Error Handling.
Security and VerificationPartially implementedDefines the security review vocabulary, trust-boundary checklist, and practical verification tracks used by capOS.
DMA Assurance ModelAccepted designDefines the DMA authority model, invariants, and TLA+/Alloy/Kani/Loom evidence mapping that cloud and production driver backend claims must use before attended sign-off.
Device Manager RefactorImplementedSeparates the kernel device authority ledger from QEMU proof scaffolding while preserving one MMIO/DMA/IRQ ownership transaction for userspace-driver readiness; further registry, ledger, or proof-internal splits are optional risk-reduction follow-ups.
Cloud Driver Foundation Gap AnalysisSupersededRetained as a DDF coverage map; the central blocked virtio-net driver gap it tracked is closed and successor work lives in Phase C userspace NIC relocation and NVMe BlockDevice graduation records.
NVMe Model B Doorbell DMA ValidatorAccepted designRecords the conditional direct-remapping/vIOMMU validator model and explicitly excludes the current no-IOMMU bounce path.
Network-Reachable Datapath Scope DecisionAccepted designFixes the real-GCE-boot milestone’s reachable-network requirement to raw-frame TX/RX reachability, not a TCP/UDP socket round trip.
Phase C Userspace NIC Driver RelocationAccepted designActive Phase C design record for relocating the virtio-net driver into userspace over the landed device-authority surfaces.
Remote Session UI SecurityPartially implementedDefines the per-browser BrowserSession model, OWASP-style web hardening posture, cookie/CSRF/CSP/headers/Fetch-Metadata controls, and Tauri-wrapper capability-allowlist minimization for the trusted local remote-session-ui bridge; the loopback bridge now has per-browser cookies, CSRF checks, Host/Origin/content-type validation, first-wins ownership, and bounded HTTP parsing/threading.
mdBook Documentation SitePartially implementedDefines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages.
capOS Repository Harness EngineeringFuture designApplies OpenAI-style harness engineering to the capOS repository through agent-facing maps, run-target inventories, proposal metadata, decision records, compiled knowledge, and workflow evals.
capOS Agentic Development ExperimentFuture designDefines the longitudinal study design for using capOS development sessions, subagents, reviews, raw archives, and recap tooling as an agentic software-engineering experiment; initial tooling only exists today.
SMPAccepted designDefines the selected per-CPU Phase A direction plus later AP startup, multi-core scheduler, and TLB shootdown work.
Ring v2 For Full SMPFuture designDefines per-thread capability rings, completion routing, and SQPOLL ownership as the target transport model for full SMP.
Scheduler EvolutionAccepted designDefines the layered scheduler architecture. Phase D WFQ and Phase E SchedulingContext gates are accepted; Phase F SQPOLL/nohz/tickless idle, realtime islands, and EEVDF evaluation remain follow-on work.
Tickless and Realtime SchedulingFuture designDefines staged tickless idle, SQPOLL nohz CPU isolation, request deadline metadata, scheduling-context CPU-time authority, donation, and admitted realtime islands.
System Configuration and Operator ExtensibilityImplementedDefines operator-extensible CUE configuration. Slices 1-3 are closed, including defaults-package migration, system.local.cue overlay hooks, strict top-level manifest decoding, and the operator configuration how-to; Slice 4 adds mkmanifest cue-to-capnp for schema-aware CUE-authored data conversion.

Future Architecture

ProposalStatusPurpose
Real-Filesystem DecisionPartially implementedRecords the accepted role split between capnp-native managed state and read-only FAT32 host/interop images; several FAT and host-tool increments have landed.
Installable SystemPartially implementedDefines installed persistent capOS boot/config/update/rollback composition; the bounded local/QEMU data-region, overlay, generation, install, provision, and update/rollback smokes have landed. Secure boot/signing, production release authority, public ingress, provider breadth, and full durable account policy remain future work.
Standard App CapabilitiesFuture designDefines per-app AppData private storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as native, structural alternatives to Google Drive’s appData/Picker/role mechanisms.
Google Drive Storage BackendFuture designDefines using a Google-authenticated user’s Drive behind the standard storage caps, via a near-term browser-transport path and a gated native OAuth2/HTTP/TLS backend, with explicit remote-vs-local-cap trust semantics.
NetworkingPartially implementedRecords implemented kernel-internal virtio-net ping/HTTP smokes, kernel TCP capability objects, and the host-local Telnet shell demo; userspace NIC and network-stack decomposition remains blocked on production DMAPool/DeviceMmio/Interrupt authority.
capos-servicePartially implementedDefines a userspace service framework above capos-rt for lifecycle, endpoint serve loops, readiness, shutdown/drain, request/session context, metrics, and resource budgeting hooks. The first slice landed the standalone lifecycle crate and Telnet gateway wrapper; endpoint-loop helpers and richer supervision hooks remain future work.
Stateful Task and Job GraphsFuture designDefines durable stateful task/job graphs for init orchestration, IX-style package builds, operator work queues, and notebook-style run stories without making the graph coordinator a god object.
Resource Accounting and QuotasPartially implementedGeneralizes existing per-process ResourceLedger mechanisms to cross-service resource profiles, ledgers of record, quota donation, and fail-closed reservation semantics.
Memory Authority ModelFuture designDefines memory authority classes, residency, mapping consistency, TLB/frame-reuse rules, pinned/DMA/swap boundaries, and proof obligations before future shared-memory and device work build on the existing VirtualMemory and MemoryObject substrate.
OOM Handling and SwapFuture designDefines memory-pressure policy, explicit OOM outcomes, budgeted anonymous memory, and optional encrypted swap without an ambient OOM killer.
Cryptography and Key ManagementPartially implementedMinimal SymmetricKey, PrivateKey/PublicKey ABI, RAM XChaCha20+HMAC/P-256 cores, RAM-only KeyVault custody, and development KeySource bootstrap landed; production custody and persistence remain future.
Volume EncryptionFuture designDefines encryption-at-rest for system and user volumes, including passphrase, recovery, cloud KMS, and measured-boot-backed key sources.
Userspace BinariesPartially implementedDescribes native userspace binaries, capos-rt, Rust std, C/libcapos, C++, Go, Python, Lua, JavaScript/TypeScript, POSIX adapters, WASI host adapters, and runtime authority handling.
Go RuntimeFuture designPlans a custom GOOS=capos path, runtime services, memory growth, TLS, scheduling, and network integration for Go.
Lua ScriptingPartially implementedDefines Lua as an ordinary capability-scoped userspace runner with curated libraries, exact grants, and no ambient shell or POSIX authority; Phase 0 and Phase 1 host bindings are in tree, while Phase 2+ remains future work.
WASI Host AdapterPartially implementedDefines a capos-wasm userspace host adapter whose WASI imports are backed by typed capOS capabilities, with wasmi for v0 (Phases W.1–W.6), wasmtime/WAMR as W.7+ migration targets, and the Component Model as the typed-cap-handle path. Phase W.1 host-runtime scaffold landed 2026-05-05 19:12 UTC (capos-wasm/ standalone crate over vendored vendor/wasmi-no_std/wasmi-1.0.9/, make capos-wasm-build); Phase W.2 closed 2026-05-07 10:53 UTC across four sub-slices: sub-slice 1 (wasm-host binary + empty-instantiation smoke + userspace-image budget bump, 2026-05-06 20:19 UTC), sub-slice 2 (Preview 1 stdout-only import resolver in capos-wasm/src/wasi/preview1.rs plus probe-driven nosys=52 proof, 2026-05-07 08:03 UTC), sub-slice 3 (Rust hello, wasi smoke + manifest-payload load path, 2026-05-07 09:36 UTC), and sub-slice 4 (C hello, wasi smoke through system clang-18 + Ubuntu wasi-libc, 2026-05-07 10:53 UTC). make run-wasm-host / make run-wasi-hello-rust / make run-wasi-hello-c are the boot smokes. Phase W.3 (per-instance CapSet plumbing + LaunchParameters) and successor phases remain future design.
POSIX AdapterPartially implementedDefines a two-layer C substrate (libcapos thin Rust staticlib, libcapos-posix POSIX surface on top) whose POSIX wrappers are backed by typed capOS capabilities. P1.1 closed at merge fe5f5208 (2026-05-05 13:28 UTC), P1.2 UDP + DNS smoke closed 2026-05-05 21:21 UTC, and P1.3 pipe + recording-shim fork-for-exec closed 2026-05-07 09:55 UTC; broad POSIX headers and a whole dns.c build remain future work.
POSIX fork/execve fd InheritanceImplementedRecording-shim execve inherits the parent’s live fd table by default with FD_CLOEXEC/O_CLOEXEC handling; only optional pre-spawn transferability refinement remains.
ShellPartially implementedDescribes native, agent-oriented, and POSIX shell models over explicit capabilities instead of ambient paths.
Remote Session CapSet ClientsPartially implementedDefines regular host apps, including CLI, native GUI, Tauri backends, webapp gateways, and agent runners, that authenticate to capOS, keep broker-issued remote CapSets in trusted client-side backends, call granted capabilities over Cap’n Proto RPC, and optionally grant bounded UI-composition caps back to capOS services. The first implementation slice proves this with a schema-framed DTO transport; standard capnp-rpc proxy transport remains future work.
SSH Shell GatewayPartially implementedDefines production remote CLI shell access through SSH while preserving the same TerminalSession and broker-issued shell-bundle boundary proven by the Telnet shell demo; focused QEMU proofs now cover the non-production SshHostKey, manifest-seeded AuthorizedKeyStore, public-key session bridge, unsupported-feature policy table, scoped listener, restricted shell launcher, and a bounded plain-TCP terminal-host wiring slice. Full OpenSSH transport remains future work.
Telnet over TLS ShellFuture optional designDefines a peer optional remote-shell path to the SSH gateway: TLS 1.3 over the existing Telnet TerminalSession handoff, with mTLS client certificates as the recommended user-auth path and CredentialStore passwords as fallback. Reuses the project’s PKI/ACME/cert-rotation track instead of inventing a parallel SSH-only key-management story. Smaller protocol surface than SSH; different operational profile, not the default main access interface.
Language Models and Agent RuntimeFuture designDefines language-model and embedder capabilities, local and remote backends, capOS-side agent runners, and browser-agent UI orchestration through gateway-enforced tool execution.
capOS-Hosted Agent SwarmsFuture designDefines OpenClaw-like hosted personal agents, swarms, harness controls, task workspaces, agent memory/wiki services, MCP/A2A-style adapters, and the research agenda for capability-scoped background agents.
Enterprise Agent Game ShowcaseFuture designPositions a playable business simulation as the capOS enterprise-agent showcase: agents manage procurement, finance, operations, logistics, markets, and audit under OS-enforced capability policy.
Chat As Multimedia SubstrateFuture designDefines Chat as a unified text/audio/video transport for human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping for browser surfaces, so new messaging surfaces do not require new top-level capabilities or gateway DTOs.
Realtime Voice Agent ShellFuture designExtends the agent-shell path for native realtime audio models, direct browser provider media, and browser-agent UI sessions while preserving broker-mediated tool execution and web-shell session boundaries.
Interactive Command SurfacesFuture designDefines structured command sessions for native interactive applications so familiar text commands compile to typed invocations instead of application-owned StdIO parsers.
Userspace Authority BrokerFuture designProposes moving shell bundle policy out of the kernel and making shutdown an init-owned lifecycle control capability granted only after login.
Aurelian FrontierPartially implementedCapability-native persistent-world RPG on a Roman-inspired magical frontier. Current proof slice covers the deterministic mission, command discoverability, typed room view, CUE-sourced content with make generated-code-check freshness, resume cap, Phase 9 rank/skill/standing gates, Phase 10 market quote/buy/sell/trade/repair, Phase 11 session-keyed player state with fixed-smoke seed/variant metadata, Phase 11a calendar/festival/military event status plus the seasonal quartermaster ration purchase, Phase 11b regional delivery with bounded inventory capacity, player-local chit currency, seller-outpost stock, service-owned market fee accrual/withdrawal, seller-outpost proceeds, order expiry, Phase 11c construction material holds/restores plus the receipt snapshot proof, Phase 11d disabled-by-default fake-agent budget/dialogue, Phase 12 party labels/verbs and physical-item transfer, the settlement snapshot proof, and the eagle-standard/gate-seal/temple-seal/under_vault interactive transcript. See the runnable proof slice for current commands and coverage. Production seeds, two-client multiplayer transfer escrow, PvP consent authority, durable ledgers, full economy behavior, and a 2D tilemap browser client remain future work.
Contributor Quest MechanicsFuture designDefines a post-adventure follow-up where maintainer-witnessed open-source contributions can mint cosmetic badges, states, decorations, and bounded game perks without granting repository or OS authority.
Public Release and Maintainer BoundariesFuture designDefines the release posture, security-audit disclaimer, issue/PR intake limits, maintainer-load boundaries, and the adventure-repository-split and git-history-rewrite hygiene gates required before making the repository public. Defers the long-term sibling-repository rule to the Repository Composition proposal.
Repository CompositionFuture designDefines the scope rule for the capOS core repository, the list of tracks (adventure, whitepaper, public site, userspace netstack, remote-access services, protocol stacks, language runtimes, GPU, agent shell, cloud images, volume crypto) that should ship as siblings, the when-to-split criteria, the cross-repository mechanics, and the intended cap-os-dev GitHub organization placement.
Boot to ShellPartially implementedDefines text-only console and web-terminal login/setup, password verifier and passkey authentication, and the authenticated native shell launch path after manifest execution, terminal input, native shell, session, broker, audit, and credential-storage prerequisites are credible.
System Info CapabilityPhase 1 + Phase 2 implementedUnifies the system-wide informational capability (MOTD today; hostname, help topics, manpages later), moves banner printing into the shell, and has AuthorityBroker.shellBundle mint SystemInfo plus profile-scoped chat/adventure service endpoint caps for operator shells. Guest and anonymous shells receive no service endpoints by default.
System Manual CapabilityPartially implementedA built-in man-pages analog: shell man/apropos, self-served web-UI doc viewer, schema-derived section-2 description proofs, and programmatic API/agent-export consistency are settled, with remaining follow-ups described in the proposal.
System MonitoringFuture designDefines capability-scoped logs, metrics, health, traces, crash records, and audit/status views.
Time and Clock AuthorityPartially implementedDefines WallClock and ClockDiscipline; Phase 1 WallClock read/provenance is landed, with trusted/network-synchronized time still future.
Debug and Trace AuthorityFuture designCapability-scoped process-attach, read-only cap-table inspection, ring-trace capture, and sampler authority with explicit consent and audit; no ambient ptrace analog.
Hardware Audit Log PersistencePartially implementedStore-inventory segment retention, retained-window recovery, hash-chain evidence, manifest reader admission, a local persistent-store reboot proof, development-source RAM-local HMAC segment seals, and explicit runtime-reader refusal have landed; external key custody, production rotation/revocation, rollback policy, and authority-broker runtime admission remain future.
Crash Recovery and SupervisionFuture designDefines stale-cap DISCONNECTED propagation on unplanned process death, structured crash records appended to the supervisor’s AuditLog, bounded restart policy with crash-loop detection, watchdog liveness, and degraded-boot fallback.
System Performance BenchmarksFuture designDefines correctness-gated primitive, workload, and user-story benchmarks for comparing capOS with other operating systems without distorting capability semantics.
HPC Parallel Processing PatternsFuture designExtends benchmark planning from static SMP/thread scaling proofs to generic single-node and multi-node parallel pattern coverage: map/reduce, task pools, barriers, scans, stencils, dense/sparse kernels, graph frontiers, pipelines, and collectives.
Scientific Standard Package and Agent Lab CapabilitiesFuture designDefines a curated scientific service graph for CAS, numerical computing, solvers, proof assistants, notebooks, package closures, provenance, and LLM agent research-lab workflows.
User Identity and PolicyPartially implementedDefines users, sessions, guest profiles, and policy layers for RBAC, ABAC, and MAC over capability grants. Current implementation has anonymous/operator/guest UserSession metadata, bootstrap credential/session flows, broker-issued shell bundles, and seed-account configuration; durable accounts, external bindings, session revocation, quotas, and broader ABAC/MAC remain future work.
Delegated Subject ContextFuture designDefines bounded act-on-behalf-of subject context as separate from capability transfer and from the completed session-bound invocation context milestone.
Default User AvatarPartially implementedDeterministic default user avatar derived from a stable account identifier, with the shell-side default mapping implemented and schema-carried avatar caps plus durable overrides still future work.
Cloud MetadataFuture designDescribes cloud instance bootstrap through metadata/config-drive capabilities and manifest deltas.
Cloud DeploymentPartially implementedRecords QEMU boot, serial output, ACPI/PCI/MSI-X discovery work, the landed cloudboot image/harness, the first GCP imported-image serial-console boot proof, and the GCP-first usable-instance provider rollup; public L4/SSH/WebShell ingress, broader storage variants, cloud clocking, production cloud-image release, AWS/Azure proofs, and aarch64 deployment remain future work.
Live UpgradeFuture designDefines service replacement without dropping capabilities or in-flight calls through retargeting and quiesce/resume protocols.
GPU CapabilityFuture designSketches capability-oriented GPU, CUDA, memory, and driver isolation models.
capOS As A Robot BrainFuture designDefines capability-oriented robotics service graphs, actuator gateways, safety monitors, realtime control islands, and ROS 2/micro-ROS/MAVLink/OPC UA bridges.
Formal MAC/MICFuture designDefines a formal mandatory-access and mandatory-integrity model plus future proof obligations.
Browser/WASMFuture designExplores running capOS concepts in a browser using WebAssembly and worker-per-process isolation.
Browser Capability and Agent Web SessionsFuture designDefines browser profiles, a cap-native document-engine middle track, visual browsing after GUI, and earlier agent/shell browser sessions as capability-scoped services over external or native browser backends.
Certificates and TLSPartially implementedPhase 1 dependencies, host verifier, minimal signing keys, RAM-only vault custody, and development KeySource bootstrap have landed; TLS and ACME remain future.
OIDC and OAuth2Future designDefines federated login, OAuth2 clients, typed token capabilities, JWKS, DPoP, token-exchange workload identity federation, and the broker integration for scopes/claims as ABAC input.

Rejected or Superseded

ProposalStatusPurpose
Endpoint Badges as Service IdentityRejectedPost-mortem for the seL4-style endpoint badge identity model that was superseded by Service Object Capabilities, then by Session-Bound Invocation Context.
Service Object CapabilitiesSupersededHistorical service-minted object capability model; the landed synthetic routing/lifecycle proof remains low-level coverage, but the implemented replacement is Session-Bound Invocation Context.
Cap’n Proto SQE EnvelopeRejectedRecords why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves.
Sleep(INF) Process TerminationRejectedRecords why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work.

Maintenance

When a proposal becomes implemented, rejected, or stale, update this index in the same change that changes the proposal or corresponding implementation. If the proposal is implemented, also update or create the stable current-design page named by Current Design Authority. Long proposal files may describe target behavior; this index is the first status checkpoint before a reader opens those documents.

Proposal: Capability-Based Service Architecture

How capOS processes receive authority, compose into services, and expose layered capabilities — without a service manager daemon.

Problem

Traditional OSes grant processes ambient authority (file system, network, IPC namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor). Service managers like systemd handle dependencies, lifecycle, and resource limits through a central daemon with a massive configuration surface.

capOS inverts this: processes start with zero authority and receive only the capabilities they need. The capability graph implicitly encodes service dependencies, resource limits, and access control. No central daemon required.

Process Startup Model

A process receives its entire authority as a set of named capabilities at spawn time. There is no ambient authority to fall back on — if a capability wasn’t granted, the operation is impossible.

The child process sees its granted capabilities by name. It cannot discover or request capabilities it wasn’t given.

Capability Layering

Each process consumes lower-level capabilities and exports higher-level ones. Authority narrows at every layer:

Kernel
  │
  ├─ Nic cap (raw frame send/receive for one device)
  ├─ Timer cap (monotonic clock)
  ├─ DeviceMmio cap (one device's BAR regions)
  └─ Interrupt cap (one IRQ line)
       │
       v
NIC Driver Process
  │
  └─ Nic cap ──> Network Stack Process
                   │
                   ├─ TcpSocket cap (one connection)
                   ├─ UdpSocket cap (one socket)
                   └─ NetworkManager cap (create sockets)
                        │
                        v
                   HTTP Service Process
                     │
                     ├─ Fetch cap (any URL)
                     │    │
                     │    v
                     │  Trusted Process (holds Fetch, mints scoped caps)
                     │
                     └─ HttpEndpoint cap (one origin)
                          │
                          v
                     Application Process

The application at the bottom holds an HttpEndpoint cap scoped to a single origin. It cannot make raw TCP connections, send arbitrary packets, or touch any device. The capability is the security policy.

HTTP Capabilities

Two levels of HTTP capability: Fetch (general) and HttpEndpoint (scoped). HttpEndpoint is implemented by a process that holds a Fetch cap and restricts it.

Fetch

Unrestricted HTTP access — equivalent to the browser Fetch API. The holder can make requests to any URL. This is the base capability that HTTP service processes use internally.

interface Fetch {
    # General-purpose HTTP request to any URL.
    request @0 (url :Text, method :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

struct Header {
    name @0 :Text;
    value @1 :Text;
}

Fetch is powerful — granting it is roughly equivalent to granting arbitrary outbound network access. It should only be held by service processes that need to make requests on behalf of others, not by application code directly.

HttpEndpoint

A restricted view of Fetch, scoped to a single origin. The holder can only make requests within the bounds encoded in the capability.

interface HttpEndpoint {
    # Request scoped to this endpoint's origin.
    # Path is relative (e.g., "/v1/users").
    request @0 (method :Text, path :Text, headers :List(Header), body :Data)
        -> (status :UInt16, headers :List(Header), body :Data);
}

Note: same request() signature as Fetch, but path instead of url. The origin is implicit — bound into the capability at mint time.

Attenuation

A process holding Fetch mints HttpEndpoint caps by narrowing authority. The core restriction is always origin — Fetch can reach any URL, HttpEndpoint is locked to one host. Additional constraints (path prefixes, method restrictions, rate limits) are possible but are userspace policy details, not OS-level concerns.

This is the standard object-capability attenuation pattern: same interface, less authority. The application code is identical whether it holds a broad or narrow HttpEndpoint.

Boot and Initialization Sequence

The kernel doesn’t know about services. It boots, creates a handful of kernel-provided caps, and spawns exactly one process: init. Everything else is init’s responsibility.

Current State vs Target State

The implementation has crossed the single-init startup milestone and the 15.4 schema split. SystemManifest now carries schemaVersion, binaries, initConfig, and kernelParams. The Cap’n Proto schema no longer exposes ServiceEntry, ServiceCapSource, CapRef, exports, or restart policy as kernel-consumed fields. Those service-graph concepts remain as Rust parsing types inside capos-config because the focused init executor still interprets initConfig.services.

Each process now also carries an immutable session context produced at spawn time by kernel/src/session_context.rs; default inheritance comes from the parent’s session context, and a broker can select a child session through the AuthorityBroker/UserSession path. This invocation context is the basis for session-scoped audit attribution and identity-policy enforcement; see User Identity and Policy and make run-session-context for the one-session-per-process proof.

Current manifests put the first process description at initConfig.init. The default system.cue manifest now boots the separate init binary with BootPackage and ProcessSpawner; that init process reads initConfig.services and starts the shell, remote-session CapSet gateway, chat server, and resident demo services. Focused shell-led manifests such as system-smoke.cue and system-shell.cue still boot capos-shell as the lone init process for narrow login/shell proofs. Focused init-executor manifests such as system-spawn.cue, system-chat.cue, and system-adventure.cue boot the separate init binary with BootPackage and ProcessSpawner; that init process reads initConfig.services and resolves the remaining service graph through ProcessSpawner. Other focused single-service or harness manifests still boot a demo/service binary as the init process for narrow proofs. The kernel validates only the kernel-owned boot boundary: schema version, binaries, kernelParams, initConfig.init.binary, and kernel-sourced initConfig.init.caps.

Current Bootstrap Ownership Inventory

As of 2026-05-13, the repo is in the schema-split init-owned startup state:

  • schema/capos.capnp defines SystemManifest as schemaVersion, binaries, initConfig, and kernelParams. Service graph fields are not Cap’n Proto schema fields.
  • capos-config/src/manifest.rs still defines ServiceEntry, CapRef, CapSource::Kernel, CapSource::Service, and RestartPolicy as internal Rust types for parsing initConfig.services.
  • tools/mkmanifest still embeds every declared binary into the manifest and validates the full init-owned graph before writing manifest.bin.
  • capos-config/src/validation.rs separates kernel bootstrap validation from init graph validation. Kernel bootstrap validation covers binary names, initConfig.init.binary, init kernel cap sources, and kernelParams. Full graph validation covers initConfig.services for mkmanifest and init’s metadata-only ManifestBootstrapPlan path.
  • kernel/src/main.rs::run_init reads the Limine manifest module, validates the kernel-owned bootstrap contract, configures serial policy from kernelParams, and loads only initConfig.init.binary.
  • kernel/src/cap/mod.rs::create_boot_service_caps builds only initConfig.init.caps. Those caps are kernel-sourced by type, so the kernel has no CapSource::Service branch.
  • The init cap bundle is currently described by initConfig.init.caps. In the default system.cue manifest this grants the separate init binary the bootstrap caps it needs to read BootPackage and spawn the service graph. In focused shell-led manifests such as system-smoke.cue, this still grants capos-shell terminal, credential, session, audit, and broker capabilities directly. In focused single-service or harness manifests, initConfig.init.caps grants only the capabilities the harness itself needs.
  • BootPackage exposes the full serialized manifest bytes to init. That path is live for default and focused init-executor manifests. Focused shell-led manifests do not grant BootPackage to capos-shell.
  • ProcessSpawner owns the embedded binary set. It receives the boot manifest bytes so delegated ProcessSpawner grants can preserve that same boot package context; child BootPackage caps are not minted from SpawnGrantSource::Kernel. ProcessSpawner.createPipe(bufferBytes) mints a bounded SPSC kernel Pipe capability used by the POSIX adapter Phase P1.3 recording-shim fork-for-exec path; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4.
  • ProcessSpawner.spawn resolves SpawnGrantSource::Kernel for the bounded manager-issued DDF authority surfaces (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) through the matching grant-source records in kernel/src/cap/devicemmio_grant_source.rs, kernel/src/cap/dmapool_grant_source.rs, and their interrupt/audit peers. Each grant attaches a fresh manager-owned record, validates owner/quiesce/ scrub state for DMA-side caps, and returns a child-local handle without sharing the parent’s owner object. See device-driver-foundation.md Task 5 for the bounded-authority scope and the focused make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.
  • init/src/main.rs is the focused BootPackage executor. When that binary is the init process, it reads the BootPackage manifest, builds a ManifestBootstrapPlan, validates it again, discovers its own kernel grants from initConfig.init.caps plus the CapSet, preflights the initConfig.services graph, resolves kernel and service cap sources, records exports, spawns children through ProcessSpawner, and waits on their ProcessHandles.
  • system.cue, system-smoke.cue, system-spawn.cue, system-chat.cue, system-adventure.cue, and the other focused manifests now express their first-process bundle under initConfig.init and any child topology under initConfig.services.

The practical cleanup boundary is therefore not “move service startup to init”; that already happened. The current cleanup target is narrower: the kernel no longer understands the service graph as a bootstrap authority structure. The remaining future cleanup is to stop letting focused harnesses choose arbitrary init binaries and direct kernel cap bundles, then move to one fixed generic-init ABI.

Narrowed Transitional Contract

The current schema is schemaVersion, binaries, initConfig, and kernelParams. The narrowed kernel contract is:

  • The kernel validates schemaVersion, parses kernelParams for kernel-consumed boot policy, and configures serial policy.
  • The kernel resolves only initConfig.init.binary against binaries and loads only that ELF.
  • The kernel may interpret initConfig.init.caps only as the bootstrap cap bundle for the single first process. Those caps must be kernel-sourced; a service-sourced cap in initConfig.init.caps is invalid because no non-init service exists at kernel handoff time.
  • initConfig.services[*], their caps, exports, restart, and any CapSource::Service references are init-owned configuration while the transitional Rust parser exists. mkmanifest and init continue validating them for smoke coverage, but kernel bootstrap does not run the multi-service graph validator or a service export resolver.
  • Focused harness manifests that intentionally boot a demo/service binary as init stay valid during this slice. Their harness-specific caps are still described by initConfig.init.caps until those smokes are migrated behind a generic init-owned executor config.

Kernel bootstrap implements this contract with a first-service cap-table builder. That builder covers only implemented kernel sources used by current initConfig.init.caps lists. That current first-service surface is wider than the eventual generic-init minimum: the default init-owned path needs Console, TerminalSession, CredentialStore, SessionManager, AuditLog, AuthorityBroker, BootPackage, ProcessSpawner, listener, launcher, and chat endpoint authority so it can launch the current service graph; focused shell-led paths still need TerminalSession, CredentialStore, SessionManager, AuditLog, and AuthorityBroker directly; focused harnesses need their own direct kernel caps. Cross-service export lookup, service-source attenuation, and non-init cap-resolution policy stay in init/src/main.rs for the focused BootPackage-executor manifests.

Target Boot Package Contract

After the harness migration, SystemManifest should keep the same outer shape but initConfig.init should stop being a per-manifest kernel bootstrap bundle. At that point:

  • ServiceEntry, CapRef, CapSource::Service, service exports, and restart policy remain ordinary data inside initConfig, interpreted and validated by init or a supervisor service.
  • Kernel validation is limited to the schema version, kernel parameters, boot-package integrity/measurement policy, and enough binary metadata to load the one init image.
  • The first process is the generic init/supervisor, not a demo harness or shell. Shell-led and focused single-service proofs should become init-owned configurations rather than alternate kernel bootstrap contracts.
  • The fixed direct kernel bundle for that generic init starts with Console, BootPackage, and ProcessSpawner in the currently implemented system. This is the target generic-init minimum, not the full transitional initConfig.init.caps surface. The architecture-level target also includes Timer, DeviceManager, FrameAllocator, and per-process VirtualMemory once those authorities are ready to be part of init’s stable bootstrap ABI. Until then, FrameAllocator, VirtualMemory, and Endpoint grants for child processes remain minted through ProcessSpawner spawn grants.

The target model removes the kernel-side service graph entirely. The manifest stops being a kernel authority graph and becomes a boot package delivered to init:

  • List of embedded binaries (init needs them before any storage service exists; they can’t be fetched from a filesystem that hasn’t started).
  • Init’s config blob (CUE-encoded tree; what to spawn, with what attenuations, with what restart policy).
  • Kernel boot parameters (memory limits, feature flags) consumed by the kernel itself, not forwarded to init.

The kernel spawns exactly one userspace process (init) with a fixed cap bundle:

  • Console — kernel serial wrapper (may be replaced later by a userspace log service, with init retaining a direct console cap for emergency use).
  • ProcessSpawner — only init and its delegated supervisors hold this.
  • FrameAllocator — physical frame authority for init’s own allocations.
  • VirtualMemory — per-process address-space authority for init.
  • DeviceManager — enumerate/claim devices; init delegates device-specific slices to drivers.
  • Timer — monotonic clock.
  • BootPackage — read-only cap exposing the embedded binaries and the config blob.

Everything else — drivers, net-stack, filesystems, supervisors, apps — init spawns at runtime via ProcessSpawner with appropriate attenuation. No manifest ServiceEntry, no cross-service CapRef, no manifest exports.

Pre-Init Boundary After Stage 6

Rule of thumb: no userspace service runs before init. The kernel’s job is primitive cap synthesis and a single-process handoff; init’s job is the whole service graph. Concretely, after Stage 6:

  • Stays in kernel pre-init: memory map ingest, frame allocator, heap, paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring dispatch, kernel-cap CapObject impls, ELF loading for init, boot package measurement (if attested boot is added).
  • Stays in manifest: binaries list + init config blob + kernel boot params. Schema-wise, ServiceEntry and CapSource::Service disappear; SystemManifest shrinks to binaries + initConfig + kernelParams.
  • Moves to init: service topology, cross-service cap wiring, attenuation, restart policies, dynamic spawn, cap export/import, supervision trees. Anything a service manager would do.
  • Moves to init or later services: logging policy, config store, secrets, filesystem mounts, network configuration, device binding.

Edge cases that might look like they want a pre-init service but don’t:

  • Early crash / panic handling. Kernel-side panic handler, no service needed.
  • Recovery shell. Kernel fallback: if init fails to reach a healthy state within a timeout (e.g. exits immediately, or never issues a liveness SQE), kernel optionally spawns a “recovery” binary from the boot package with the same cap bundle. Still just one userspace process at a time pre-supervisor-loop.
  • Attested/measured boot. Kernel hashes binaries in the boot package before handing BootPackage to init. The measurement agent, if any, runs as a normal service spawned by init with a cap to the sealed measurements.
  • Early-boot console. Kernel owns serial and exposes Console to init. A userspace log service can layer on top later; it is not pre-init.

Legacy Manifest Fields After Stage 6

ServiceEntry.caps, CapSource::Service, and ServiceEntry.exports are transitional init configuration, not kernel schema. The 15.4 schema split deleted them from schema/capos.capnp, collapsed the service graph into initConfig: CueValue, and kept kernel bootstrap on the first-service cap-table builder. The remaining cleanup is to make that first-service bundle fixed rather than manifest-selected:

  1. Move shell-led and focused harness proofs behind an init-owned executor config instead of booting their binaries directly as init.
  2. Embed or otherwise pin the generic init image as the only kernel-loaded userspace image. Partially landed (2026-05-25 23:26 UTC): the init image is embedded and loaded from kernel::boot::INIT_ELF whenever init.binary == "init" (see “Init Binary Embedding”). It is not yet the only kernel-loaded image — until step 1 moves the focused/shell proofs behind an init-owned executor, non-"init" PID-1 selectors are still kernel-loaded from binaries.
  3. Replace per-manifest initConfig.init.caps with the fixed bootstrap cap bundle described above plus BootPackage.
  4. Keep initConfig.services as ordinary init/supervisor configuration until a later libcapos or supervisor API gives it a more concrete format.

The re-export restriction added in capos-config::validate_manifest_graph (service A exports cap sourced from B.ep) becomes moot at that point because there are no kernel-owned manifest exports at all. It stays as defensive validation for initConfig.services while the transitional init-owned executor exists.

Init Binary Embedding

Status: landed 2026-05-25 23:26 UTC as a hybrid keyed on the reserved init selector (see below). Init is part of the kernel’s bootstrap contract, not a configuration choice: the cap bundle handed to init is a kernel ABI, the _start(ring, pid, …) entry shape is a kernel ABI, and a version-mismatched init is a footgun with no payoff in a single-init research OS. So the init ELF ships inside the kernel binary via include_bytes!, not as a separate manifest entry or Limine module.

Shape (as landed):

  • init/ stays a standalone crate with its own linker script and code model (user-space base 0x200000, static relocation model, 4 KiB alignment). Not a workspace member; different build flags than the kernel.
  • kernel/build.rs reads the prebuilt init/ artifact (the Makefile passes CAPOS_INIT_ELF and orders init before the kernel; a conventional-path fallback covers a bare cargo build after init is built) and emits an include_bytes!("…") into a kernel::boot::INIT_ELF: &[u8] static. Driving init’s build from build.rs was rejected to avoid duplicating its custom target/code-model flags; failing closed on a missing artifact is the chosen behavior.
  • initConfig.init.binary is a generic “which binary is PID 1” selector, so embedding is keyed on the reserved name capos_config::RESERVED_INIT_BINARY_NAME ("init"). When init.binary == "init", kernel bootstrap parses INIT_ELF through the same capos_lib::elf path used for service binaries, creates the init address space via AddressSpace::new_user(), loads segments, populates the cap bundle (including BootPackage), and jumps — no Limine module lookup and no binaries resolution for that identity. When init.binary names any other binary (the shell on run-smoke, the ~70 focused test-as-PID-1 manifests), PID 1 still resolves from SystemManifest.binaries exactly as before.
  • The reserved name "init" must not appear in SystemManifest.binaries: manifest validation (capos-config and mkmanifest) rejects it, since the kernel owns the init image. Real-init manifests drop their init entry; their binaries list is services-only.
  • The embedded image is the canonical init binary, so init’s own child spawns that reference init by name (e.g. system-spawn.cue’s spawn-hardening fixtures) still resolve: when init is embedded, run_init injects the embedded bytes into the ProcessSpawner binary set under the reserved name (the BootPackage cap serves only the serialized manifest bytes, which never carry the reserved entry). This keeps the spawnable set identical to the pre-embedding state without init re-entering the serialized manifest. Service binaries remain distinct BootPackage blobs.
  • Measured-boot attestation (if added) covers the kernel ELF, which transitively covers init’s bytes. Service binaries are hashed separately by the kernel before handing BootPackage to init.

What this does not change:

  • Init still runs in Ring 3 with its own page tables; embedding is byte packaging, not privilege merging.
  • Init is still ELF-parsed at boot — the same loader and W^X enforcement apply. The only thing different is where the bytes came from.
  • Service binaries (everything spawned after init) stay in the boot package as distinct blobs, exposed to init via BootPackage. They are not linked into the kernel; their lifecycle is independent of the kernel’s.

What option was rejected: fully linking init into the kernel crate (shared compilation unit, shared text). That collapses the kernel/user build boundary, couples linker scripts and code models, and puts init’s panics/UB inside the kernel’s compilation context. The process-isolation boundary survives that arrangement — but the build-time separation that makes the boundary trustworthy does not. include_bytes! preserves the separation; static linking destroys it.

Kernel boot
  │
  ├─ Create kernel caps: Console, Timer, DeviceManager, ProcessSpawner
  │
  └─ Spawn init with all kernel caps
       │
       init process (PID 1)
         │
         ├─ Phase 1: Core services (sequential — each depends on previous)
         │    ├─ DeviceManager.enumerate() → list of devices
         │    ├─ Spawn NIC driver with device-specific caps
         │    ├─ Wait for NIC driver to export Nic cap
         │    ├─ Spawn net-stack with Nic + Timer caps
         │    └─ Wait for net-stack to export NetworkManager cap
         │
         ├─ Phase 2: Higher-level services (can be parallel)
         │    ├─ Spawn http-service with TcpSocket cap from net-stack
         │    ├─ Spawn dns-resolver with UdpSocket cap
         │    └─ ...
         │
         └─ Phase 3: Applications
              ├─ Spawn app-a with HttpEndpoint("api.example.com")
              ├─ Spawn app-b with Fetch cap (trusted)
              └─ ...

The Init Process in Detail

Init is a regular userspace process with privileged caps. It is the only process that holds ProcessSpawner (the right to create new processes) and DeviceManager (the right to enumerate and claim devices). It can delegate subsets of these to child supervisors.

// init/src/main.rs — this IS the system configuration

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let devices = caps.get::<DeviceManager>("devices");
    let timer = caps.get::<Timer>("timer");
    let console = caps.get::<Console>("console");

    // === Phase 1: Hardware drivers ===

    // Find the NIC
    let nic_device = devices.find("virtio-net")
        .expect("no network device found");

    // Spawn NIC driver — gets ONLY its device's MMIO + IRQ
    let nic_driver = spawner.spawn(SpawnRequest {
        binary: "/sbin/virtio-net",
        caps: caps![
            "device_mmio" => nic_device.mmio(),
            "interrupt"   => nic_device.interrupt(),
            "log"         => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    // The driver exports a Nic cap once initialized
    let nic: Cap<Nic> = nic_driver.exported("nic").wait();

    // === Phase 2: Network stack ===

    let net_stack = spawner.spawn(SpawnRequest {
        binary: "/sbin/net-stack",
        caps: caps![
            "nic"   => nic,
            "timer" => timer.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let net_mgr: Cap<NetworkManager> = net_stack.exported("net").wait();

    // === Phase 3: HTTP service ===

    let tcp = net_mgr.create_tcp_pool();

    let http_service = spawner.spawn(SpawnRequest {
        binary: "/sbin/http-service",
        caps: caps![
            "tcp" => tcp,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::Always,
    });

    let fetch: Cap<Fetch> = http_service.exported("fetch").wait();

    // === Phase 4: Applications ===

    // Trusted telemetry agent — gets full Fetch
    spawner.spawn(SpawnRequest {
        binary: "/sbin/telemetry",
        caps: caps![
            "fetch" => fetch.clone(),
            "log"   => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Sandboxed app — gets scoped HttpEndpoint
    let api_cap = fetch.attenuate(EndpointPolicy {
        origin: "https://api.example.com",
        paths: Some("/v1/users/*"),
        methods: Some(&["GET", "POST"]),
    });

    spawner.spawn(SpawnRequest {
        binary: "/app/my-service",
        caps: caps![
            "api" => api_cap,
            "log" => console.clone(),
        ],
        restart: RestartPolicy::OnFailure,
    });

    // Init stays alive as the root supervisor
    supervisor_loop(&spawner);
}

Key Mechanisms

Cap export. A spawned process can export capabilities back to its parent via the ProcessHandle (see Spawn Mechanism section). This is how the NIC driver makes its Nic cap available to the network stack — init spawns the driver, waits for it to export "nic", then passes that cap to the next process.

Restart policy. Encoded in SpawnRequest, enforced by the supervisor loop in the spawning process. When a child exits unexpectedly:

  1. Old caps held by the child are automatically revoked (kernel invalidates the process’s cap table on exit)
  2. Supervisor re-spawns with the same SpawnRequest
  3. New instance gets fresh caps — same authority, new identity

Dependency ordering. Sequential in code: wait() on exported caps blocks until the dependency is ready. No declarative dependency graph needed — Rust’s control flow is the dependency graph.

Service Taxonomy

Concrete categories of userspace services capOS expects to run. All spawned by init (or a supervisor init delegates to) after Stage 6. None are pre-init.

Hardware Drivers

One process per managed device. Each holds exactly the caps for its own hardware: an DeviceMmio slice, the corresponding Interrupt cap, and optionally a DmaRegion cap carved out of the frame allocator. Exports a typed device cap (Nic, BlockDevice, Framebuffer, Gpu, …). Examples: virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.

Platform Services

  • Logger / journal — accepts Log cap writes, forwards to console and/or durable storage. Init and kernel bootstrap use a direct Console cap until the logger is up; afterwards new services get Log caps only.
  • Filesystem — one per mounted volume. Consumes a BlockDevice cap, exports Directory / File caps. FAT, ext4, overlay, tmpfs.
  • Store — capability-native content-addressed storage backing persistent capability state (storage-and-naming-proposal.md).
  • Network stack — userspace TCP/IP (networking-proposal.md). Consumes Nic + Timer, exports NetworkManager, TcpSocket, UdpSocket, TcpListener.
  • DNS resolver — consumes a UdpSocket, exports Resolver.
  • Config / secrets store — reads the initial config from BootPackage, exposes runtime Config and Secret caps with per-key attenuation.
  • Cloud metadata agent — detects IMDS / ConfigDrive / SMBIOS on cloud boot and delivers a ManifestDelta (cloud-metadata-proposal.md).
  • Upgrade manager — orchestrates CapRetarget for live service replacement (live-upgrade-proposal.md).
  • Capability proxy — makes selected local caps reachable over the network. The near-term shape is typed Cap’n Proto RPC or a schema-framed proxy, following Cloudflare’s production pattern of schema-bundled Workers bindings to internal services; later remote-capability sessions can borrow Spritely/OCapN CapTP’s session, handoff, and reference-lifetime model without treating current OCapN drafts as capOS ABI commitments. The proxy must never serialize local CapId values, endpoint generations, receiver selectors, or kernel/session ids as portable authority, and it must own explicit resource ledgers for remote refs, queued calls, streams, and retries. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP.
  • Measurement / attestation agent — consumes sealed kernel hashes from BootPackage, exposes Quote caps for remote attestation.

Supervisors

Per-subsystem restart managers that hold a narrowed ProcessSpawner plus the caps of the subtree they own. If any child crashes, the supervisor tears down and re-spawns the set. Example: net-supervisor owns NIC driver + net-stack + DHCP client.

Application Services

User-facing or user-spawned processes: HTTP servers, API gateways, worker pools, shells, interactive tools. Hold only the narrow caps the supervisor grants (HttpEndpoint for one origin, Directory for one mount, etc.). Human users, service accounts, guests, and anonymous callers are represented by session/profile services that grant scoped cap bundles; they are not kernel subjects or ambient process credentials. See User Identity and Policy.

What Does Not Become a Service

  • Console / serial — stays in the kernel as a CapObject wrapper. Small enough, needed for kernel diagnostics, no benefit from userspace isolation. A userspace log service can layer on top.
  • Frame allocator, virtual memory, scheduler, ring dispatch — kernel primitives, exposed as caps but not as services.
  • Interrupt delivery, DMA mapping — kernel mechanisms, exposed to drivers as caps.
  • Boot measurement — if added, happens in the kernel before BootPackage exists; the measurement agent (userspace) only reports them.

Supervision

Supervision Tree

Init doesn’t have to supervise everything directly. It can delegate:

init (root supervisor)
  ├─ net-supervisor (holds: spawner subset, device caps)
  │    ├─ virtio-net driver
  │    ├─ net-stack
  │    └─ http-service
  └─ app-supervisor (holds: spawner subset, service caps)
       ├─ my-service
       └─ another-app

Each supervisor is a process that holds a ProcessSpawner cap (possibly restricted to specific binaries) and the caps it needs to grant to children. If net-supervisor crashes, init restarts it, and it re-spawns the entire networking subtree.

Supervisor Loop

#![allow(unused)]
fn main() {
fn supervisor_loop(children: &[SpawnRequest], spawner: &ProcessSpawner) {
    let mut handles: Vec<ProcessHandle> = children.iter()
        .map(|req| spawner.spawn(req.clone()))
        .collect();

    loop {
        // Wait for any child to exit
        let (index, exit_code) = wait_any(&handles);
        let req = &children[index];

        match req.restart {
            RestartPolicy::Always => {
                handles[index] = spawner.spawn(req.clone());
            }
            RestartPolicy::OnFailure if exit_code != 0 => {
                handles[index] = spawner.spawn(req.clone());
            }
            _ => {
                // Process exited normally, don't restart
            }
        }
    }
}
}

Socket Activation

systemd pre-creates a socket and passes the fd to the service on first connection. In capOS, the supervisor does the same with caps:

Eager (default): supervisor spawns the child immediately with a TcpListener cap. Child calls accept() and blocks.

Lazy: supervisor holds the TcpListener cap itself. On first incoming connection (or on first accept() from a proxy cap), it spawns the child and transfers the cap. The child code is identical in both cases.

#![allow(unused)]
fn main() {
// Lazy activation — supervisor holds the listener until needed
let listener = net_mgr.create_tcp_listener();
listener.bind([0,0,0,0], 8080);

// This blocks until a connection arrives
let _conn = listener.accept();

// Now spawn the actual service, giving it the listener
spawner.spawn(SpawnRequest {
    binary: "/app/web-server",
    caps: caps!["listener" => listener, "log" => console.clone()],
    restart: RestartPolicy::Always,
});
}

Configuration

See Storage and Naming for the full storage, naming, and configuration model.

Summary: the system topology is currently defined in a capnp-encoded system manifest baked into the boot image. tools/mkmanifest compiles the human-authored system.cue, system-smoke.cue, or focused manifest sources such as system-spawn.cue, system-devicemmio-grant.cue, and system-wasi-random.cue into the binary manifest. Default boot uses standalone init and init-owned service-graph execution; focused shell-led manifests still grant login/session/broker caps directly to capos-shell for narrow smokes. Focused init-executor manifests let the separate init binary validate and execute the manifest through ProcessSpawner; the old generic kernel resolver has been replaced by first-service cap construction. Manifest-declared SpawnGrantSource::Kernel entries cover the bounded DDF authority surface (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) and the wasm-host’s optional EntropySource grant; the WASI host adapter (see WASI Host Adapter) and the POSIX adapter (see POSIX Adapter) both run as ordinary userspace processes spawned through this same path. Remaining cleanup is to move runtime configuration into a capability-based store service once that service exists. See also the layered CUE configuration model in System Configuration and Operator Extensibility.

Comparison with Traditional Approaches

Concernsystemd/LinuxcapOS
Service dependenciesWants=, After=, Requires=Implicit in cap graph
Sandboxingseccomp, namespaces, AppArmorDefault: zero ambient authority
Socket activationListenStream=, fd passing protocolPass TcpListener cap
Restart policyRestart=on-failureSupervisor process loop
Loggingjournald, StandardOutput=journalLog cap in granted set
Resource limitscgroups, MemoryMax=, CPUQuota=Bounded allocator caps
Network access controlfirewall rules (iptables/nftables)Scoped HttpEndpoint / TcpSocket caps
Config formatINI-like unit files (~1500 directives)Rust code or minimal manifest
Trusted computing basesystemd PID 1 (~1.4M lines)Init process (hundreds of lines)

Spawn Mechanism

Spawning is a capability-gated operation. The kernel provides a ProcessSpawner capability — only the holder can create new processes.

Implemented Kernel Slice

The kernel now provides:

  1. ProcessSpawner capability — a CapObject impl in kernel/src/cap/process_spawner.rs. Methods:

    • spawn(name, binaryName, grants) -> handleIndex — resolve a boot-package binary, load ELF, create address space (builds on existing elf.rs loader and AddressSpace::new_user() in mem/paging.rs), populate the initial cap table, schedule the process, and return the ProcessHandle through the ring result-cap list
    • the returned ProcessHandle cap lets the parent wait for child exit in the first slice; exported caps and kill semantics are later lifecycle work
  2. Initial cap passing — at spawn time, the kernel copies permitted parent cap references into the child’s cap table or mints authorized child-local kernel caps. Raw grants preserve the source legacy badge. Endpoint-client grants may mint a requested legacy badge only from an endpoint owner or trusted parent endpoint result source; delegated client facets must preserve their existing service identity. Child-local Endpoint, FrameAllocator, and VirtualMemory grants are created for the child’s process. Child-local endpoint grants return parent-side client facets as result caps instead of sharing the endpoint owner object. The parent’s references are unaffected. Legacy endpoint badges are transitional; new multi-client service identity should use session-bound invocation context plus broker-granted service roots/facets.

  3. Cap export — future lifecycle work will let a child register a cap by name in its ProcessHandle, making it available to the parent (or anyone holding the handle). This is the mechanism behind nic_driver.exported("nic").wait() once exported-cap lookup is added.

Schema

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (
        handleIndex :UInt16,
        capabilityManagerIndex :UInt16,
    );
    createPipe @1 (bufferBytes :UInt32) -> (readIndex :UInt16, writeIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
    mode @3 :CapGrantMode;
    badge @4 :UInt64;
    source @5 :CapGrantSource;
}

struct CapGrantSource {
    union {
        capability @0 :Void;
        kernel @1 :KernelCapSource;
    }
}

enum CapGrantMode {
    raw @0;
    clientEndpoint @1;
    move @2;
    serviceObject @3;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
    terminate @1 () -> ();
}

Note on capability passing: Capabilities are referenced by cap table slot IDs (UInt32), not by Cap’n Proto’s native capability table mechanism. spawn() returns the ProcessHandle and a CapabilityManager cap through the ring result-cap list; handleIndex and capabilityManagerIndex identify those transferred caps in the completion. The first slice passes a boot-package binaryName instead of raw ELF bytes so the request stays within the bounded ring parameter buffer. terminate (deferred kill) is implemented on ProcessHandle; post-spawn grants and exported-cap lookup remain future lifecycle work until their authority semantics are implemented. capOS uses manual capnp dispatch (CapObject trait with raw message bytes, not capnp-rpc), so cap references are plain integers and typed result caps use the ring transfer-result metadata. See Userspace Binaries Part 7 for the surrounding userspace bootstrap schema context, Part 4 for the POSIX adapter surface that consumes ProcessSpawner.createPipe plus the recording-shim fork-for-exec successor posix_spawn over the same Move-grant path, and Part 5 for the WASI host adapter that runs as a userspace process spawned through this same ProcessSpawner with manifest-supplied capability grants (WASI Host Adapter).

Relationship to Existing Code

The current kernel has these pieces in place:

  • ELF loading (kernel/src/elf.rs) — parses PT_LOAD segments, validates alignment, and feeds the reusable spawn primitive behind ProcessSpawner.
  • Address space creation (kernel/src/mem/paging.rs) — AddressSpace::new_user() creates isolated page tables with the kernel mapped in the upper half.
  • Cap table (kernel/src/cap/table.rs) — CapTable with insert(), get(), remove(), transfer preflight, provisional insert, commit, and rollback helpers. Each Process owns one local table.
  • Process struct and scheduler (kernel/src/process.rs, kernel/src/sched.rs) — a process table plus round-robin run queue are in place for both legacy manifest-spawned services and init-spawned children.

Generic capability transfer/release and the reusable ProcessSpawner lifecycle path are complete enough for the focused init-owned spawn executor. Default startup now uses standalone init for service-graph execution, while focused shell-led startup remains for narrow smokes. ProcessSpawner.createPipe extends the lifecycle surface with a bounded SPSC kernel Pipe capability consumed by the POSIX adapter’s recording-shim fork-for-exec path (P1.3) and exposed as the posix_spawn successor on the same Move-grant path. The DDF Task 5 grant-source families (devicemmio_grant_source.rs, dmapool_grant_source.rs, and their interrupt/audit peers) extend SpawnGrantSource::Kernel with the bounded manager-issued DDF authority surface; production handle lifecycle, hardware- backed driver wait/ack dispatch beyond bounded route proofs, and the S.11.2 hostile-smoke gates remain open. Each spawned process also receives one immutable session context (default-inherited from the parent or broker-selected), used as the invocation subject for audit attribution and the identity-policy boundary. Remaining lifecycle gaps are post-spawn grants, runtime exported-cap lookup, restart supervision, and shrinking the transitional manifest schema. ProcessHandle.terminate (deferred kill) is implemented.

Prerequisites

PrerequisiteStatusWhy
ELF loading + address spacesDone (Stage 2-3)elf.rs, AddressSpace::new_user()
Capability ring + cap_enterDone (Stage 4/6 foundation)Ring-based cap invocation with blocking waits
Scheduling + preemption (core)Done (Stage 5)Round-robin, PIT 100 Hz, context switch
Cross-process Endpoint IPCDone (Stage 6 foundation)CALL/RECV/RETURN routing through Endpoint objects
Generic cap transfer/releaseDone (Stage 6, 2026-04-22/24)Copy/move transfer, result-cap insertion, CAP_OP_RELEASE, epoch revocation, and revoked endpoint Disconnected error surface
ProcessSpawner + ProcessHandleDone (Stage 6, 2026-04-22)Init-driven spawn with grants, wait completion, hostile-input coverage; kill/post-spawn grants still future
ProcessSpawner.createPipe + recording-shim fork-for-execDone (POSIX adapter P1.3, 2026-05-07 09:55 UTC)Bounded SPSC Pipe capability and Move-grant fork-for-exec successor; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4
DDF bootstrap-grant sources (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog)In progress (DDF Task 5)Bounded manager-issued authority over SpawnGrantSource::Kernel; production handle lifecycle and S.11.2 hostile smokes remain open. See device-driver-foundation.md Task 5
Immutable per-process session contextDone (kernel/src/session_context.rs)One session context per process, default-inherited or broker-selected; make run-session-context proof
Authority graph + quota design (Security Verification Track S.9)Done (2026-04-21)Defines transfer/spawn invariants, per-process quotas, and rollback rules; see docs/authority-accounting-transfer-design.md

This proposal describes the target architecture. Individual pieces (like Fetch/HttpEndpoint) are additive — they’re userspace processes that compose existing caps into higher-level ones. No kernel changes needed beyond Stages 4-6.

First Step After Transfer and ProcessSpawner — done 2026-04-23

The minimal demonstration of this architecture landed together with capability transfer and ProcessSpawner:

  1. ProcessSpawner cap in kernel/src/cap/process_spawner.rs wraps ELF loading and address-space creation behind a typed capability.
  2. Init spawns children — focused make run-spawn boots a single-init manifest; the kernel boots only the separate init binary from initConfig.init, then init spawns the focused demo graph from initConfig.services through ProcessSpawner, grants child-local endpoint owners and client facets, then releases parent endpoint facets before waiting on each ProcessHandle.
  3. Cross-process cap invocation — spawned client invokes the server’s Endpoint cap, server replies, both print to console.

This exercises: spawn cap, initial cap passing, manifest-declared export recording, cross-process cap invocation, hostile-input rejection, and per-process resource exhaustion paths. Deleting the unused legacy kernel resolver is post-milestone cleanup tracked in docs/tasks/.

Open Questions

  1. Restart supervision. Epoch-based cap revocation and generation-tagged stale reference detection are implemented for current grant/revoke flows. Restart policy still needs a supervisor contract that epoch-bumps caps served by the failed process, restarts from the manifest, and reconnects clients through explicit authority rather than ambient service lookup.

  2. Cap discovery. How does a process learn what caps it was given? Resolved: name→(cap_id, interface_id) mapping passed at spawn via a well-known page (CapSet). See Userspace Binaries Part 2. cap_id is the authority-bearing table handle. interface_id is the transported capnp TYPE_ID used by typed clients to check that the handle speaks the expected interface.

  3. Lazy spawning. Should the init process start everything eagerly, or should caps be backed by lazy proxies that spawn the backing service on first invocation?

  4. Cap persistence. If the system reboots, should the cap graph be reconstructable from saved state? Or is it always rebuilt from init code?

  5. Delegation depth. Can an application further delegate its HttpEndpoint cap to a subprocess? If so, the HTTP gateway needs to support fan-out. If not, how is this restriction enforced?

Proposal: Schema Registry Capability

Cap’n Proto is self-describing. When the compiler processes schema/capos.capnp it emits a CodeGeneratorRequest containing every interface id, every method name and ordinal, every parameter and result struct layout, every enum, and every doc comment. That machine-readable reflection data exists today; it just is not served at runtime. This proposal defines a SchemaRegistry capability that serves it.

Status: Proposal. No implementation. The prerequisite work – schema doc-comment authoring across schema/capos.capnp and preservation of those comments in the generated-bindings pipeline – is tracked separately and is also a prerequisite for the System Manual Phase 3. This proposal records the design and its authority model so it can be built once those prerequisites land.

Problem

Every capability interface in capOS has a precise machine-readable definition: method names, ordinals, parameter struct field names and types, result struct layouts, enums. Today that information lives only in the host-side compiler output, the checked-in generated Rust bindings, and in the heads of developers who have read the schema. A running capOS instance cannot answer:

  • “What methods does this interface expose?”
  • “What ordinal does listMethods map to?”
  • “What fields does the parameter struct for resolveMethod contain?”

This gap affects three categories of caller:

  1. Interactive shell. A user typing call @cap.method(args) in the capOS shell wants the shell to resolve the human method name to an ordinal, check argument types against the parameter struct schema, encode the capnp message, dispatch the call, and decode the result – all without requiring the user to have memorized ordinals or wire layouts.
  2. Dynamic and agent-driven callers. A process or agent that receives a capability without compile-time bindings cannot easily discover what methods are available. Today it must carry out-of-band schema knowledge or guess. A machine-readable registry eliminates that gap.
  3. Cross-language and network tooling. A host-side tool connecting to a running capOS instance via the remote-session gateway needs schema metadata to encode and decode capnp messages without shipping language-specific generated bindings for every possible capability type.

What Cap’n Proto’s Self-Description Provides

The capnp compiler’s CodeGeneratorRequest contains:

  • For interfaces: the 64-bit interface id, the interface name, and for each method: its ordinal (call slot number), method name, the type id of the parameter struct, and the type id of the result struct.
  • For structs: the struct’s 64-bit type id, name, and for each field: field name, ordinal slot, and the type of the value (a primitive, a struct type id, a list, a capability type id, etc.).
  • For enums: the enum type id, name, and for each enumerant: name and numeric value.
  • Doc comments: the raw doc-comment text attached to interfaces, methods, structs, and fields. These are preserved in the CodeGeneratorRequest when the compiler receives them; the current generated-bindings pipeline strips them. Preserving them is a tracked prerequisite.

The registry bakes this data into a boot-packaged blob at make time, exactly as the System Manual bakes its corpus. Both are read-only deliveries of build-time information; neither reflects the live state of a running object.

Relationship to the System Manual

The SchemaRegistry and the Manual capability share one substrate: the same CodeGeneratorRequest blob baked at build time. They are two delivery modes of that shared reflection data:

  • Manual (see System Manual Capability): human prose delivery. It renders the schema into man(2)-style interface pages with SYNOPSIS sections generated from method signatures and DESCRIPTION sections from doc comments. It serves text, structured for human reading.
  • SchemaRegistry: machine-readable metadata delivery. It serves structured SchemaNode values carrying interface ids, ordinals, type ids, and field layouts. It serves data, structured for programmatic consumption.

The two can share a service implementation that reads the same blob; the interface shape differs because the consumers differ.

Interface

struct MethodInfo {
    ordinal      @0 :UInt16;    # call slot number
    name         @1 :Text;      # as written in the .capnp source
    paramTypeId  @2 :UInt64;    # type id of the parameter struct
    resultTypeId @3 :UInt64;    # type id of the result struct
    docComment   @4 :Text;      # empty until doc-comment prerequisite lands
}

struct FieldInfo {
    slot         @0 :UInt16;
    name         @1 :Text;
    typeKind     @2 :TypeKind;
    structTypeId @3 :UInt64;    # set when typeKind == struct
    docComment   @4 :Text;
}

enum TypeKind {
    void @0; bool @1; int8 @2; int16 @3; int32 @4; int64 @5;
    uint8 @6; uint16 @7; uint32 @8; uint64 @9; float32 @10; float64 @11;
    text @12; data @13; list @14; enum_ @15; struct_ @16; interface_ @17;
    anyPointer @18;
}

struct SchemaNode {
    typeId      @0 :UInt64;
    displayName @1 :Text;
    union {
        interface @2 :InterfaceSchema;
        struct_   @3 :StructSchema;
        enum_     @4 :EnumSchema;
        other     @5 :Void;
    }
}

struct InterfaceSchema {
    methods     @0 :List(MethodInfo);
    docComment  @1 :Text;
}

struct StructSchema {
    fields      @0 :List(FieldInfo);
    docComment  @1 :Text;
}

struct EnumSchema {
    enumerants  @0 :List(EnumerantInfo);
    docComment  @1 :Text;
}

struct EnumerantInfo {
    value       @0 :UInt16;
    name        @1 :Text;
}

struct SearchResult {
    typeId      @0 :UInt64;
    displayName @1 :Text;
    kind        @2 :Text;        # "interface", "struct", or "enum"
    snippet     @3 :Text;        # first line of doc comment, if present
}

interface SchemaRegistry {
    # Resolve a method name on an interface to its ordinal and struct type ids.
    resolveMethod @0 (interfaceId :UInt64, name :Text)
        -> (ordinal :UInt16, paramTypeId :UInt64, resultTypeId :UInt64);

    # Fetch the full schema node for a given type id.
    lookupType    @1 (typeId :UInt64) -> (node :SchemaNode);

    # List all methods on an interface (for discovery without a known name).
    listMethods   @2 (interfaceId :UInt64) -> (methods :List(MethodInfo));

    # Keyword search across names and doc comments in the baked blob.
    search        @3 (query :Text) -> (candidates :List(SearchResult));

    # The build/commit this schema blob was produced from.
    buildInfo     @4 () -> (commit :Text, builtAt :Text);
}

The interface is additive-only; future methods append at higher ordinals, matching the convention already established by SystemInfo and Manual.

Authority Model

This is the most important section of this proposal. The registry embodies capOS Design Principle 4 – “the interface IS the permission” – from a different angle:

Discovery does not grant call authority.

Holding a SchemaRegistry capability lets the caller learn the shape of an interface: its method names, ordinals, parameter field names and types, and doc comments. It does not grant the caller permission to invoke those methods. To call Console.writeLine, the caller still needs a live Console capability in its CapSet. The registry answers “what can this type do in general?” – the live capability answers “what can I do right now, and am I permitted to do it?”

This split is not a weakening of the capability model; it is the correct expression of it. Consider the analogy: knowing that a bank offers a “transfer” operation does not give you a bank account. Learning that ProcessSpawner exposes a spawn method does not give you a ProcessSpawner.

Practical consequences:

  • SchemaRegistry is read-only and holds no authority beyond serving schema metadata from the build-time blob. It is safe to grant to any process or agent that needs dynamic method discovery.
  • The kernel remains the validation trust boundary. When a call arrives via the ring, the kernel dispatches it to the named CapObject. The registry is a client-side convenience for encoding the request correctly; the kernel validates the incoming message on dispatch and rejects malformed calls regardless of whether the caller used the registry.
  • A caller that uses the registry to build a call message is still subject to the normal ring dispatch path. The registry cannot bypass or relax kernel validation.
  • Fail-safe. If the registry is not granted, dynamic clients fall back to compile-time bindings or refuse to operate. The registry enhances ergonomics; it is not on the critical authority path.

Data Source: Build-Time, Not Live-Object Reflection

The registry does not introspect a running object. It serves the static schema baked from the CodeGeneratorRequest at build time. This has two consequences:

  • No live-object coupling. The registry knows nothing about which capabilities are currently allocated, which processes hold them, or what runtime state a live capability has. It knows only what the schema says all instances of a given interface can do.
  • Blob freshness is tied to the build. Like the System Manual blob, buildInfo @4 carries the commit and build timestamp so a caller can tell which schema version is loaded. A running instance with a stale blob reflects that build’s schema, not any live update.

Where It Lives

SchemaRegistry is a userspace service backed by a boot-packaged schema blob, consistent with the broader capOS policy of putting metadata and policy enforcement in userspace while the kernel handles dispatch and isolation. The implementation mirrors the System Manual service:

  • At make time, a host tool reads the compiler’s CodeGeneratorRequest output and produces a compact, read-only binary blob.
  • The blob is packaged in the boot image alongside the manifest, delivered like BootPackageCap entries.
  • A userspace schema-registry service reads the blob from the BootPackage, implements the SchemaRegistry interface, and is granted to processes that need it via the manifest cap grants or the AuthorityBroker bundle.

The shared blob between Manual and SchemaRegistry is a build artifact. Whether they share a single service binary or run as two services consuming the same blob is an implementation decision; the capability interface is the boundary.

Primary Use Cases

Shell call @cap.method(args) dispatch

The shell receives a human-typed method invocation. To dispatch it:

  1. Inspect the live capability to get its interface id (the interface_id surface is present today via capos-lib/src/cap_table.rs).
  2. Call SchemaRegistry.resolveMethod(interfaceId, methodName) to get the ordinal, parameter type id, and result type id.
  3. Use lookupType(paramTypeId) to get the parameter struct schema and validate or interactively prompt the user for each field.
  4. Encode the capnp message with the resolved ordinal and parameter encoding.
  5. Submit the call via the ring. On completion, use lookupType(resultTypeId) to decode the result message for display.

This eliminates the requirement for the shell to carry compile-time knowledge of every capability interface ordinal and struct layout.

Dynamic / Late-Bound Clients

A process or agent that receives a capability without compile-time bindings can call listMethods(interfaceId) to enumerate what the interface supports, then use resolveMethod for each call it intends to make. This enables generic capability explorers, cross-version bridges, and agent-driven automation that adapts to the interface rather than hardcoding ordinals.

Cross-Language and Network Tooling

A host-side tool connecting to a running capOS instance via the remote-session gateway fetches the schema blob via lookupType / listMethods calls relayed through the remote session, and uses the result to encode and decode capnp messages in any language that has a capnp parser. This decouples the tooling from language-specific generated bindings.

Schema-Driven Test Harnesses

A test harness can use the registry to enumerate all methods on an interface and generate exerciser calls with synthetic arguments, validating that the live capability handles all known methods without panicking – a form of schema-conformance fuzzing driven by the registry itself.

Sequencing and Prerequisites

Two prerequisites are shared with the System Manual Phase 3:

  1. Doc-comment authoring in schema/capos.capnp. The schema currently carries minimal doc comments. The docComment fields in the registry’s schema nodes will be empty until this authoring work lands. The registry is still useful without doc comments – method names, ordinals, and struct layouts are fully present – but the schema-as-documentation story depends on this work.
  2. Doc-comment preservation in the generated-bindings pipeline. The tools/capnp-build script currently strips doc comment text from the emitted Rust bindings. The registry’s blob builder must read the raw CodeGeneratorRequest before that stripping occurs, so this prerequisite is about pipeline ordering, not a new tool.

The registry interface and blob format can be designed and the boot-packaging infrastructure written before those prerequisites land; the docComment fields start empty and are populated once the prerequisite lands.

Relationship to Existing Proposals

  • System Manual (System Manual Capability): the human-readable twin. Both share the CodeGeneratorRequest blob source; neither is a prerequisite of the other. They can be built in either order or together.
  • SystemInfo proposal (System Info Capability): SystemInfo provides scalar system facts; SchemaRegistry provides interface metadata. No overlap.
  • Interactive command surfaces (Interactive Command Surfaces): a future typed CommandSession may use the registry to validate command arguments before dispatch.
  • Remote-session UI (Remote Session CapSet Clients): host-side tooling that relays capability calls through the remote session is a primary consumer of the registry’s cross-language tooling use case.

Open Questions

  • Blob sharing or dual instantiation? The System Manual and Schema Registry share a blob source. Whether they are implemented as one service that exposes two capability interfaces or two separate services that each read the blob at startup is an implementation choice. Two interfaces, one service is the likely outcome; this should be decided when the first implementation starts.
  • Schema node format evolution. As schema/capos.capnp evolves, the blob format must evolve with it. Whether the blob is a verbatim CodeGeneratorRequest wire encoding, a normalized subset, or a purpose-built indexed structure is a build-tool design question.
  • Search index. The search method needs a keyword index built into the blob at make time rather than a linear scan. The index strategy (inverted index over name tokens and doc comment words) should be decided when the blob builder is implemented.

Design Grounding

  • Cap’n Proto reflection model and CodeGeneratorRequest wire format: capnp crate documentation and the capnp language reference.
  • Interface id and interface_id() surface: capos-lib/src/cap_table.rs.
  • Boot-packaged blob delivery pattern: kernel/src/cap/boot_package.rs.
  • Shared substrate with the System Manual: System Manual Capability, particularly the “schemaReflection source” section.
  • Authority model grounding: docs/capability-model.md and Design Principle 4 in CLAUDE.md.

Proposal: Session Archive and Gantt Effort Pipeline

Development tasks in capOS each carry a real start and finish time. The autonomous development loop records these directly for tasks it executes; for earlier work the timing is recoverable from agent session transcripts. Collected together and attributed to branches and tasks, that timing data enables two things: a whole-history development Gantt and a dataset for predicting how long a future task will take.

Status: Proposal. The foundation is partially landed. A per-day task ledger exists in docs/tasks/done/, where each done entry carries the real branch commit SHAs and, for tasks executed by the autonomous development loop, real started and completed timestamps sourced from the run-telemetry log. A prepare-commit-msg hook stamps Plan-Item, Run-Id, and Agent-Kind trailers on commits so the commit-to-task-to-run mapping is native to git history. The session-transcript ETL, the derived dataset builder, and the duration-prediction model are future work this proposal scopes.

Goals

  • Predict how long a future task will take from historical effort patterns, using features derivable from the task’s commits and metadata.
  • Render a whole-history development Gantt over the landed branch and task ledger, attributing each interval to the task that produced it.
  • Feed that data back into planning: size estimates, milestone forecasting, and identification of subsystems or slice classes that consistently take longer than anticipated.

Timing Sources

Two sources provide per-task effort data, at different points in the project timeline:

Run-telemetry log (loop-era tasks). The autonomous development loop writes a record per task run to a local telemetry log. Each record carries: a run id, the task id, the agent kind, a session id, a started timestamp (when the agent began), and a completed timestamp (when the agent finished and the branch was merged or abandoned). These timestamps are exact wall-clock values, not estimates. They are written to the local run-telemetry log (ephemeral, not committed) and promoted to the task’s done/ file as started: and completed: front-matter fields when the task closes. That promotion is the boundary between local operational state and the durable public record.

Agent session transcripts (pre-loop history). For tasks worked before the autonomous development loop existed, timing must be reconstructed from agent session transcripts. Two transcript formats exist in the project history:

  • A Claude session JSONL format: one JSON object per turn, with a UTC timestamp, a role (user or assistant), message content, and tool-call records.
  • A Codex session-rollout format: a structured log of model turns with file edits, shell commands, and timestamps.

Both formats carry enough information to recover: when a session started, which files were touched, which repository and branch were active, and approximately when the session ended (last turn timestamp). Cross-tool interval merging (a task worked in two different tools during the same calendar day) is a rare edge case; in practice each task belongs primarily to one tool and one continuous session window.

Pipeline

The pipeline has four stages:

1. Collect

Gather transcript files from wherever they reside. The Claude JSONL transcripts are stored under a well-known local path per session. The Codex rollouts are scattered across machines and backup directories and must be enumerated by a manifest or directory scan. Neither format is committed to the repository; they are local/backup artifacts. The collect stage produces a manifest of transcript files keyed by session id and format type.

2. Normalize

Parse each transcript into a common event schema:

{
  "session_id": "...",
  "format": "claude-jsonl" | "codex-rollout",
  "started_at": "<UTC ISO timestamp>",
  "ended_at":   "<UTC ISO timestamp>",
  "repo":       "<repo name>",
  "branch":     "<branch name or null>",
  "files_touched": ["<relative path>", ...],
  "tool_calls": <count>,
  "role_turns": <count>
}

The started_at and ended_at values are the first and last turn timestamps in the session. For the duration estimate, idle time between turns (long pauses between user and assistant turns, or overnight gaps within a session file) is clipped: only contiguous active intervals – where consecutive turn timestamps are within a configurable idle threshold – count toward the active duration. The result is an idle-clipped active duration attributed to the session.

Per-task effort is the sum of idle-clipped active durations across all sessions whose branch matches the task’s task branch. For tasks with a single session this is trivial; for tasks where a session covered multiple branches, the attribution is prorated by file overlap or left to manual annotation.

3. Recap and Index

After normalization, a recap step produces a per-task effort index: task id, branch, real started/completed timestamps (from the run-telemetry promotions for loop-era tasks, from the session-normalized estimate for pre-loop tasks), idle- clipped active duration, agent kind, and the commit SHAs that belong to the task. This index is written to a structured file (JSON Lines, one record per task) under target/ during the build and is the input to the dataset builder and the Gantt renderer. It is a derived artifact; the sources of truth are git history, the docs/tasks/done/ ledger, and the transcript files.

4. Store in Object Storage

The normalized transcript archive and the per-task effort index are stored in object storage (GCS or S3) under a versioned prefix. This serves two purposes: it makes the archive portable across machines, and it provides a stable input for the prediction dataset builder that does not depend on the local transcript directory layout. The object storage upload is a manual or CI-triggered step, not part of every build.

Commit Provenance

The prepare-commit-msg hook (landed at tools/githooks/prepare-commit-msg) stamps three trailers on every commit:

  • Plan-Item: <task-id> – the task this commit belongs to.
  • Run-Id: <run-id> – the run-telemetry log entry for this work session.
  • Agent-Kind: <kind> – which implementation agent produced the commit.

These trailers make the commit-to-task and commit-to-run mappings native to git history and queryable by git log --grep. A Gantt renderer can walk git log and group commits by Plan-Item, attributing intervals to tasks without any external database. The run-telemetry log fills in wall-clock start/end; git provides the commit sequence and churn metrics.

Prediction Dataset

The prediction dataset is a derived artifact built by a script from git history and the per-task effort index. It is not stored in task front matter; the task front matter carries only the real timestamps and commit SHAs, not derived features.

Features (X): per-task git-derived metrics over the task’s commits: list:

  • Commit count.
  • Churn: insertions + deletions.
  • Files changed (total and unique).
  • Subsystems touched: a subsystem label per changed file, derived from the directory prefix (e.g. kernel/, capos-lib/, docs/, schema/).
  • Categorical fields: milestone/track, slice class (behavior, read-side-proof, harness-hardening, docs-status), hazard families checked.

Label (y): real effort in minutes – the idle-clipped active duration from the session archive, or the run-telemetry-derived interval for loop-era tasks.

Granularity: one record per branch merge (feature/work-unit granularity). This matches the size at which future tasks are dispatched and avoids the noise of per-commit or per-day fragments. Tasks that span multiple branches (a prerequisite branch plus a follow-up) are modeled as separate records linked by a dependency field; the prediction target is per-branch, and milestone forecasting aggregates across the dependency graph.

Model: a regression over the feature set above, using a simple baseline (linear regression or gradient-boosted trees) before investing in anything more complex. The first useful output is a p50/p90 interval per slice class and subsystem combination, not a precise point estimate.

Gantt Rendering

The Gantt is rendered from the per-task effort index: each task becomes a bar spanning its started_at to completed_at (or started_at plus active duration for pre-loop tasks where only the duration is reliable). Tasks are grouped by milestone and slice class, and bars are colored by subsystem. The output is a static SVG or a simple HTML/SVG file – not an interactive dashboard. The rendering script reads the per-task effort index from target/ and writes target/gantt.svg or target/gantt.html. It is not part of the default build.

Sequencing and Prerequisites

The following are already landed:

  • docs/tasks/done/ ledger with real started: and completed: fields for loop-era tasks.
  • prepare-commit-msg hook stamping Plan-Item, Run-Id, and Agent-Kind trailers.
  • Run-telemetry log entries for loop-era tasks (local, ephemeral).

The following are future work:

  1. Transcript collector and normalizer. Write parsers for the Claude JSONL and Codex rollout formats, the idle-clipping logic, and the per-task effort index builder. This is a standalone Python or Rust host tool; no kernel changes.
  2. Backfill pass. Run the normalizer over the existing transcript archive to populate pre-loop effort estimates for tasks in docs/tasks/done/. Where transcripts are unavailable, leave the duration field as null with a source: unavailable annotation; do not invent estimates.
  3. Object storage upload. Configure the archive upload to GCS or S3 and set up the versioned prefix scheme.
  4. Dataset builder. Write the script that joins the per-task effort index with git metrics to produce the prediction dataset.
  5. Baseline model. Train and evaluate the baseline duration-prediction model on the dataset. Publish the p50/p90 per-slice-class table as a static docs/ page once it has enough data to be meaningful.
  6. Gantt renderer. Write the script and add a make gantt target.

Steps 1-2 can proceed independently of 3-6 and are the highest-value items: the backfill populates the effort ground truth that all downstream uses depend on.

Authority and Privacy

  • The transcript archive and the run-telemetry log are not committed to the repository. They are local/private artifacts.
  • The per-task effort index written to target/ is a derived artifact and is gitignored; it may contain session ids and durations but no message content.
  • The docs/tasks/done/ entries carry only started:, completed:, and commits: fields sourced from the telemetry and git; they do not carry message content, file system paths, or host-identifying information.
  • The prediction dataset contains only git-derived metrics and duration labels; no transcript content.

Relationship to Existing Proposals

  • Task State and Agent Telemetry (task-state-and-agent-telemetry-proposal.md): the task file schema and run-telemetry structure that this proposal reads from. The two proposals are complementary: that proposal defines the task lifecycle and local operational state; this proposal defines what to do with the timing data once it exists.
  • agentic development experiment (capOS Agentic Development Experiment): the autonomous development loop whose run-telemetry log is the primary timing source for loop-era tasks.

Open Questions

  • Idle threshold. What inter-turn gap counts as idle and is excluded from active duration? A 30-minute threshold is a reasonable starting point; the right value depends on the observed gap distribution in the transcript archive.
  • Multi-branch tasks. Some tasks span a prerequisite branch plus a follow-up fix branch. The current model treats each merge as a separate record; a cleaner approach may be a parent-task: field in the task front matter so the effort can be rolled up.
  • Backfill completeness. Transcript files from early project history may be incomplete or unavailable. The normalizer must handle missing sessions gracefully; the dataset must mark incomplete records rather than imputing durations.
  • Model selection. Whether a simple linear baseline is sufficient or whether a richer model (gradient-boosted trees, conformal prediction intervals) is warranted depends on the dataset size and variance. Defer this decision until the backfill pass is complete and the distribution is known.

Design Grounding

  • Task ledger schema and run-telemetry promotion: docs/tasks/README.md and task-state-and-agent-telemetry-proposal.md.
  • Commit-provenance trailers: tools/githooks/prepare-commit-msg.
  • Slice-class vocabulary and hazard families: CLAUDE.md (Autonomous Slice Hygiene section) and REVIEW.md.

Proposal: Session-Bound Invocation Context

Current design authority now lives in Session Context, with endpoint transport details in IPC and Endpoints. This proposal is retained as the archival decision record for why capOS replaced caller-selected endpoint identity and the service-object migration with session-bound invocation context.

Replace caller-selected endpoint identity and the Service Object Identity Migration with a simpler invariant: every process runs in exactly one live session context. The kernel attaches that context to invocations and enforces privacy/transfer invariants, but does not reveal subject details to endpoint servers unless the call explicitly requests disclosure and policy allows the requested fields through a broker/service disclosure scope.

Capabilities decide what a process may call. The calling process’s session context says who invokes, subject to privacy rules. Services receive only the minimum routing/privacy metadata required by the invoked capability; request fields remain ordinary data and must not select authority or caller identity.

Problem

The prior service-object direction fixed a real bug: clients must not be able to choose a service-visible numeric badge during spawn or IPC delegation. The design then added service-minted object capabilities and a subject/proof open protocol so services could bind identity without trusting request payloads.

That is too much machinery for the intended capOS process model. Normal workload processes should not be bags of unrelated user sessions. They should have one immutable session context, assigned at spawn, and all invocations from that process should be attributable to that context. Delegated-subject on-behalf-of behavior is a separate design and is intentionally out of this first implementation path.

The target should therefore remove the caller-selected badge without replacing it with a second service-object identity system. For a service such as chat, holding ChatRoot already means the process may attempt to join chat under its own session. More granular authority can come from narrower capabilities granted by AuthorityBroker, not from client-selected receiver selectors or local proof tokens on every open call.

Decision

capOS adopts these invariants:

  • Each process has exactly one immutable SessionContext.
  • The session context is assigned at spawn and shared by all threads in that process.
  • System services run under explicit service/system sessions.
  • Network gateways create or select a session for each admitted connection and spawn per-session workers or shells; they do not run multiple user sessions as ambient subject context inside one ordinary workload process.
  • Endpoint CALL delivery includes a privacy-preserving caller-session reference and optional freshness result, not full subject metadata by default.
  • A held capability is the authority to invoke service root methods such as ChatRoot.join; the caller session supplies the invocation subject context. Services learn principal, profile, or display metadata only through explicit disclosure.
  • Request fields such as user, role, participant, principal, or session are data. Services may validate them against the caller session, but they do not identify the caller or authorize by themselves.
  • Subject disclosure is opt-in and policy-bounded. A call must explicitly ask for disclosure, and the requested fields must be allowed by a service-specific disclosure capability/scope. Without both signals, the server gets only an opaque session-local handle suitable for same-session state and audit correlation within that service.
  • Cross-session capability transfer is supported when the transferred cap’s transfer scope permits it. The transferred cap carries invoke authority; the receiver’s session remains the invocation subject. Session-local caps require an explicit broker or service regrant operation.

The existing synthetic service-object routing proof remains useful as evidence that request bytes cannot spoof endpoint receiver metadata, but the service object identity model is no longer the active design direction.

Normative Invariants

  • Every normal workload process has exactly one immutable SessionContext.
  • SessionContext is installed only by trusted spawn, session-manager, or broker paths; request payloads, shell strings, manifest data, endpoint receiver metadata, and copied UserSession caps cannot mutate or replace it.
  • Capability possession remains the authority to invoke an interface. A live session without the target capability cannot call the target service.
  • A normal endpoint call from a dead, revoked, or stale workload session fails closed, except for explicitly designated recovery, logout, or renewal caps.
  • Session liveness is a revocable lease state, not only a timestamp embedded in immutable process metadata. A session may be live, logged out, revoked, expired, or recovery-only.
  • Renewal must not relabel an existing process to a different session subject and must not blindly revive all previously issued grants. Renewal either extends the existing session liveness record under policy or returns fresh broker grants with distinguishable grant/session epochs.
  • Endpoint default delivery never includes global principal, profile, account, role, tenant, external-claim, auth-factor, display-name, or source-network fields.
  • Subject-detail disclosure requires both an explicit method/call disclosure request and a matching service-scoped disclosure scope.
  • Disclosure is field-granular and service-scoped; an opaque session reference from one service is non-portable and non-authority-bearing in another.
  • Cross-session raw cap transfer is rejected unless the cap’s transfer scope permits it.
  • After an allowed cross-session transfer, the receiver process session is the invocation context; raw transfer never implies act-on-behalf-of source session semantics.
  • service_regrant_only caps cannot cross sessions through raw copy, move, IPC, or spawn grants. A service or broker regrant path must mint the target session authority explicitly.
  • Legacy receiver metadata remains internal transport state. It must not be user-facing syntax, manifest policy, subject disclosure, or service identity.

Authority And Context

Capability possession answers one question:

May this process invoke this capability/interface at all?

It does not answer:

Which live session is this invocation attributable to?
Is that session still fresh?
Which resource/profile bucket should pay for server-side state?
What subject facts may this service learn?
May this capability be transferred into another session?

Those are invocation-context and disclosure questions. The split is deliberate. ChatRoot can mean “the holder may ask chat to join”; it does not by itself tell chat whether the call is from an operator, a guest, an anonymous Telnet session, or an expired session, nor whether chat may see a global principal id.

A service decision has three layers:

capability authority
+ invocation subject context
+ service-local policy/state

Only the first layer is authority to invoke. The session layer supplies information about who invokes, freshness, resource/accounting labels, and what may be disclosed to the service. Service-local policy may accept or reject the operation based on that information, but the session context is not a second capability.

Examples:

  • ChatRoot means the holder may ask chat to join, subject to chat policy and whatever session facts the call explicitly requests and broker/service policy makes available to chat.
  • ChatModerator means the holder may call moderator methods, again under the caller’s live session.
  • TerminalSession means the holder may read/write that terminal endpoint, but audit and policy still see the process session.

Session-bound invocation context exists so services can make those second-order decisions without trusting payload fields and without forcing the kernel to reveal private subject metadata to every endpoint server. The kernel can say “this call came from a live session and here is an opaque service-scoped reference”; the service or broker can decide whether that is enough, whether a guest-specific facet is required, or whether the user must explicitly disclose bounded subject facts.

The kernel enforces capability possession, process session assignment, and disclosure invariants. It may report freshness/liveness as invocation context. Session expiry should bound behavior through capability lifecycle, broker refusal, or service policy, not by treating the session context itself as a second authority. The kernel still does not interpret chat rooms, handles, moderator state, adventure players, account roles, OIDC claims, or tenant groups.

Privacy And Disclosure

Session-bound invocation context must not become ambient subject leakage. A service should not receive global principal identifiers, account names, display names, profile names, external issuer keys, group claims, auth factors, source network, or tenant metadata merely because a process called an endpoint.

The default endpoint metadata is privacy-preserving:

caller_session_ref = opaque, service-scoped, non-portable reference
session_live = true/false or epoch/freshness result

That is enough for a service to keep per-session state, reject stale sessions, and correlate its own audit events without learning a broader identity.

Current proof implementation:

  • scoped_ref: low 64-bit ABI field of the opaque reference.
  • scoped_ref_hi: high 64-bit ABI field of the opaque reference.
  • epoch: u64.
  • derivation: HMAC-SHA256 with an entropy-backed boot key, a non-reused endpoint service-scope id, and the kernel session id.

The ABI layout is preserved, but the old unkeyed low-half value is not. Both scoped_ref and scoped_ref_hi are halves of the keyed opaque reference. epoch is a separate domain-separated keyed value so service-local freshness/audit correlation rotates with the same boot key and endpoint scope without being folded into the opaque reference itself.

Current caller_session_ref derivation rules:

width:
  128 bits minimum for the opaque reference, separate from freshness epoch.

derivation:
  keyed opaque value over boot secret, service scope, and kernel session id.

scope:
  a non-reused endpoint service-scope id plus the boot-scoped key. Endpoint
  object replacement or boot-key replacement intentionally rotates the
  reference. Stable service-audit identity across upgrades remains future work.

reuse:
  logout/login or session recreation gets a new kernel session id and therefore
  a new service-scoped reference.

stale epoch:
  stale references may remain recognizable to the same service for bounded
  audit/denial correlation, but they must not become live again after expiry.

service move/upgrade:
  endpoint replacement currently breaks correlation. Retaining correlation
  across service replacement requires a future stable service-audit scope.

privacy:
  global principal, account, profile, display name, auth source, and tenant
  metadata are not derivable from the opaque reference without broker/audit
  disclosure authority.

Richer disclosure requires both an explicit act and an allowed policy scope:

  • the client calls a method whose contract requests disclosure, such as ChatRoot.join(discloseProfile = true, handle = "alice"), or transfers a SessionDisclosure capability as part of that call;
  • AuthorityBroker or service policy grants a root/facet with a matching disclosure scope, such as “chat may see display name and profile class”;
  • an administrator-configured system service may expose methods whose contract explicitly requests audit disclosure, but those methods still need bounded service policy for the fields they receive.

Disclosure should be minimized and service-scoped. A chat service may need a display name, guest/operator class, and per-service audit pseudonym. It does not need raw OIDC claims, credential identifiers, account-store records, or global principal ids unless a later policy explicitly grants that.

Session Context

A SessionContext is kernel-carried metadata minted through trusted session creation paths and installed by ProcessSpawner:

SessionContext {
    session_id,
    principal_id,
    principal_kind,
    auth_strength,
    policy_profile_id,
    resource_profile_id,
    created_at_ms,
    expires_at_ms,
    epoch,
}

The exact ABI can be smaller in the first implementation. The required properties are immutability for the process lifetime, a stable kernel-visible session id for enforcement, a service-scoped opaque reference for default endpoint delivery, and enough freshness metadata for brokers/services to fail closed or revoke/withhold capabilities when a session expires or is revoked. These conceptual fields may exist in trusted session storage. They are not endpoint-delivered default metadata. Endpoint delivery gets only a service-scoped opaque session reference and liveness/freshness result unless an explicit disclosure request and matching disclosure scope allow named fields.

The session context is not a replacement for capabilities. A process with a valid operator session but no ChatRoot cannot join chat. A process with ChatRoot but an expired session should lose or fail to refresh the capability authority that was issued for that session.

Session Lifecycle, Logout, And Renewal

The completed milestone proves fail-closed stale-session behavior for current shell and endpoint authority. Follow-up lifecycle slices now provide a kernel-backed mutable liveness record for SessionManager-minted sessions, remote gateway logout/close propagation, and endpoint RETURN cleanup for already-admitted calls after caller logout/session death. Fixed wall-clock expiry is still not a usable long-running interactive policy by itself: production session lifecycle also needs revocation, renewal/recovery, live proxy cleanup, audit reason separation, and a dedicated result-cap move-source rollback proof. Clean local owner-shell exit now calls the held UserSession.logout() before process exit; richer shell replacement and renewal UX remains future work.

The intended liveness model is:

SessionContext {
    session_id,
    principal_id,
    principal_kind,
    auth_strength,
    policy_profile_id,
    resource_profile_id,
    created_at_ms,
    liveness_cell_id,
}

SessionLivenessCell {
    session_id,
    session_epoch,
    state: live | logged_out | revoked | expired | recovery_only,
    not_before_ms,
    not_after_ms,
    policy_epoch,
    resource_profile_epoch,
    audit_record_id,
}

SessionContext remains immutable for the process lifetime. The liveness cell is trusted session-manager state that can be logged out today and later revoked, expired, or renewed. This preserves the one-session-per-process invariant while allowing usable session renewal and explicit logout. A process cannot install a different session id into itself; if policy requires a new subject, the broker launches a replacement process or shell with a new SessionContext.

This splits lifetime checks into three composable layers:

session liveness:
    Is this process's invocation subject still live?

grant lease:
    Is this broker-issued bundle or individual grant still valid?

object/facet epoch:
    Has the target live object/facet generation been revoked or replaced?

The kernel or trusted wrapper caps should check session liveness before normal endpoint enqueue, before local non-endpoint shell-bundle operations, and before installing fresh result caps into a caller. Broker-issued caps may additionally bind to grant leases. Service objects and endpoint-backed facets keep using object epochs or service-specific revocation for target invalidation. The current endpoint RETURN path rechecks caller liveness before copying result bytes, application-exception payloads, result-cap records, or returned caps into the caller; stale returns cancel the in-flight call and notify the caller with invoke-failed when a completion can be posted.

Renewal is a narrow recovery operation, not generic authority resurrection:

  • pre-expiry renewal may extend the liveness cell when account state, policy epoch, resource profile, auth freshness, and maximum lifetime permit it;
  • post-expiry calls are limited to explicit logout, renewal, recovery, and bounded self-diagnostic methods;
  • renewal returns fresh grant leases or wrapper caps when existing grants need a new policy decision;
  • old ordinary grants do not become fresh merely because the session renewed;
  • explicit revocation beats renewal except for a separately named recovery policy;
  • password-authenticated local shells should default to explicit logout, terminal/connection close, process-tree exit, or administrator revocation rather than an unavoidable short wall-clock TTL. Idle lock, step-up, or renewal prompts are policy options, not kernel authority rules.

Logout and clean owner-shell exit close the liveness cell for sessions owned by that shell or gateway through UserSession.logout(). Closing the shell process still releases local cap table edges through process-exit cleanup, but session logout is the operation that makes the session no longer live for retained session-bound grants, children, and future broker decisions.

Kernel Contract

The kernel should enforce generic mechanics only:

  • A process has one session context pointer or compact session descriptor.
  • Spawning a child requires selecting the child’s session context. The default is to inherit the parent’s session; creating a different session is broker or session-manager capability authority.
  • Session expiry is represented as freshness metadata and capability lifecycle: normal workload endpoint calls from dead, revoked, or stale sessions fail closed except for explicit recovery, logout, or renewal caps. The current implementation rejects stale normal endpoint invocations before transfer preparation or enqueue, rejects fresh shell-bundle minting for stale sessions, and expires retained broker-issued non-endpoint shell bundle caps at their bound session boundary. RestrictedLauncher rejects spawn/list calls after the session it was minted for expires, and broker-issued SystemInfo results are session-bound wrappers. The current endpoint RETURN path also rejects already-admitted returns after caller logout/session death before installing result bytes, application-exception payloads, result-cap records, returned caps, or move-source commits into the stale caller. The session context itself is not the authority being invoked. Remaining lifecycle work should extend the mutable liveness cell from logout to administrator revocation, recovery-only state, and pre-expiry renewal without relabeling a running process.
  • Endpoint delivery includes privacy-preserving caller session metadata alongside the existing method, params, transfer descriptors, and result target. It must not include subject details unless the SQE/method contract explicitly requests them and a granted disclosure scope permits them. The current implementation uses a CALL SQE disclosure mask intersected with cap-held disclosure scope for field-granular delivery; unsupported fields are rejected or narrowed, and global principal ids and display names remain absent from default endpoint metadata.
  • Capability transfer checks session scope. Same-session transfer preserves the held cap. Cross-session transfer is rejected unless the cap is explicitly cross-session-shareable or the transfer is the result of a broker/service delegation method.
  • Legacy receiver metadata remains transport state only. It must not be exposed as user-facing identity syntax, manifest policy, service capability, or a workaround for subject disclosure.

The kernel should not validate external tokens, parse account stores, evaluate roles, or choose application objects.

Broker And Service Contract

AuthorityBroker and related session services decide which capabilities a session receives:

SessionManager.login/guest/anonymous -> UserSession metadata/control cap
trusted broker/session-manager spawn path -> child SessionContext
AuthorityBroker.shellBundle(session) -> launcher fixed to that SessionContext,
                                   ChatRoot, SystemInfo, ...

For basic local service access, no additional subject/proof token is required. The process session context supplies caller information and a default service-scoped session reference, and the held capability supplies access to the service. Human-readable or policy-rich subject details are separate disclosure, not automatic endpoint metadata.

UserSession remains useful as an informational/control capability and broker input. It is not itself the ambient invocation subject, and copying it into a process cannot install a second process session. A trusted broker or session-manager path may use a verified UserSession to spawn a child with a matching immutable SessionContext; ordinary cap transfer only transfers that capability object.

External assertions still stop at the admission boundary. OIDC, passkey, certificate, cloud workload, or SSH-authenticated claims are validated by admission/session services, normalized into a local or pseudonymous session, and then disappear from ordinary application calls. Chat should not parse OIDC claims, and ChatRoot.join should not require a bearer proof object merely to learn who the caller is.

Chat Flow

The target chat flow is:

login/setup/guest
    -> UserSession metadata/control cap

trusted broker/session-manager spawn path
    -> child process with SessionContext(operator or guest)

AuthorityBroker.shellBundle(session)
    -> ChatRoot if the profile may use chat

spawn chat-client with inherited session and ChatRoot

chat-client:
    ChatRoot.join(channel = "general", handle = "alice")

The kernel delivers the endpoint call with privacy-preserving caller session metadata:

target = ChatRoot
method = join
caller_session_ref = chat-scoped opaque session reference
session_live = true
payload = { channel = "general", handle = "alice" }

chat-service checks:

  • the caller holds ChatRoot;
  • the caller session is live;
  • the requested channel and handle are syntactically valid request data.

Then it stores service-local state keyed by the caller session:

ParticipantRecord {
    caller_session_ref,
    service_assigned_member_label,
    optional_disclosed_display_name,
    joined_channels,
    quota_bucket,
    audit_context,
}

If chat needs to distinguish operator from guest, use explicit disclosure with a matching disclosure scope. If chat only needs narrower behavior, the broker may grant GuestChatRoot with behavior that encodes the policy without revealing subject fields. The service should not receive the global principal id by default.

Later calls can use the same root/facet capability:

Chat.send(channel = "general", text = "hi")
Chat.poll(max_events = 32)
Chat.who(channel = "general")

If the service permits multiple handles for one session, it may return a server-issued participant_id as data. That id must be scoped to the caller session and validated on every use:

Chat.send(participant_id = 7, channel = "general", text = "hi")

participant_id = 7 is not transferable authority. A different session cannot use it unless chat or the broker performs an explicit share/delegation operation.

Moderator behavior is a narrower capability, not a generic role bit in a payload:

AuthorityBroker.shellBundle(operator_session) -> ChatModerator
ChatModerator.kick(participant_id, channel)

The call still carries the operator session for audit and policy.

Transfer Rules

Same-session delegation is ordinary capability transfer:

operator shell -> child helper in the same session
    transfers ChatRoot or ChatModerator

The child acts under the same session context, so no subject ambiguity exists.

Cross-session transfer is where the distinction matters most:

capability transfer carries authority to invoke;
the receiver process session supplies who invokes.

If session A transfers a cap to session B and the transfer is allowed, later calls are made by session B, not by session A. The service sees the transferred capability as the invoked authority and session B as the invocation subject context. It must not infer that session B is impersonating session A merely because the cap originally came from A.

This is acceptable for caps whose semantics are deliberately shareable, such as a read-only document, a public chat invite, or a scoped terminal endpoint intended for handoff. It is wrong for caps that encode session-local standing, such as “my chat participant”, “my account settings”, or “my active adventure player”, unless the service explicitly defines what sharing means.

Therefore caps need an explicit transfer scope:

  • same_session: may move/copy only to processes with the same session context;
  • cross_session_shareable: may be transferred to another session and then invoked as the receiver’s session;
  • service_regrant_only: cannot be raw-transferred across sessions; the holder must ask the service or broker to issue a new cap for the target session.

Session-local services that want to share state across sessions should use an explicit regrant/share path:

Chat.share(participant_id, target_session_or_invitation)
AuthorityBroker.delegate(source_session, target_session, requested_cap)

The service or broker records the policy decision and mints or grants the appropriate capability for the target session. Raw transfer of a session-scoped cap across sessions must fail closed unless the cap has an explicit cross-session-shareable scope.

This keeps privacy and accountability aligned. The transferred cap is not a portable identity token for the source session. If the receiver invokes it, the receiver’s session context is used for audit/disclosure by default. If the service needs to preserve source attribution, it should encode that as service-local state during an explicit share/regrant operation, not rely on the kernel to attach source-session subject data to future receiver calls.

The useful matrix is:

cap transfer only:
    receiver gets authority to invoke;
    receiver invokes as its own process session.

service regrant:
    service or broker issues a new target-session capability;
    future calls still invoke as the target process session.

What Happens To Service Object Routing

The synthetic service-object routing proof added in commit a4655f0 should not drive the next design step. Its useful artifacts are narrower:

  • delegated-client relabeling is contained;
  • receiver-cookie spoofing through request bytes is tested;
  • close/revoke/stale-cookie paths have coverage;
  • internal receiver metadata can be generation-checked.

Those mechanics can remain as low-level transport tests. They are not the application authority model. The completed migration stopped before subject/proof root opening and shared-service conversion to service object capabilities.

Migration Plan

  1. Record this proposal as the selected Stage 6 direction and mark Service Object Identity Migration as superseded.
  2. Add the kernel/process invariant: every process has exactly one immutable session context, including explicit service/system sessions.
  3. Thread caller session metadata through endpoint CALL delivery.
  4. Define session freshness propagation and the cap lifecycle rule needed to close the open review finding: expired sessions must not continue to receive or refresh interactive capability authority.
  5. Define cap transfer scopes for same_session, cross_session_shareable, and service_regrant_only.
  6. Replace chat’s legacy receiver-selected member identity with session-keyed participant state and broker-granted ChatRoot/ChatModerator facets. The first chat migration is implemented for ordinary Chat membership: member records are keyed by the endpoint caller-session key, visible member labels are service-assigned, and join handles remain non-authority request data.
  7. Apply the same pattern to adventure and terminal/stdio bridges. Aurelian ordinary player state is keyed by live endpoint caller-session metadata instead of receiver badges. Terminal output requires live caller-session dispatch, and shell-serviced stdio bridge waits bind to opaque live caller-session metadata while rejecting mismatched callers. Focused adventure NPC/chat authority is broker- or manifest-issued rather than caller-chosen.
  8. Retire user-facing badge/receiver selector syntax. Keep receiver metadata only as internal endpoint transport state or hostile-test fixture.

Non-Goals

  • Reintroducing POSIX uid/gid authorization.
  • Allowing clients to choose identity through request bytes.
  • Making external tokens ordinary application-service credentials.
  • Delegated-subject or act-on-behalf-of semantics; those belong in a separate proposal and should not block this first implementation path.
  • Preserving Service Object Identity Migration as the active design.
  • Building network-transparent object references in this slice. Future remote-capability transport is grounded separately in Spritely, OCapN, and CapTP and must preserve this proposal’s local rule that sessions are broker/kernel-attached, not chosen by request bytes.

Open Questions

  • Whether all caps are same_session by default, or whether every cap entry should carry an explicit same_session, service_regrant_only, or cross_session_shareable scope.
  • How much session metadata should be copied into endpoint delivery headers versus looked up by session_id in a kernel/session table.
  • Whether multi-connection gateways must always spawn per-session workers, or may multiplex unauthenticated transport while delegating all session-bearing work to child processes.

Proposal: Storage, Naming, and Persistence

What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.

The Problem with Filesystems

In Unix, the filesystem is the universal namespace. Everything is a path: /dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket. Paths are ambient authority — any process can open /etc/passwd if the permission bits allow. The filesystem conflates naming, access control, persistence, and device abstraction into one mechanism.

capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:

  • No global namespace needed — each process sees only its granted caps
  • No path-based access control — the cap IS the access
  • No distinction between “file”, “device”, “socket” — everything is a typed capability interface

A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.

Core Insight: Cap’n Proto Everywhere

Cap’n Proto is already used in capOS for:

  • Interface definitions.capnp schemas define capability contracts
  • IPC messages — capability invocations are capnp messages
  • Serialization — capnp wire format crosses process boundaries

If we extend this to storage, then:

  • Stored objects are capnp messages
  • Configuration is capnp structs
  • Binary images are capnp-wrapped blobs
  • The boot manifest is a capnp message describing the initial capability graph

No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.

Architecture

Three Layers

Target architecture after the manifest executor and process-spawner work:

Boot Image (read-only, baked into ISO)
  │
  │  capnp-encoded manifest + binaries
  │
  v
Kernel (creates initial caps from manifest)
  │
  │  grants caps to init
  │
  v
Init (builds live capability graph)
  │
  ├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
  │
  ├──> Store service (capability-native content-addressed storage)
  │      backed by: virtio-blk, RAM, or network
  │
  └──> All other services (receive Directory, Store, or Namespace caps)

Layer 1: Boot Image

The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:

struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob: first-process metadata plus service graph
    initConfig @2 :CueValue;
    # Kernel boot parameters
    kernelParams @3 :SystemConfig;
}

struct NamedBlob {
    name @0 :Text;
    data @1 :Data;
}

struct CueValue {
    union {
        null @0 :Void;
        boolean @1 :Bool;
        intValue @2 :Int64;
        uintValue @3 :UInt64;
        text @4 :Text;
        bytes @5 :Data;
        list @6 :List(CueValue);
        fields @7 :List(CueField);
    }
}

struct CueField {
    name @0 :Text;
    value @1 :CueValue;
}

Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:

{
    name:                "client"
    expectedInterfaceId: 0xacf0c15a7b2e0041
    source: service: {
        service: "endpoint-server"
        export:  "client"
    }
}

Kernel and service source objects inside initConfig select the authority to grant. The expectedInterfaceId field carries the generated Cap’n Proto interface TYPE_ID and only checks that the granted object speaks the expected schema. It cannot replace source identity: many different objects may expose the same interface while representing different authority.

The build system (Makefile) generates this manifest from a human-authored description and packs it into the ISO as manifest.bin. Current code embeds every SystemManifest.binaries entry into that manifest as NamedBlob data, including the release-built init and smoke-demo ELFs. The kernel now boots only initConfig.init; focused init-executor manifests expose the manifest to the separate init binary as a read-only BootPackage capability, while default shell-led manifests boot capos-shell directly without a BootPackage executor. Remaining cleanup is to narrow the long-term boot package shape after the single-init split.

Using a CueValue tree instead of AnyPointer keeps the manifest directly decodable in no_std userspace without depending on Cap’n Proto reflection.

Transitional Schema Note

ServiceEntry, CapSource::Service, and ServiceEntry.exports are no longer kernel schema fields. ProcessSpawner, copy/move cap transfer, focused init-owned generic manifest execution, the default standalone-init service graph, focused shell-led login smokes, and the 15.4 initConfig schema split are implemented. The current boot manifest shape is:

struct SystemManifest {
    # Manifest schema version, validated before other fields
    schemaVersion @0 :UInt32;
    # Binaries available at boot, keyed by name
    binaries @1 :List(NamedBlob);
    # Init's config blob (replaces the service graph)
    initConfig @2 :CueValue;
    # Kernel boot parameters (serial policy, shell MOTD, feature flags)
    kernelParams @3 :SystemConfig;
}

ServiceEntry / CapRef disappeared from the schema and became plain CUE fields inside initConfig.services. Init reads them at runtime and calls ProcessSpawner directly. validate_manifest_graph, validate_bootstrap_cap_sources, and the remaining transitional service-graph schema are no longer kernel bootstrap checks. They remain in capos-config for mkmanifest and the focused init executor while that executor still accepts the transitional service graph. Kernel bootstrap already uses a first-service cap-table builder rather than the old multi-service resolver. See docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields After Stage 6” for the deprecation plan.

During the current transition, initConfig.init is still per-manifest launch metadata: it selects the single boot process binary and the kernel-sourced caps for that process. initConfig.services, cross-service cap sources, exports, and restart policy are init-owned configuration for focused executor manifests. Focused harnesses that boot a demo as init keep using that first-process cap bundle until those smokes are migrated behind a fixed generic init.

Layer 2: Kernel Bootstrap

Target design for the kernel’s boot role:

  1. Parse the system manifest (read-only capnp message from Limine module).
  2. Hash the embedded binaries for optional measured-boot attestation.
  3. Create kernel-provided capabilities: Console, Timer, DeviceManager, ProcessSpawner, FrameAllocator, VirtualMemory (per-process), and a read-only BootPackage cap exposing SystemManifest.binaries and initConfig.
  4. Spawn init — exactly one userspace process — with that cap bundle.

Current boot has reached the single-init split and the initConfig schema split. system.cue puts the standalone init binary in initConfig.init for the default service-graph process; init reads BootPackage and starts the shell, remote-session CapSet gateway, and resident services from initConfig.services. Focused shell-led manifests such as system-smoke.cue still put capos-shell in initConfig.init for narrow login proofs. Focused init-executor manifests such as system-spawn.cue also put the separate init binary in initConfig.init; that binary reads BootPackage and spawns the focused demo graph from initConfig.services through ProcessSpawner. The unused kernel resolver has been retired. The remaining cleanup is replacing per-manifest init bundles with a fixed generic-init bootstrap ABI.

Layer 3: Init and the Live Capability Graph

Target init reads initConfig from the BootPackage cap and executes it:

fn main(caps: CapSet) {
    let spawner = caps.get::<ProcessSpawner>("spawner");
    let boot = caps.get::<BootPackage>("boot");
    let config = boot.init_config()?;  // CueValue

    // Walk service entries from the config and spawn in dependency order
    for entry in config.field("services")?.iter()? {
        let binary = boot.binary(entry.field("binary")?.as_str()?)?;
        let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
        let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
        running_services.insert(entry.field("name")?.as_str()?.into(), handle);
    }

    supervisor_loop(&running_services);
}

In this target model, init is a generic manifest executor rather than a hardcoded service graph. The system topology is defined in the boot package’s initConfig, not in init’s source code. Changing what services run means rebuilding the boot image with a different config blob, not recompiling init. Manifest graph resolution stops being a kernel concern.

The current transition uses initConfig.services as the service graph; init reads the BootPackage manifest, validates a metadata-only ManifestBootstrapPlan, resolves kernel and service cap sources, records exported caps, spawns children in manifest order, and waits for their ProcessHandles.

Two Storage Models

capOS supports two complementary storage models, both exposed as typed capabilities:

Filesystem Capabilities (Directory, File)

For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and for POSIX compatibility. A filesystem service wraps a BlockDevice and exports Directory/File capabilities.

BlockDevice (raw sectors)
    │
    └──> Filesystem service (FAT, ext4, ...)
              │
              ├──> Directory caps (namespace over files)
              └──> File caps (read/write byte streams)

This model maps naturally to USB flash drives, NVMe partitions, and network-mounted filesystems. The open() and sub() operations return new capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).

Capability-Native Store (Store, Namespace)

For capOS-native data: configuration, service state, content-addressed object storage. A store service wraps a BlockDevice and exports Store/Namespace capabilities.

BlockDevice (raw sectors)
    │
    └──> Store service
              │
              ├──> Store cap (content-addressed put/get/list inventory)
              └──> Namespace caps (mutable name→hash mappings)

Content-addressing provides automatic deduplication, verifiable integrity, and immutable references. Store.list returns the live inventory of content hashes in that Store, so holders that need crash/reboot recovery can rediscover stored content without a separate mutable root pointer. Namespaces add mutable bindings on top when callers need stable names rather than inventory scans.

Bridging the Two Models

The models are composable. An adapter service can bridge between them:

  • FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
  • StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
  • Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory

In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.

File I/O Interfaces

Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See User Identity and Policy.

BlockDevice

Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass storage). The driver receives hardware capabilities (MMIO, IRQ, FrameAllocator for DMA) and exports a BlockDevice cap.

interface BlockDevice {
    readBlocks  @0 (startLba :UInt64, count :UInt32) -> (data :Data);
    writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
    info        @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
    flush       @3 () -> ();
}

For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer capability instead of inline Data (see “Shared Memory for Bulk Data” below). The inline-Data variants work for metadata reads and small operations; the SharedBuffer variants avoid copies for large I/O.

File

Byte-stream access to a single file. Served by filesystem services. Created dynamically when a client calls Directory.open() — the filesystem service creates a File CapObject for the opened file and transfers it to the caller via IPC cap transfer.

interface File {
    read     @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write    @1 (offset :UInt64, data :Data) -> (written :UInt32);
    stat     @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate @3 (length :UInt64) -> ();
    sync     @4 () -> ();
    close    @5 () -> ();
}

close releases the server-side state for this file (open cluster chain cache, dirty buffers). The kernel-side CapTable entry is removed by the system transport via CAP_OP_RELEASE when the local holder releases it; capos-rt owned handles queue local releases on final drop and expose explicit release flushing for ordinary userspace. CapabilityManager is management-only (list(), later grant()); it does not expose a drop() method because ordinary handle lifetime belongs to the transport, not to an application call on the same table that dispatches it.

Attenuation: a read-only File wraps the original and rejects write, truncate, sync calls. An append-only File rejects write at offsets other than the current size.

Directory

Namespace over files on a filesystem. Served by filesystem services. open() and sub() return new capabilities via IPC cap transfer.

interface Directory {
    open    @0 (name :Text, flags :UInt32) -> (file :File);
    list    @1 () -> (entries :List(DirEntry));
    mkdir   @2 (name :Text) -> (dir :Directory);
    remove  @3 (name :Text) -> ();
    sub     @4 (name :Text) -> (dir :Directory);
    create  @5 (name :Text) -> ();
    rename  @6 (from :Text, to :Text) -> ();
}

struct DirEntry {
    name  @0 :Text;
    size  @1 :UInt64;
    isDir @2 :Bool;
}

sub() returns a Directory scoped to a subdirectory — the analog of chroot. The caller cannot traverse upward or see the parent directory. open() with create flags creates a new file if it doesn’t exist.

The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2, APPEND = 4. No READ/WRITE flags — those are determined by the Directory cap’s attenuation (a read-only Directory returns read-only Files).

Writable Directory Mutations and the Single-Writer Policy

create @5 makes a new empty file and rename @6 renames an entry within the same parent. Both have additive ordinals so the read-only Directory implementations stay wire-compatible — they simply reject the mutating methods (mkdir/remove/sub/create/rename) fail-closed, the way a read-only File rejects write. Unlike open with CREATE, create fails closed if the name already exists; rename fails closed if the source is absent or the destination already exists, and does not support cross-directory moves.

The first writable filesystem service adopts a fail-closed single-writer policy: a writable filesystem tree admits one writer at a time. The first granted cap to perform a mutation claims the writer slot; a mutation through any other concurrently granted cap fails closed with a typed Failed exception ("writable filesystem rejects a second concurrent writer (single-writer policy)") rather than racing. There is no lease/release lifecycle — the first writer keeps the slot — and list/sub reads are allowed for any holder. This deliberately closes the milestone’s concurrent-writer-policy decision without expanding scope to advisory locks, lock leases, or multi-writer coordination (see Open Question 6). The implementation (kernel/src/cap/writable_fs.rs, proof make run-storage-writable) is now disk-backed: it mounts a CAPOSWF1 sub-volume (a flat node-record array with parent pointers plus a bump-allocated data region) over the kernel-owned virtio-blk driver, keeps the RAM tree as the working copy, and write-through-commits every directory/file mutation in the order data sector → node-record sector → superblock (the ordering commit point), mirroring the disk-backed Store. The persistent Store CAPOSST1 sub-volume co-locates on the same disk image (at LBA 0; the filesystem superblock sits at a fixed higher LBA), so filesystem mutations and store object writes/deletes survive a reboot together — make run-storage-writable boots QEMU twice against one combined image and phase 2 verifies every surviving name, size, content, directory entry, and store object plus the deleted object’s absence.

Unclean-shutdown recovery is proven by make run-storage-writable-recovery. A slot becomes live on the next mount only once the superblock’s bumped node_count is observed, so a forced poweroff in the window between a node record’s durable write and that commit leaves an orphan slot the next mount ignores: the interrupted allocation is atomically absent, never a torn or half-live entry. The proof builds the kernel with the proof-only storage_writable_recovery feature, which arms an induced forced poweroff in exactly that window (recovery_crash_after_record); pass 1 commits durable mutations and a Store survivor and then triggers the window (the harness kill -9s QEMU after the kernel marker), and pass 2 re-mounts and verifies recovery to a consistent tree with the committed state intact, the interrupted allocation absent, no torn record, and a usable post-recovery write. The proof is bounded to that single record-vs-commit window under host-page-cache durability (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, and a kill -9 preserves the host page cache); it proves the superblock-commit ordering invariant, not a general media crash-consistency guarantee against host power loss or a lost write-back cache. The co-located CAPOSST1 Store now has bounded tombstone reclamation through make run-storage-persist; this does not add a new media power-loss guarantee or reclaim writable-file extents.

Writable File content paths layer onto the same tree. open with the CREATE/TRUNCATE/APPEND flags (or a write through the returned File) claims the same filesystem-wide writer slot, so file writes obey the single writer policy alongside directory mutations; a plain (flags == 0) open and the read/stat methods are reads allowed for any holder. write @1 overwrites or extends at the supplied offset, zero-filling any gap; a handle opened APPEND lands every write at end-of-file regardless of the offset argument. truncate @3 shrinks (discards the tail) or extends (zero-fills) the file, and close @5 releases only that handle — the file survives in the directory until Directory.remove, which marks the file node so any outstanding File cap fails closed. File content is bounded by MAX_FILE_BYTES (64 KiB) and persists to a bump-allocated disk extent on each mutation; a rewrite that outgrows the current extent allocates a fresh one and leaks the old (file-extent compaction deferred). Because each write/truncate already wrote through the block device (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, so there is no separate media barrier to issue), sync @4 succeeds as an honest write-side no-op (a read-only File still rejects it). Crash consistency rests on the superblock-commit ordering rather than a media barrier: an interrupted allocation is atomically absent on remount (proven by make run-storage-writable-recovery, above). A post-write media-durability flush against a write-back cache (for host power loss, not the guest-side forced poweroff that proof exercises) remains future hardening, not claimed here.

Syscall Trace: Reading a File from a FAT USB Drive

Four userspace processes: App, FAT service, USB mass storage, xHCI driver.

With promise pipelining (one submission):

Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:

# Single pipelined submission (SQEs with PIPELINE flag):
#   call 0: dir.open("report.pdf")         → answer_id=200, user_data=100
#   call 1: answer 200 result_cap[0].read(offset=0, len=4096)

cap_submit([
    {cap=2, method=OPEN, answer=200, user_data=100, params={"report.pdf", flags=0}},
    {cap=PIPELINE(answer=200, result_cap=0), method=READ, user_data=101, params={offset:0, length:4096}},
])
  → kernel routes call 0 to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject, replies with File cap as result cap 0
  → kernel sees pipelined call 1 targeting the File cap from call 0
  → kernel dispatches call 1 to the same FAT service (or direct-invokes
    the new File CapObject if it's a local endpoint)
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → USB mass storage → xHCI → hardware → back up
  ← completion: {data: [4096 bytes]}, File cap installed as cap_id=5

One app-to-kernel transition. The kernel resolves the pipeline dependency internally through the sideband CapTransferResult record at index 0; it does not inspect the Cap’n Proto result payload. The App never needs a userspace round trip for the intermediate File cap, though the cap is installed and usable afterward.

This is a core Cap’n Proto feature: by expressing “call method on the not-yet-resolved result of another call,” the client avoids a round-trip for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b") .open("file").read(0, 4096)), the savings compound — one submission instead of four sequential syscalls.

The capability-ring version should follow the Cap’n Proto/CapTP prior-art shape captured in Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP: pipelined targets live in answer/result-cap namespaces, not in caller-selected global ids; result-cap metadata stays outside the Cap’n Proto payload; broken answers propagate failure to dependent calls; and answer slots, queued dependent calls, queued bytes, and remote references are charged to bounded resource ledgers. This is design grounding, not an OCapN or Cap’n Web wire-compatibility target.

Without pipelining (two sequential ring submissions):

Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:

# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service via Endpoint
  → FAT service reads directory entry from BlockDevice
  → FAT service creates FileCapObject for this file
  → FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
  → kernel installs File cap in App's table → cap_id=5
  ← App reads CQE: result={file: cap_index=0}, new_caps=[5]

# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
  → kernel routes CALL to FAT service
  → FAT service maps offset → cluster chain → LBA
  → FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
      → kernel routes to USB mass storage
      → mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
          → kernel routes to xHCI driver
          → xHCI programs TRBs, waits for interrupt
          ← returns raw sector data
      ← returns sector data
  ← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}

This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.

In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.

Capability-Native Store

The Store Capability

Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.

interface Store {
    # Store a capnp message, returns its content hash
    put @0 (data :Data) -> (hash :Data);
    # Retrieve by hash
    get @1 (hash :Data) -> (data :Data);
    # Check existence
    has @2 (hash :Data) -> (exists :Bool);
    # Delete (if caller has authority — see note below)
    delete @3 (hash :Data) -> ();
}

Note on delete: In a content-addressed store, deleting a hash can break references from other namespaces pointing to the same object. delete on the base Store interface is dangerously broad — a StoreAdmin interface (separate from Store) may be more appropriate, with delete restricted to a GC service that can verify no live references exist. Open Question #3 (GC) should be resolved before implementing delete. The attenuation table below lists Store (full) as “Read, write, delete any object” — in practice, most callers should receive a Store attenuated to put/get/has only.

Content-addressed means:

  • Deduplication is automatic (same content = same hash)
  • Integrity is verifiable (hash the data, compare)
  • References between objects are just hashes embedded in capnp messages
  • No mutable paths — “updating a file” means storing a new version and updating the reference

Mutable References: Namespaces

A Namespace capability provides mutable name-to-hash mappings on top of the immutable store:

interface Namespace {
    # Resolve a name to a store hash
    resolve @0 (name :Text) -> (hash :Data);
    # Bind a name to a hash (if caller has write authority)
    bind @1 (name :Text, hash :Data) -> ();
    # List names (if caller has list authority)
    list @2 () -> (names :List(Text));
    # Get a sub-namespace (attenuated — restricted to a prefix)
    sub @3 (prefix :Text) -> (ns :Namespace);
}

A Namespace cap scoped to "config/" can only see and modify names under that prefix. This is the analog of a chroot — but structural, not a kernel hack. The sub() method returns a new Namespace cap via IPC cap transfer.

Future: union composition. The research survey recommends extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering. This adds composability without a global mount table. See research survey §6.

IPC and Capability Transfer

Several storage operations return new capabilities: Directory.open() returns a File, Directory.sub() returns a Directory, Namespace.sub() returns a Namespace. This requires dynamic capability management — the kernel must install new capabilities in a process’s CapTable at runtime as part of IPC.

The Capability Ring

All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.

Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.

#SyscallPurpose
1exit(code)Terminate current thread; process exits after its last live thread
2cap_enter(min_complete, timeout_ns)Process pending SQEs, then wait until enough CQEs exist or the timeout expires

Writing SQEs is syscall-free, but ordinary capability CALLs make progress through cap_enter. Timer polling handles non-CALL ring work and only CALL targets that explicitly opt into interrupt-context dispatch. cap_enter flushes pending SQEs and can block the process until min_complete completions are available or a finite timeout expires. An indefinite wait uses timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path without running arbitrary capability methods from timer interrupt context.

The ring structs and synchronous CALL dispatch are implemented and working. See capos-config/src/ring.rs for the shared ring structs and kernel/src/cap/ring.rs for kernel-side processing.

Ring Layout

One 4 KiB page per process, mapped into both kernel (HHDM) and user space:

┌─────────────────────────┐  offset 0
│ Ring Header              │  SQ/CQ head, tail, mask, flags
├─────────────────────────┤  offset 128
│ SQE Array (16 × 64B)    │  submission queue entries
├─────────────────────────┤  offset 1152
│ CQE Array (32 × 32B)    │  completion queue entries
└─────────────────────────┘

SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)

SQE Opcodes

Five opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, and lifecycle:

Opcodecapnp-rpc analogPurpose
CALLCallInvoke method on a capability
RETURNReturnRespond to incoming call (server side)
RECV(implicit)Wait for incoming calls on Endpoint
RELEASEReleaseDrop a capability reference
FINISHFinishRelease pipeline answer state
TIMEOUTPost a CQE after N nanoseconds (io_uring-inspired)

TIMEOUT is an alternative to the timeout_ns argument on cap_enter: it works with zero-syscall polling (kernel fires the CQE on a timer tick) and composes with LINK/DRAIN for deadline-based chains.

SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).

Promise Pipelining

A CALL SQE can target either a concrete CapId or a PromisedAnswer reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields). pipeline_dep names the earlier answer and pipeline_field is a zero-based CapTransferResult record index in that answer’s sideband result-cap list, not a Cap’n Proto schema field. The kernel resolves the dependency chain internally:

SQE[0]: CALL dir.open("report.pdf")        → answer_id=200, user_data=100
SQE[1]: CALL [PIPELINE: dep=200, result_cap=0].read(0, 4096)  → user_data=101

One cap_enter call. The kernel dispatches SQE[0], resolves result cap record 0 from the completion sideband, and dispatches SQE[1] against it without returning to userspace between steps or parsing the result payload.

The Endpoint Kernel Object

For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:

Client's CapTable                                   Server's CapTable
┌─────────────────┐                                 ┌──────────────────┐
│ cap 2: Proxy     │                                 │ cap 0: Endpoint   │
│   → endpoint ────────── Endpoint ◄──── RECV SQE ──│                  │
│   badge: 42      │     (kernel obj)                │                  │
└─────────────────┘                                 └──────────────────┘

The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id. The server responds by posting a RETURN SQE referencing the call_id.

interface_id is the transported schema ID for the interface being invoked. It should equal the generated TYPE_ID for that capnp interface. cap_id is the authority-bearing table handle; interface_id is only the protocol tag. The target capability entry owns one public interface; method_id selects a method inside that interface, while cap_id identifies the object being invoked. If the same backing state needs another interface, the transport should mint a separate capability entry for that interface rather than letting one handle accept multiple unrelated interface_id values.

Direct-Switch IPC

When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research survey §2.

Capability Transfer via Ring

Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp message bytes:

  • CALL params: params buffer contains the capnp message bytes followed by xfer_cap_count transfer descriptors packed at addr + len, which must be aligned to CAP_TRANSFER_DESCRIPTOR_ALIGNMENT.
  • RETURN results: server result buffers carry the capnp reply bytes and may carry return transfer descriptors on addr + len; the kernel inserts destination capability records in the caller’s result buffer after the normal result bytes. Count is reported in CQE cap_count and those records are written as CapTransferResult { cap_id, interface_id } values at result_addr + result. The requested result buffer (result_len) must be large enough for both normal reply bytes and all appended cap_count records.

xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved bits, _reserved0, or misalignment) fails closed as CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.

The capnp wire format’s WirePointerKind::Other encodes capability indices in messages. The sideband arrays map these indices to actual CapIds. The kernel does not parse capnp messages — it transfers a list of caps alongside the opaque message bytes.

Dynamic Capability Management

Every open(), sub(), or resolve() creates and transfers a new capability at runtime. The kernel’s CapTable insert() and remove() are the primitives. Capabilities flow through RETURN SQE sideband arrays (and through the manifest at boot). No separate cap_grant mechanism needed — authority flow follows the ring’s IPC graph.

The CapTable generation counter handles stale references: when a File cap is closed (slot freed, generation bumps), any cached CapId returns StaleGeneration instead of accidentally hitting a new occupant.

Shared Memory for Bulk Data

Copying file data through capnp Data fields works for metadata and small reads, but is impractical for anything above a few KB. A 1 MB read through a capability CALL copies data four times: device → driver heap → capnp message → kernel buffer → client buffer.

SharedBuffer Capability

SharedBuffer is the service-facing name this proposal uses for bulk-transfer buffers. The implemented kernel/user substrate is MemoryObject: a capability backed by physical pages that can be mapped into multiple address spaces simultaneously. Zero copies between processes.

interface MemoryObject {
    # Size and page count of the backing object.
    info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
    # Map a page-aligned object range into the caller's address space.
    map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    # Unmap a caller-local borrowed mapping backed by this object.
    unmap @2 (addr :UInt64, size :UInt64) -> ();
    # Update caller-local page permissions for a borrowed mapping.
    protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

The kernel creates MemoryObjects through the existing FrameAllocator capability. Held MemoryObject caps charge the holder’s frame-grant quota; mapped address-space pages are tracked as borrowed pages and keep the same backing alive until unmapped or process teardown. A later SharedBuffer alias or allocator may wrap this ABI for storage/network interfaces, but current code should use MemoryObject directly.

File I/O with SharedBuffer

File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:

# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}

# Large read: caller provides SharedBuffer, server fills it
let buf = frame_alloc.allocContiguous(256);  # 1 MB MemoryObject / SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel

Extended File interface with SharedBuffer support:

interface File {
    read      @0 (offset :UInt64, length :UInt32) -> (data :Data);
    write     @1 (offset :UInt64, data :Data) -> (written :UInt32);
    readBuf   @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
    writeBuf  @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
    stat      @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
    truncate  @5 (length :UInt64) -> ();
    sync      @6 () -> ();
    close     @7 () -> ();
}

The readBuf/writeBuf methods accept a SharedBuffer cap, currently a MemoryObject cap transferred via IPC. The server maps the buffer, performs DMA or memory copies into it, then returns. The caller reads directly from the mapped pages.

For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.

When to Use Each Mode

ScenarioMechanismWhy
Reading a 64-byte config valueFile.read() inline DataCopy overhead negligible
Reading a 10 MB binaryFile.readBuf() SharedBufferAvoids 4× copy overhead
FAT directory entry (32 bytes)BlockDevice.readBlocks() inlineSmall metadata read
Streaming video framesFile.readBuf() + ring of SharedBuffersContinuous zero-copy
Network packet buffersSharedBuffer ring between NIC driver and net stackDMA-capable pages

Attenuation

Storage services mint restricted capabilities using wrapper CapObjects:

CapabilityAuthority
Directory (full)Open, list, mkdir, remove, sub
Directory (read-only)Open (returns read-only Files), list, sub only
File (full)Read, write, truncate, sync
File (read-only)Read and stat only
File (append-only)Read, stat, write at end only
Store (full)Read, write, delete any object
Store (read-only)Get and has only
Namespace (full)Resolve, bind, list under prefix
Namespace (read-only)Resolve and list only
Blob (single object)Read one specific hash
SharedBuffer (read-only)Map as read-only (page table: R, no W)

An application that only needs to read its config gets a read-only Directory scoped to its config path. It can’t write, can’t see other apps’ directories, can’t access the raw BlockDevice.

Naming Without Paths

Traditional OS: process opens /var/lib/myapp/data.db — a global path.

capOS: process receives a Directory or Namespace cap at spawn time, opens "data.db" within it. The process has no idea where on disk this lives. It can’t traverse upward. There is no global root.

# Traditional: global path namespace
/
├── etc/
│   └── myapp/
│       └── config.toml
├── var/
│   └── lib/
│       └── myapp/
│           └── data.db
└── sbin/
    └── myapp

# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
  "config" → Directory(read-only, scoped to myapp's config files)
  "data"   → Directory(read-write, scoped to myapp's data files)
  "state"  → Namespace(read-write, scoped to myapp's store objects)
  "log"    → Console cap
  "api"    → HttpEndpoint cap

The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.

Configuration

Build-Time Config (Boot Manifest)

The system manifest is authored at build time. The human-writable source could be any format — TOML, CUE, or even a Makefile target that generates the capnp binary. What matters is that it compiles to a SystemManifest capnp message baked into the ISO.

Example source (TOML, compiled to capnp by a build tool):

[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
    { name = "device_mmio", source = { kernel = "device_mmio" } },
    { name = "interrupt", source = { kernel = "interrupt" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["nic"]

[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
    { name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
    { name = "timer", source = { kernel = "timer" } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["net"]

[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
    { name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
    { name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]

[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
    { name = "api", source = { service = { service = "http-service", export = "api" } } },
    { name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
    { name = "data", source = { service = { service = "store", export = "namespace" } } },
    { name = "log", source = { kernel = "console" } },
]

A build tool validates this against the capnp schemas (does virtio-net actually export "nic"? does http-service support endpoint() minting?) and produces the binary manifest.

Runtime Config (via Store)

Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.

Connection to Network Transparency

If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:

  • Local IPC: capnp message copied between address spaces by kernel
  • Local store: capnp message written to block device
  • Remote IPC: capnp message sent over TCP to another machine
  • Remote store: capnp message fetched from a remote store service

A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:

  • A Directory cap could be backed by local FAT or a remote 9P server
  • A Namespace cap could be backed by local storage or a remote store
  • A Fetch cap could route through a local HTTP service or a remote proxy
  • A ProcessSpawner cap could spawn locally or on a remote machine

The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.

Persistence of the Capability Graph

The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.

For true persistence (resume after reboot without re-initializing):

  1. Each service serializes its state to the store before shutdown
  2. On next boot, the manifest includes “restore from store hash X” hints
  3. Services read their saved state from the store and resume

This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.

Managed Cloud Backing

The local Store/Namespace interfaces define capOS persistence semantics. A cloud backend must be an adapter behind those interfaces, not a new ambient authority path. Services such as the adventure profile, expedition, and ledger services should serialize bounded Cap’n Proto records to a store capability; the caller should not know whether that store is backed by RAM, local disk, or a managed cloud service.

For cloud-first application data, use a narrow bridge service:

capOS service -> Store/Namespace or app-specific SaveStore cap -> Cloud bridge
              -> provider APIs

The bridge owns provider credentials and exposes only typed save/load/append operations. Ordinary clients never receive provider credentials, bucket names, database document paths, or broad write authority.

Recommended GCP mapping for game/profile style state:

  • Firestore Native mode for small mutable indexes and profile summaries that need transactional compare-and-set behavior.
  • Cloud Storage for larger immutable snapshots, evidence blobs, exports, and content-addressed objects. Object versioning and lifecycle policy should bound accidental overwrite recovery and storage growth.
  • Cloud Run for a small HTTPS or capnp-over-HTTP bridge endpoint when capOS cannot yet link provider SDKs directly.
  • Secret Manager for bridge-side service credentials and rotation; secrets do not enter ordinary capOS game clients.

Provider-specific records must still carry capOS-level schema version, content hash or release id, profile/tenant id, monotonic version, size limit, and migration policy. Writes that race on the same mutable profile or checkpoint must use an explicit version precondition and fail closed when stale. Append-only ledgers should append new records with previous-record hashes rather than rewriting history. Local QEMU tests should use a fake cloud bridge that enforces the same stale-write, append-only, wrong-profile, and size-bound rules before any real provider integration is accepted.

User-Owned Browser Transport

Some user data should be portable without giving the capOS service operator a database role over it. For private player backup/sync, a browser can act as the transport to user-owned storage:

capOS save service -> encrypted save capsule -> browser
browser OAuth/Firebase session -> Google Drive appDataFolder or Firebase user doc

This is not the same as the managed cloud bridge above. In the browser-transport model, the user grants Drive/Firebase access to the web app, the browser writes opaque encrypted capsules, and capOS never receives the provider tokens. The encryption key follows the storage domain: local capOS storage uses local capOS-host key material, while GCP-backed game-world state uses Cloud KMS envelope encryption: a per-world or per-shard KMS KEK wraps service-owned DEKs. Google Drive’s appDataFolder is a good fit for app-private backup files because it is hidden from ordinary Drive views and can use the narrow drive.appdata scope. Firebase/Firestore can also carry per-user encrypted capsule documents and provide offline cache/sync behavior, but the backend cannot validate encrypted game semantics beyond metadata and access rules.

Treat user-owned blobs as backup material, not authority:

  • The service validates signatures, profile id, content hash, schema version, monotonic version, previous hash, and size bounds before import.
  • Append-only ledgers, reward witness records, market receipts, and multiplayer outcomes remain service-owned or cloud-bridge-owned authoritative records.
  • A user may delete, duplicate, or roll back private blobs; restore code must handle that as an expected input, not as trusted history.
  • Game-world key capabilities, DEKs, and KMS decrypt/unwrap grants should not be exposed to the browser. For GCP-backed worlds, DEK unwrap and plaintext use are KMS/IAM-backed authority granted to the relevant game-world service. For local capOS storage, local key backup/recovery is a separate local-host policy.

For GCP-backed game-world state, provision one Cloud KMS key ring and symmetric CryptoKey KEK per world instance or shard. This follows the CloudKmsKeySource envelope model from the cryptography/key-management and volume-encryption proposals: Cloud KMS wraps or unwraps DEKs, and the game-world service uses the unwrapped DEK internally as service authority, modeled as a SymmetricKey capability. Grant Cloud KMS roles at the CryptoKey level where possible: roles/cloudkms.cryptoKeyEncrypter for encrypt-only writers that wrap new DEKs, roles/cloudkms.cryptoKeyDecrypter for restore or migration paths that unwrap existing DEKs, and roles/cloudkms.cryptoKeyEncrypterDecrypter only for the narrow game-world service that genuinely needs both operations. Do not model browser OAuth identities, Drive/Firebase handles, or capOS clients as holders of DEKs or KMS decrypt/unwrap grants, and do not rely on per-key-version IAM for this design.

Key rotation and world retirement are service operations, not browser-vault features. Rotation creates new Cloud KMS KEK versions for future DEK wrapping but does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old versions. Managed re-encryption or rewrapping must unwrap the old DEK while its KEK version remains usable, decrypt and validate the capsule inside the game-world service, then write a new capsule with a new DEK or a DEK rewrapped by the current primary KEK version. Old KEK versions should only be disabled or destroyed after inventory proves no accepted wrapped DEK depends on them. Retiring a world removes IAM decrypt authority first; disabling key versions can make protected capsules inaccessible, while destruction is delayed by the scheduled destruction period and irreversible once complete, so audit retention and recovery must be settled before destruction.

Phases

Phase 1: Boot Manifest (parallel with Stage 4)

  • Define SystemManifest schema in schema/
  • Build tool (tools/mkmanifest) that compiles system.cue into a capnp-encoded manifest and packs it into the ISO as a Limine module
  • Kernel parses the manifest and now creates only the initConfig.init process
  • Focused init-executor manifests pass the manifest to the separate init binary as bytes through the read-only BootPackage capability
  • The separate init binary is a generic manifest executor for the default system.cue path and focused init-executor smokes; focused shell-led smokes still use capos-shell as initConfig.init
  • No persistent storage yet — boot image is the only data source

Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)

Depends on: IPC (Stage 6) for cross-process cap transfer. Endpoint, RECV, RETURN, capability transfer in CALL params, and capability transfer in RETURN results are already implemented. The BlockDevice / File / Directory / DirEntry / Store / Namespace schema has now landed in full. The File / Directory / Store / Namespace interfaces also have RAM-backed kernel CapObject implementations (Phase 3 slices 1-3); BlockDevice remains schema-only. Userspace services that export Directory / File / Store / Namespace caps over a real backing store have since landed (Phase 3 below), and the kernel RAM-backed caps are now qemu-only proof/fixture surface rather than a production persistence service – see Kernel Storage Cap Backers Are Fixtures. That history shaped two named downstream adapters:

  • POSIX adapter Phase P1.4 (vendored dash port) does not require the userspace service for its v0 smoke: the bootstrap-granted RAM-backed Directory + Namespace kernel caps from Phase 3 slices 1-3 are an adequate read-only in-rodata pseudo-fs backing, so P1.4 is now ready to start on the userspace libcapos-posix file/dir/stdio/env/printf surface and on dash vendoring; see POSIX Adapter Phase P1.4 and docs/backlog/posix-adapter-dash-port.md. P1.3 (pipe + recording ProcessSpawner-driven fork-for-exec) landed without storage caps, so P1.4 is the next surface that consumes this proposal.
  • WASI host adapter Phase W.5 (Preview 1 filesystem) similarly consumes the same kernel cap shape and is unblocked from the same cap-surface perspective; remaining W.5 work is on the wasi-host adapter side. See WASI Host Adapter Phase W.5.

Concrete work:

  • Add BlockDevice, File, Directory, and DirEntry to schema/capos.capnp, regenerate the checked-in capnp bindings, add the BLOCKDEVICE_INTERFACE_ID / FILE_INTERFACE_ID / DIRECTORY_INTERFACE_ID constants, and add a capos-config host roundtrip test. This was schema-only when it landed; kernel CapObject implementations followed in Phase 3 slices 1-3 (the Store / Namespace interfaces were added in slice 3). SharedBuffer is not a separate interface – bulk transfers reuse the existing MemoryObject capability, and the inline-Data read / write / readBlocks / writeBlocks variants are the v0 surface.
  • Demo: two-process file server (in-memory File/Directory service + client) that the POSIX and WASI adapters can resolve preopens against

Phase 3: RAM-backed Store (after Phase 2)

Depends on: IPC (Stage 6) for cross-process store access. Same downstream blockers as Phase 2 – the POSIX adapter v0 plan resolves /etc / /lib under a read-only Namespace once this lands.

Concrete work:

  • Slice 1: minimal RAM-backed File CapObject (kernel/src/cap/file.rs). FileCap is backed by a single in-kernel Vec<u8> byte buffer and implements the inline-Data surface of the landed File interface – read / write / stat / truncate / sync / close – with per-call payloads bounded at 64 KiB. close() invalidates the cap: the cap-table get_slot path consults validate_live() (which returns Revoked once closed), and an in-call() guard is the defense-in-depth backup, so a post-close call fails closed with an application exception. A new KernelCapSource::file grant source lets a manifest grant the cap; the make run-file-server-smoke QEMU smoke (demos/file-server-smoke/, system-file-server-smoke.cue) drives write/read/stat/close round-trips and asserts the closed-cap rejection. Bulk-buffer / MemoryObject-mapped variants are later slices.
  • Slice 2: minimal RAM-backed Directory CapObject (kernel/src/cap/directory.rs). DirectoryCap is an in-memory namespace (BTreeMap<String, DirectoryEntry>, where each entry is a FileCap or a sub-DirectoryCap) implementing the landed Directory interface – open / list / mkdir / remove / sub. open / mkdir / sub mint a File / Directory result capability through the existing IPC result-cap transfer machinery (no new transfer authority); file read/write goes through the transferred File caps, never through the Directory. remove deletes an entry and revoke()s the backing object so every cap already handed out for it fails closed on its next dispatch, and refuses a non-empty sub-directory; close() invalidates the cap and recursively revokes the subtree. sub() has no attenuation beyond the structural scoping every sub-Directory already has – per-method read-only attenuation is deferred. A new KernelCapSource::directory grant source lets a manifest grant the cap; the make run-directory-server-smoke QEMU smoke (demos/directory-server-smoke/, system-directory-server-smoke.cue) drives open/list/mkdir/remove/sub with cap transfer and asserts the post-remove fail-closed rejection.
  • Slice 3: Store and Namespace interfaces in schema/capos.capnp plus minimal RAM-backed Store / Namespace kernel CapObjects (kernel/src/cap/store.rs, kernel/src/cap/namespace.rs). The schema additions are purely additive (Store / Namespace interfaces and the store @34 / namespace @35 KernelCapSource ordinals); the STORE_INTERFACE_ID / NAMESPACE_INTERFACE_ID constants and a capos-config host roundtrip test landed alongside. StoreCap is a content-addressed blob store (BTreeMap<[u8; 32], Vec<u8>> keyed by the SHA-256 content hash from capos_lib::content_hash) implementing put / get / has / delete; put is idempotent for identical content, blob and count bounds keep one Store from ballooning the kernel heap, and delete is kept on the base interface for this focused proof (the StoreAdmin split and a GC-verified delete remain deferred – see the delete note above). NamespaceCap is a name->hash binding map (BTreeMap<String, Vec<u8>> for bindings plus a BTreeMap<String, Arc<NamespaceCap>> of sub children) implementing resolve / bind / list / sub; bind overwrites an existing name (mutable references are the point), sub(prefix) mints a structurally scoped child node and transfers it through the existing IPC result-cap machinery (no new transfer authority, idempotent for a repeated prefix), and the parent->child recursive revoke() reuses the same finite-tree lock-ordering invariant DirectoryCap documents. The bindings are opaque hash bytes – a NamespaceCap does not hold a StoreCap reference or verify the hash names a live blob in this slice. New KernelCapSource::store / KernelCapSource::namespace grant sources let a manifest grant the caps; the make run-store-namespace-smoke QEMU smoke (demos/store-namespace-smoke/, system-store-namespace-smoke.cue) drives Store put/has/get/delete and Namespace bind/resolve/list/sub with cap transfer and asserts two fail-closed rejections (a Store.get of an unknown hash and a Namespace.resolve of an unbound name).
  • Implement Store as a userspace service over an exported Endpoint, moving it out of the kernel data path: a two-process provider->consumer demo (demos/store-service/, system-userspace-store-smoke.cue, make run-userspace-store-smoke) serves put/get/has/delete from an in-RAM BTreeMap<[u8;32], Vec<u8>> – no kernel Store cap in the data path. It mirrors the kernel StoreCap blob-count bound and publishes a narrower 4 KiB service-specific inline blob limit because the endpoint-framed request must fit in the service receive buffer; the smoke proves the largest accepted inline blob and the first rejected over-limit blob. The client uses the stock capos-rt StoreClient over the service endpoint relabelled to STORE_INTERFACE_ID via the manifest expectedInterfaceId. Still RAM, not yet a real store.
  • Implement a persistent Store + Namespace userspace service backed by a granted BlockDevice, moving the durable serve boundary out of the kernel: a three-process demo (demos/storage-persist-service/, system-storage-persist-service.cue, make run-storage-persist-service) serves Store (put/get/has/delete/list) and Namespace (resolve/bind/list/sub) from a single service that owns the on-disk CAPOSUS1 whole-state snapshot over a virtio-blk BlockDevice – no kernel Store/Namespace cap in the data path. The snapshot stores content-addressed blob bytes (keys recomputed and re-verified on load) and name->hash bindings; a superblock names the live snapshot length, its content hash, and a monotonic generation, and every mutation writes the new payload fully into the standby of two alternating A/B payload regions (selected by generation parity) and FLUSHes it before the single-sector superblock write flips the generation, so the previously committed snapshot survives a crash at any write boundary. Namespace.sub returns a scoped Namespace cap by pre-minting a bounded pool of Namespace-typed service-object facets of the service’s own namespace endpoint (each a distinct receiver cookie, minted through a spawned sub-helper) and transferring one through the IPC result-cap path; scoped calls route back to the same endpoint by cookie. The client reaches both interfaces through manifest-granted service caps relabelled to STORE_INTERFACE_ID / NAMESPACE_INTERFACE_ID, and the two-boot make run-storage-persist-service proves the marker and note objects and their bindings survive a reboot (the service reloads them before the second boot writes anything) even after the harness garbages the standby payload region between the boots, simulating a commit interrupted mid payload write (torn-commit recovery proof).
  • Serve the result-cap-returning userspace Directory + File filesystem interfaces from userspace: a three-process demo (demos/storage-fs-service/, system-storage-fs-service.cue, make run-userspace-directory-file-smoke) runs a service (the init process) that owns an in-memory filesystem tree and serves Directory (open/list/mkdir/remove/sub/create/rename) and File (read/write/stat/truncate/sync/close) over a single endpoint, dispatched by the call’s stamped interface id and receiver-cookie badge – no kernel readonly_fs/writable_fs/installable_image cap in the data path. Directory.open (-> File), mkdir/sub (-> Directory) transfer result caps from bounded pools of pre-minted typed service-object facets of the same endpoint (minted through the spawned subhelper, each a distinct cookie). The client reaches the tree through a writable root (a Directory client-endpoint facet) and a read-only root (a Directory service-object facet over the same tree); read-only attenuation is structural – the read-only root and the read-only File handles it returns fail mutation methods closed by routing on the cookie, not a rights flag. The proof drives the positive surface plus fail-closed cases (closed/stale File handle, path traversal via ..//, absent paths, read-only mutation, oversize writes). The existing kernel-backed WASI filesystem smoke (make run-wasi-fs) stays green as the explicitly fixture-labeled kernel Directory/File path. The follow-up cleanup retiring the kernel storage cap backers as production routes has landed – see Kernel Storage Cap Backers Are Fixtures below.
  • Backed by RAM (no disk driver yet, data lost on reboot)
  • Backed by a real store (persistent userspace service over BlockDevice, survives reboot)
  • Services can store and retrieve capnp objects at runtime
  • Demonstrate the naming model with a userspace Namespace service
  • Namespace.sub() returns new caps via IPC cap transfer

Kernel Storage Cap Backers Are Fixtures

The kernel Store, Namespace, File, Directory, readOnlyFsRoot, persistentStore, and writableFsRoot grant sources were the proof paths that landed the typed storage interfaces. Now that the userspace services above own the production serve boundary – the RAM Store service (demos/store-service, make run-userspace-store-smoke), the disk-backed Store + Namespace service (demos/storage-persist-service, make run-storage-persist-service), and the Directory + File filesystem service (demos/storage-fs-service, make run-userspace-directory-file-smoke) – the kernel backers are explicitly proof/fixture surface, not production storage routes. Production storage is userspace-served; no production manifest grants kernel-owned storage state ownership (the default system.cue boot grants none).

The kernel grant sources are gated accordingly:

  • The RAM-backed file / directory / store / namespace sources are gated behind the qemu feature in both the bootstrap cap-table builder (kernel/src/cap/mod.rs) and the ProcessSpawner spawn-grant path (kernel/src/cap/process_spawner.rs). The default non-qemu production kernel fails closed on these sources. They remain available only as the in-RAM pseudo-fs backing for the qemu interface proofs (make run-store-namespace-smoke, make run-file-server-smoke, make run-directory-server-smoke, make run-storage-naming) and for the POSIX/WASI/dash adapter smokes (make run-posix-*, make run-wasi-fs).
  • The disk-backed virtio read_only_fs_root / persistent_store / writable_fs_root sources (kernel/src/cap/readonly_fs.rs, persistent_store.rs, writable_fs.rs) were already gated behind qemu (with storage_fat_read / cloud_*_over_nvme_proof variants for the FAT and NVMe proof arms) and fail closed in the default production kernel. They back the storage regression proofs make run-storage-fs, make run-storage-persist, and make run-storage-writable (plus the FAT and NVMe proof targets), which stay green as explicitly fixture-labeled kernel paths.

In short: the kernel keeps these backers only as named qemu/cloud-proof fixtures; a default production build has no kernel storage grant route, so the typed storage interfaces are served from userspace.

Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)

  • virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
  • BlockDevice trait implementation
  • FAT filesystem service: wraps BlockDevice, exports Directory/File caps
  • SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
  • Store service uses BlockDevice for persistence (the persistent userspace Store + Namespace service above, make run-storage-persist-service)
  • System state survives reboot via the persistent userspace store (make run-storage-persist-service); manifest restore hints remain future work

Phase 5: Network Store (after networking)

  • Store service can replicate to or fetch from a remote store
  • Capability references transparently span machines
  • Directory cap backed by a remote filesystem (9P-style)
  • Managed cloud bridges can back selected Store/Namespace or app-specific SaveStore capabilities without changing caller authority. First target: GCP-backed profile/ledger/snapshot storage for the adventure demo, with local fake-cloud tests and no provider credentials in ordinary clients.
  • User-owned browser transport can store encrypted save capsules in Google Drive appDataFolder or Firebase user documents. This is for private backup/sync, not authoritative shared state.

Relationship to Other Proposals

  • Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
  • Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
  • Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the mechanism that makes open() and resolve() work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked in docs/roadmap.md Stage 6.
  • POSIX Adapter — Phase P1.4 (vendored dash port) consumes the Namespace + File + Directory cap surface defined here; that surface landed as RAM-backed kernel CapObjects in Phase 3 slices 1-3 and is the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs. P1.3 (recording-shim pipe + fork-for-exec) has already landed without storage caps, so P1.4 is the next adapter consumer. The POSIX path resolver, open/read/write/stat/unlink, /etc and /lib preopen scoping, and the dash port itself all sit on this proposal’s Phase 2/3 schema.
  • WASI Host Adapter — Phase W.5 (Preview 1 filesystem: fd_read/fd_write/fd_seek/fd_pread/fd_pwrite/ fd_filestat_get/path_open/path_filestat_get/path_unlink_file) consumes the same cap shape and is unblocked from the cap-surface side (Phase 3 slices 1-3 land the RAM-backed Directory / Namespace / File caps). Preopened-dir fds map to Namespace caps from the manifest; path_open resolves through that namespace’s Store / File capability. Phases W.2/W.3/W.4 (stdout, argv-grant, random_get) shipped without storage caps, so W.5 is the next adapter consumer alongside POSIX P1.4.
  • Userspace Binaries Parts 4 and 5 — the POSIX adapter (Part 4) and the WASI host adapter (Part 5) both describe their filesystem stories as translations onto this proposal’s Namespace / Directory / File / Store surface. Part 4 sketches the Namespace-rooted POSIX fd table and the Namespace + Store -> file I/O translation; Part 5 maps each preopened-dir fd to a Namespace cap.
  • Adventure game proposal — profile, expedition, ledger, and content persistence use application-level save records through Store/Namespace or an app-specific cloud bridge. The game should not persist by snapshotting a live process or exposing provider credentials to clients.
  • Cryptography/key-management and volume-encryption proposals — the Cloud KMS path uses envelope encryption. KMS wraps DEKs under KEKs; capOS services use local SymmetricKey authority for plaintext operations.

Open Questions

  1. Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?

  2. Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.

  3. Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with Store read + delete authority? Reference counting in the namespace layer?

  4. Large objects. Storing multi-megabyte binaries as single capnp Data fields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’s put/get interface still takes Data. Options: chunked storage (Merkle tree of hashes), a streaming Blob interface, or SharedBuffer-aware Store methods.

  5. Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?

  6. File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).

  7. RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.

Proposal: Standard App Capabilities (AppData, Powerbox, Attenuated Sharing)

Status: future design. No implementation. This proposal defines three app-facing capability patterns; the AppData cap is the nearest-term, self-contained piece, the powerbox and sharing-mint depend on a trusted display path and the attenuation wrappers respectively.

Summary

Google Drive, examined closely, spends a lot of effort reluctantly re-inventing capabilities on top of an ambient REST API: the drive.file scope plus the Picker (an app may touch only files the user explicitly hands it), the appDataFolder space (per-app private storage invisible to the user and other apps), and the role lattice (reader/writer/…) for sharing. Each is a workaround for the fact that the base API is ambient-by-default and gated by OAuth scopes – a category rights bitmask re-checked server-side.

capOS does not have that base problem: there is no ambient authority, no path VFS, and access is narrowed by handing a more-restricted typed capability. So capOS can express Drive’s three good ideas as the native mechanism rather than the exception, and more cleanly:

  • AppData – a per-process private storage root, granted at spawn and never duplicated. Isolation is structural (only one holder), not a server scope check keyed to an OAuth client id.
  • Powerbox (a FilePicker/resource-picker broker) – a user-mediated grant where a trusted selector the app cannot script returns a real, fresh, method-narrowed capability for exactly what the user chose. This is what drive.file + Picker is trying to be.
  • Attenuated sharing – “share read-only” means handing a File wrapper that lacks write; escalation is impossible by construction, not by per-request ACL evaluation.

The goal is to make application development both simpler (apps ask for a private scratch space or a user-picked file instead of negotiating a global namespace and scope ladder) and more secure (least authority by default, enforced structurally). These caps are backend-independent: they sit unchanged in front of RAM, local disk, and a future Google Drive backend (docs/proposals/drive-storage-backend-proposal.md).

What capOS already has (build on, do not reinvent)

  • Storage caps Store / Namespace / Directory / File exist in schema/capos.capnp and as RAM-backed kernel CapObjects. Directory.sub() / Namespace.sub(prefix) already return structurally-scoped child caps that cannot traverse upward (the chroot analog). See docs/proposals/storage-and-naming-proposal.md and kernel/src/cap/.
  • An attenuation table is already designed in storage-and-naming-proposal.md (read-only / append-only File, read-only Directory/Store/Namespace wrappers) but is not yet implemented – current sub() has structural scoping with no per-method attenuation. This proposal’s sharing pattern depends on landing those wrappers.
  • authority_broker (kernel/src/cap/authority_broker.rs) is already a decision point that mints a bundle of capabilities for a session based on its SessionContext principal/profile (the login -> shellBundle / remoteClientBundle flow). It is the proto-powerbox; the powerbox below generalizes it from session-establishment-time to a per-request, user-confirmed grant.
  • session_context (kernel/src/session_context.rs) binds one immutable identity per process. An AppData root and a powerbox grant can both key on SessionContext.principal_id, exactly as authority_broker already does.
  • Manifests grant exactly the caps an app receives, with a grant mode (Raw/ClientEndpoint/Move/ServiceObject). Per-app scoping today is “the manifest grants a sub()-scoped Directory.”

The genuinely new surface is: a per-app AppData interface, a per-request powerbox/file-picker mechanism (the term “powerbox” is currently unused in the repo), and a service that mints attenuated caps for sharing to another principal.

Design lessons from Google Drive

Drive conceptWhat it really iscapOS pattern
appDataFolder + drive.appdata scopePer-app hidden storage, server-scope-gatedAppData cap: one holder, structural isolation
drive.file + PickerUser-mediated per-file grant (ACL expansion)Powerbox broker mints/returns a per-object cap
OAuth scope ladder (drive vs drive.readonly)Category rights bitmask on a principal(rejected) method-narrowed wrapper caps
Roles (reader..owner)ACL lattice entries, re-checked per requestAttenuated wrapper caps (subset of methods)
expirationTime permissionServer-enforced time-boxed ACL entryRevocation/expiry membrane held by the grantor
anyone / link sharingBearer grant (authority = possession)Bearer cap – deliberately flagged, audited
Shortcut (pointer file)Reference to a target idNamespace name -> cap binding
Revisions / keepForeverPer-file version listContent-addressed Store blobs + mutable pointer + GC pin

The recurring lesson: Drive’s least-privilege features are the ones where it was forced to approximate object capabilities (drive.file, Picker, appDataFolder); its scope ladder and server-side ACL are the ambient base it is working around. capOS should adopt the former natively and not import the latter.

1. AppData – per-app private storage

Every process can be granted, at spawn, a private storage root that no other principal holds a copy of. Isolation requires no policy check: the cap is simply never handed to anyone else.

interface AppData {
  open   @0 (name :Text) -> (file :File);   # create-or-open within this app's root
  list   @1 () -> (entries :List(Text));
  remove @2 (name :Text) -> ();
}
  • Backing: an AppData cap is a thin role over a Directory (or Namespace) scoped to the app – in the simplest form a manifest-granted Directory.sub("<app>"). It can be backed by RAM today, local disk later, or the Drive appDataFolder space (see the backend proposal).
  • Isolation vs Drive: Drive enforces appData isolation with a server-side scope check keyed to the OAuth client id (ambient identity gating a shared namespace). capOS hands each process a private cap and never duplicates it – cross-app leakage is not possible, not merely disallowed.
  • Quota: attach a storage budget to the cap (per the resource-accounting-proposal.md ledger model) instead of charging a global per-user pool. This is a deliberate divergence from Drive’s unified per-human quota (see Non-Goals).
  • Lifecycle: the root and its storage are reclaimed when the principal is destroyed – the cap analog of Drive deleting appDataFolder on app uninstall.

AppData is the nearest-term piece: over RAM it is a small userspace service plus a manifest grant, with no dependency on the powerbox or the attenuation wrappers.

2. Powerbox – user-mediated capability grants

A powerbox is a trusted broker that, on an app’s request, presents the user a selector the requesting app cannot script or read through, and on the user’s confirmation mints and returns a fresh capability for exactly the chosen object – optionally method-narrowed. It generalizes authority_broker from “mint a bundle at login” to “mint one cap per user gesture.”

interface FilePicker {
  pickFile  @0 (mode :AccessMode) -> (file :File);
  pickFiles @1 (mode :AccessMode) -> (files :List(File));
  pickDir   @2 (mode :AccessMode) -> (dir  :Directory);
}
enum AccessMode { readOnly @0; readWrite @1; }
  • Why better than drive.file + scope: the returned File is a real handle scoped to one object, narrowed at mint time (no drive.readonly string), revocable locally by dropping it, with no “+ files the app created” fuzzy second clause and no server ACL round-trip. The user gesture is the grant.
  • Prior art: this is the Genode “parent routes the session request according to policy” pattern (docs/research/genode.md §Session Routing) and the Sculpt/nitpicker user-mediated resource model. capOS’s authority_broker is the analog of Genode parent-routing; the powerbox is its per-request generalization.
  • Hard prerequisite – trusted display: a powerbox is only as trustworthy as the path that shows the selector. The user must be able to trust that the selector UI is the system’s, not a spoof drawn by the requesting app. capOS does not yet have a multiplexed trusted-display primitive (the nitpicker analog); today the trusted surface is the shell/session/terminal. The file-picker powerbox therefore depends on either (a) a text-mode trusted selector hosted by the session/shell, or (b) a future trusted display service. This is the powerbox’s gating dependency and is called out as an open question.
  • The powerbox is not file-specific in principle – the same broker shape can mediate user-confirmed grants of other resource caps. This proposal scopes the first instance to file/dir/storage selection; a general Powerbox is future work.

3. Attenuated sharing – wrapper caps + revocation membrane

Sharing is delegating a capability, optionally narrowed to a smaller interface, optionally through a revoker.

interface File {
  # ... existing read/write/stat/...
  shareAs @N (role :ShareRole, expiresAt :UInt64)
          -> (handle :File, revoke :Revoker);
}
enum ShareRole { reader @0; commenter @1; writer @2; }

interface Revoker { revoke @0 () -> (); }
  • Roles as method subsets: a reader is a File cap exposing only the read-side methods (today read/stat); writer additionally exposes write/truncate. Escalation is impossible because the method literally is not on the object the grantee holds – not because an ACL is re-evaluated. This is a monotone lattice expressed structurally. (A commenter role, as in the ShareRole enum below, implies a comment surface the current File interface – read/write/stat/truncate/sync/close – does not yet have; it is illustrative of the lattice, not of an existing method.)
  • Depends on the attenuation wrappers already designed in storage-and-naming-proposal.md but not yet implemented. Landing those read-only/narrowed File/Directory wrappers is the prerequisite for shareAs.
  • Clawback is the one place capabilities are weaker than Drive’s mutable ACL: a handed-out cap cannot be unilaterally downgraded later. shareAs therefore mints the shared handle through a revocation membrane and returns the Revoker to the grantor, so “un-share later” and expiresAt are supported – at the cost of an interposed membrane and a trusted clock on the sharing path.
  • Shared directories / group ownership (Drive shared drives) map to a group-owned Directory with per-member role wrappers; deferred to future work.

Uniformity across storage backends

All three caps are defined over the existing typed storage interfaces, so they are identical whether the backing is RAM, local disk (docs/proposals/storage-and-naming-proposal.md), or Google Drive (docs/proposals/drive-storage-backend-proposal.md). An app that uses an AppData cap and a FilePicker does not know or care which backend serves it. This is the same backend-agnosticism the storage proposal already states for Store (“backed by virtio-blk, RAM, or network”).

Honest mismatches and non-goals

  • Bearer / link sharing (anyone): capabilities are bearer tokens, so link-sharing maps “cleanly” – which is exactly the risk. It drops user mediation entirely (anyone with the bytes has access). Treat it as a deliberately-flagged, audited exception, never a default; prefer a powerbox grant or shareAs to a named principal.
  • Clawback / instant global revoke: Drive’s owner can demote any grantee at any time via the central ACL. capOS gets this only where caps were minted through a revoker; there is no zero-cost equivalent of Drive’s org-wide instant revoke for already-forwarded caps.
  • Unified human quota: Drive charges one per-user quota across spaces. capOS uses per-cap budgets; reconciling “this app’s AppData counts against the human’s storage” is a policy question with no clean cap answer. Per-cap budgets are the default; a unified human-facing view is out of scope.
  • Scope tiering is administrative, not technical: Drive’s restricted-scope verification is a business/review gate, not a security mechanism. It has no capability analog and is explicitly not imitated; structural narrowing replaces it.
  • Trusted display is a real gap: without it, the powerbox selector can be spoofed by the requesting app. This proposal does not deliver a trusted display; it depends on one (open question below).

Relationship to existing proposals

  • storage-and-naming-proposal.md – owns the storage caps, the attenuation table this proposal’s sharing depends on, and the existing “Managed Cloud Backing” / “User-Owned Browser Transport” sections. A small reconciling update there should cross-reference AppData and the powerbox; this proposal is the standalone home for the three patterns.
  • userspace-authority-broker-proposal.md – proposes moving broker policy into init-owned userspace; the powerbox should live wherever the broker lands.
  • oidc-and-oauth2-proposal.md – the OAuth consent screen is itself a powerbox grant; the patterns are consistent.
  • docs/research/{genode,plan9-inferno,eros-capros-coyotos}.md – Genode parent-routing/powerbox, Plan 9 per-process namespaces (an AppData mounted alongside other storage is the union-namespace pattern), and the EROS persistence contrast (capOS keeps application-level persistence, not transparent single-level store).

Phasing

  1. AppData over RAM (near-term, self-contained): a userspace AppData service plus a manifest grant; QEMU proof that two apps cannot see each other’s data. No powerbox/wrapper dependency.
  2. Attenuation wrappers (implement the already-designed read-only/narrowed File/Directory wrappers): prerequisite for sharing.
  3. shareAs + Revoker (sharing-mint): once wrappers exist; adds the revocation membrane and a trusted clock on the sharing path.
  4. FilePicker powerbox (gated on a trusted display path): start with a session/shell-hosted text-mode selector; generalize to a Powerbox and a trusted display service later.

Open questions

  • What is the trusted-display primitive the powerbox selector renders through – a shell/session-hosted selector, or a new multiplexed display service (the nitpicker analog)?
  • Should AppData quota integrate with resource-accounting-proposal.md ledgers, and how does it relate to a future unified human-facing storage view?
  • Does shareAs belong on each storage interface (File/Directory/Store) or on a separate Sharing minting service that takes a cap and returns a narrowed one?
  • Is the first powerbox instance file/dir/storage-only, or should the general Powerbox shape (mediating any resource cap) be defined up front?

Proposal: Google Drive Storage Backend

Status: future design. No implementation. The native backend is gated behind the userspace-driver authority gate, a userspace network stack, an outbound TLS client, an HTTP client, and the OAuth2 service – none of which exist yet. The browser-transport model is the near-term path and is already partially specified in storage-and-naming-proposal.md.

Summary

Let a Google-authenticated user use their own Google Drive as a capOS storage backend, exposed behind the same storage capabilities apps already use (Store / Namespace / Directory / File, and the AppData cap from docs/proposals/standard-app-capabilities-proposal.md). The user’s Drive – specifically the per-app appDataFolder space – becomes the backing for an app’s AppData cap, and selected user files become File caps minted through the powerbox.

There are two delivery models, and this proposal keeps them explicit because they have very different trust and readiness profiles:

  • Browser-transport (near-term): the user’s browser holds the Google OAuth session and does the TLS/HTTP to Drive; capOS never sees Google tokens and stores only encrypted capsules in appDataFolder. This is already sketched in storage-and-naming-proposal.md (“User-Owned Browser Transport”) and is feasible without a capOS network/TLS/OAuth stack.
  • Native backend (deep-future): a capOS userspace service holds the OAuth refresh token and performs outbound HTTPS to the Drive API itself. This is the more capable model and the more demanding one – it sits behind the full network/TLS/HTTP/OAuth dependency chain.

In both models, Drive sits behind a backend adapter, not pretended to be a set of first-class local object caps (see Trust Model).

Why Drive

  • The user already owns the storage and the quota; capOS does not provision a server.
  • The appDataFolder space is a near-exact fit for the per-app AppData cap: Google already provides per-app private storage invisible to the user and other apps under the narrow, non-sensitive drive.appdata scope.
  • Drive’s drive.file + Picker consent model maps onto the capOS powerbox, so user-selected files become capabilities without granting the all-files scope.
  • It is a concrete, widely-available validation of the storage caps’ backend-agnosticism that storage-and-naming-proposal.md already asserts (“Store service backed by virtio-blk, RAM, or network”).

Architecture

A userspace Drive storage service implements the standard storage cap interfaces and translates their methods into Drive REST calls:

app  --(File/Directory/Store/AppData cap)-->  Drive storage service
                                                |  uses
                                                v
                              DriveAccount cap (OAuth tokens)  -->  OAuthClient / AccessToken
                                                |                    (oidc-and-oauth2-proposal)
                                                v
                              OutboundHttpRequest  -->  TLS client  -->  userspace net stack
                              (networking)            (certificates-and-tls)
  • The service consumes, does not redefine the OAuth capabilities from oidc-and-oauth2-proposal.md (OAuthClient, AccessToken.authorize / attenuate, RefreshToken), passing each Drive request as the OutboundHttpRequest struct that AccessToken.authorize decorates with the bearer credential. The refresh token lives in the OAuth service; the Drive service holds a DriveAccount cap that exposes only the typed operations the user consented to.
  • It consumes the outbound TLS client from certificates-and-tls-proposal.md and the HTTP client / userspace network stack from networking-proposal.md Phase C.
  • It is the network analog of the virtio-blk-backed FS service in docs/proposals/storage-and-naming-proposal.md: same Directory/File/Store caps in front, a different backend behind.

Concept mapping (Drive -> capOS standard caps)

DrivecapOS standard cap (Proposal A)
appDataFolder space (drive.appdata)AppData cap backing
drive.file + Picker selectionpowerbox FilePicker returns a File cap
File idthe File cap handle
FolderDirectory cap
Roles (reader/writer)shareAs wrapper caps
ShortcutNamespace binding
Revisionscontent-addressed Store blobs + pointer
OAuth scopes(not modeled internally) method-narrowed DriveAccount

A key consequence: capOS does not model OAuth scopes internally. A DriveAccount cap exposes only the methods the user consented to; “read-only Drive access” is a DriveAccount whose wrapper omits write methods, not a drive.readonly scope string re-checked server-side.

Dependency stack and gating

A native Drive client backend needs, bottom-up:

LayerNeedcapOS state
NIC / virtio-netpacket I/Opartly present (virtio-net MSI-X + delivery); driver must move to userspace
TCPreliable streampresent (smoltcp, in-kernel/transitional); must move to a userspace net process
TLS 1.2/1.3confidentiality + server auth (X.509 chain, trust roots, AEAD/ECDHE)not implementedcertificates-and-tls-proposal.md is future design (rustls + webpki-roots planned); the hardest single piece
HTTP/1.1 or HTTP/2Drive REST transport (Google prefers HTTP/2)not implemented
JSONrequest/response + metadata; resumable upload state machinetractable (serde_json no-std)
OAuth2 token flowPKCE/device-flow handshake, refresh->access exchange, sealed refresh-token storagedesigned but unimplemented (oidc-and-oauth2-proposal.md)
Trusted wall-clocktoken expiry, cert validity, permission expirationTimeweak today; needed for TLS cert validity

The native backend is therefore gated on, in order: docs/backlog/hardware-boot-storage.md Task 5 (userspace-driver authority gate) -> networking-proposal.md Phase C (userspace net stack + NIC driver) -> certificates-and-tls-proposal.md (outbound TLS) -> an HTTP client -> oidc-and-oauth2-proposal.md (OAuth service). This is the same authority gate that blocks userspace networking generally; the Drive backend is one of its downstream consumers, not a way around it.

Delivery models

Browser-transport (near-term)

The user’s browser, already authenticated to Google, holds the OAuth session and performs the TLS/HTTP to Drive. capOS hands the browser an opaque, client-side-encrypted capsule to store in the app’s appDataFolder; capOS never sees Google tokens. This reuses the remote-session / browser-capability surface and the KMS envelope-encryption pattern in storage-and-naming-proposal.md (“User-Owned Browser Transport”). It is feasible before any capOS network/TLS/OAuth stack exists, and is the recommended first delivery. This proposal’s role here is to reconcile that existing section with the AppData/powerbox vocabulary, not to redefine it.

Native backend (deep-future)

A capOS Drive service holds the OAuth refresh token and does outbound HTTPS itself. For a headless/embedded OS the realistic OAuth flows are authorization-code + PKCE with a loopback redirect (http://127.0.0.1:port) when a same-host browser is reachable, otherwise the device flow (show URL + code on one device, poll for tokens). PKCE is non-negotiable – capOS has no trustworthy on-device confidential client secret. Token lifecycle: persist only the refresh token in a sealed cap (an AppData-style or credential_store secret), exchange for short-lived access tokens on demand, and treat the access token as an ephemeral bearer credential passed to the HTTP path, never persisted.

Trust model and honest mismatches

A Drive-backed File is not a true local object capability – it is a bearer credential to a remote, server-authoritative ACL. The authority lives on Google’s server, which re-checks it on every request against a mutable table. Consequences the design must respect:

  • No local revocation/attenuation guarantee. Dropping a local handle does not revoke access Google still grants; narrowing a DriveAccount wrapper does not change Google’s server-side scope. capOS can wrap Drive behind the adapter but cannot give a remote file the local revocation/attenuation semantics of a true cap.
  • Offline = non-functional. Unlike a local cap, a Drive-backed cap is dead without network.
  • Global mutable namespace / instant org-wide revoke are Drive server-authoritative features with no clean local-cap equivalent; they stay behind the adapter.
  • Quota is Drive’s per-user pool, not a per-cap budget; an app’s appDataFolder usage counts against the human’s Drive quota.

Therefore Drive is exposed strictly as a backend adapter that serves the storage caps with documented remote semantics, never as a drop-in for local object caps. Apps that need local revocation/attenuation/offline guarantees should use a local backend; apps that want the user’s Drive accept the remote semantics.

Phasing

  1. Reconcile + capsule model (near-term, browser-transport): align the existing “User-Owned Browser Transport” section of storage-and-naming-proposal.md with the AppData/powerbox vocabulary; define the encrypted-capsule format and the appDataFolder capsule lifecycle. No capOS network/TLS/OAuth dependency.
  2. OAuth service + outbound HTTPS prerequisites (deep-future): land the gated chain (userspace net stack, TLS, HTTP, OAuth service) per their own proposals. This proposal only consumes them.
  3. Native DriveAccount + Drive storage service (deep-future): implement the service that maps AppData/File/Directory/Store onto Drive REST using the OAuth/TLS/HTTP caps; prove an appDataFolder round-trip and a powerbox-picked file read in a QEMU smoke against a Drive API stand-in.
  4. Sharing bridge (future): map shareAs to Drive permissions where the remote semantics allow, with the bearer/clawback caveats flagged.

Relationship to existing proposals

  • docs/proposals/standard-app-capabilities-proposal.md – defines the AppData/powerbox/shareAs caps this backend serves.
  • docs/proposals/storage-and-naming-proposal.md – owns the storage caps, the “Managed Cloud Backing” and “User-Owned Browser Transport” sections (the near-term Drive path), and the backend-agnosticism this validates.
  • docs/proposals/oidc-and-oauth2-proposal.md – the OAuth token capabilities this backend consumes; the refresh token lives there.
  • docs/proposals/certificates-and-tls-proposal.md – the outbound TLS client.
  • docs/proposals/networking-proposal.md – Phase C userspace net stack + the HTTP client; the shared authority gate.
  • docs/research/{eros-capros-coyotos,plan9-inferno}.md – application-level persistence (vs transparent single-level store) and per-process namespaces (a Drive backend unioned into an app namespace alongside local storage).

Open questions

  • Is the encrypted-capsule (browser-transport) model sufficient for the first user-facing Drive feature, deferring the native backend until the network stack is real?
  • Where does the refresh token live – the OAuth service’s own sealed store, a credential_store extension, or a dedicated DriveAccount object?
  • Does the native backend target the Drive REST API directly, or go through a capOS-hosted proxy that holds the Google credentials (narrowing the on-device trust surface, at the cost of running a proxy)?
  • How are Drive’s server-side semantics (revocation, quota, mutable ACL) surfaced to apps so they are not surprised by a File cap that behaves unlike a local one?

Proposal: Error Handling for Capability Invocations

How capOS communicates errors from capability calls back to userspace processes.

Current design authority now lives in Error Handling. This proposal is retained as the archival decision record and original rationale for the implemented two-level model.

This proposal defines a two-level error model: transport errors (the invocation mechanism itself failed) and application errors (the capability processed the request and returned a structured error). The design aligns with Cap’n Proto’s own exception model and the patterns used by seL4, Zircon, and other capability systems.

Status note: The shared-memory capability ring + cap_enter has replaced cap_call as the invocation surface, and the two-level error model described below is implemented for the current ring, runtime, and endpoint IPC surface. Transport errors arrive as negative CapCqe.result codes (see “Current CQE Error Namespace”); application errors arrive as a serialized CapException with CAP_ERR_APPLICATION_EXCEPTION. The CapException schema and ExceptionType taxonomy live in schema/capos.capnp (enum ExceptionType and struct CapException near the bottom of the schema), the kernel side serializes them through kernel/src/cap/ring.rs (including the INVALID_ARGUMENT_SENTINEL channel for the capOS-only invalidArgument variant), and capos-rt/src/client.rs decodes them into ClientError::Application(ApplicationException).

Related documents:

  • docs/architecture/error-handling.md is the current design authority for the implemented error layers.
  • docs/architecture/capability-ring.md owns the current ring transport contract that carries the CQE status values.
  • docs/proposals/service-architecture-proposal.md captures the cross-process spawn and revoked-endpoint surface that exercises Disconnected and the endpoint RETURN exception flag end-to-end.
  • docs/design-risks-register.md records the open contracts that flow into this proposal: R6 (deferred CAP_OP_RELEASE) and R15 (application-exception serialization depends on result-buffer capacity).
  • docs/capability-model.md describes the broader capability model the error layers sit inside; this proposal owns only the error model.

The “Problem Statement”, “Syscall Return Convention”, “Kernel Implementation”, “Userspace API”, and “Migration Path” sections below describe the original cap_call-era design that motivated the model. They are kept as historical context; the “Current CQE Error Namespace”, “CapException Schema”, and “Application-Level Errors in Interface Schemas” sections describe current behavior.

Current CQE Error Namespace

The capability ring uses signed 32-bit CapCqe.result values. Non-negative values are opcode-specific success results; negative values are kernel transport errors defined in capos-config/src/ring.rs:

CodeNameMeaning
-1CAP_ERR_INVALID_REQUESTMalformed request metadata or an opcode value not reserved in the ABI.
-2CAP_ERR_INVALID_PARAMS_BUFFERSQE parameter buffer is unmapped, out of range, or not readable.
-3CAP_ERR_INVALID_RESULT_BUFFERSQE result buffer is unmapped, out of range, or not writable.
-4CAP_ERR_INVOKE_FAILEDCapability lookup or invocation failed before a successful result was produced.
-5CAP_ERR_UNSUPPORTED_OPCODEOpcode is reserved in the ABI but not yet dispatched. Currently returned for CAP_OP_FINISH; CAP_OP_RELEASE has kernel dispatch and reports stale/non-owned caps as request/invoke failures.
-6CAP_ERR_TRANSFER_NOT_SUPPORTEDTransfer mode or sideband descriptor layout is recognized as unsupported by this kernel.
-7CAP_ERR_INVALID_TRANSFER_DESCRIPTORxfer_cap_count descriptor layout malformed or contains reserved bits.
-8CAP_ERR_TRANSFER_ABORTEDTransaction-in-progress transfer failed and must not produce partial capability state.
-9CAP_ERR_APPLICATION_EXCEPTIONA structured CapException was serialized into the caller-provided result buffer.
-10CAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDAn application exception occurred, but no detail fit in the available result buffer.

This is deliberately a small transport namespace. Interface-specific failures should be encoded in the result payload once the target capability successfully handles the request.

Revoked capabilities use the same application-exception path when the caller provided a result buffer. Ordinary capability CALLs and endpoint CALL/RECV on a revoked cap serialize a Disconnected CapException and complete with CAP_ERR_APPLICATION_EXCEPTION. Runtime clients decode that CQE into ClientError::Application(ApplicationException { type: Disconnected, ... }).

Endpoint RETURN is asymmetric because the result belongs to the original caller, not the returning receiver. A receiver can set CAP_SQE_RETURN_APPLICATION_EXCEPTION on CAP_OP_RETURN to return a serialized CapException to the original caller; the receiver’s own RETURN CQE still reports only whether the RETURN transport succeeded. If a receiver tries to RETURN through a revoked endpoint while an in-flight caller still has a result buffer, the kernel first preflights completion-queue space for both caller and receiver, then removes the in-flight call, serializes a Disconnected exception into the caller’s buffer, and posts the caller completion with CAP_ERR_APPLICATION_EXCEPTION. The receiver always gets CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED because revoked RETURN has no receiver-owned result payload. If the caller did not provide a result buffer, the caller also receives the truncated code. Lookup or CQ-space failures that cannot be tied to a result buffer remain transport failures.

Revoking an endpoint cap through a child CapabilityManager also cancels endpoint wait state on that object: owner endpoint revoke cancels all queued calls, pending receives, and in-flight calls, while non-owner endpoint facet revoke cancels entries tied to the managed child pid. Those cancellation completions use the existing endpoint-cancel transport result because they describe already-pending SQEs, not a fresh invocation with a result buffer.

Current Implementation Inventory

Implemented typed exception paths:

  • Ordinary CAP_OP_CALL capability implementations that return capnp::Error are serialized as CapException payloads when the SQE has a writable result buffer. capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented} map to the matching ExceptionType; all other Cap’n Proto decode/validation kinds map to Failed.
  • Ordinary revoked-cap calls serialize Disconnected when a result buffer is present.
  • Endpoint CALL and RECV on a revoked endpoint serialize Disconnected when a result buffer is present.
  • Live endpoint CALL target errors that arise after a valid endpoint cap is identified serialize as CapException when the caller supplies a result buffer. Endpoint queue-capacity, parameter-slot, call-id, and in-flight capacity failures are reported as Overloaded.
  • Endpoint RETURN through a revoked endpoint reports Disconnected to the original caller when that caller has a result buffer, and reports the receiver-side no-payload/truncated application-exception code.
  • Endpoint RETURN with CAP_SQE_RETURN_APPLICATION_EXCEPTION copies the receiver-provided serialized CapException to the original caller and posts CAP_ERR_APPLICATION_EXCEPTION; if no payload fits, the original caller gets CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED.
  • capos-rt decodes CAP_ERR_APPLICATION_EXCEPTION into ClientError::Application(ApplicationException) and treats Disconnected as breaking the local capability handle. Truncated application exceptions decode as Failed with an empty diagnostic message. Endpoint servers can use capos-rt’s submit_endpoint_return_exception() helper to produce that RETURN shape.

Intentional generic transport paths:

  • Capability lookup failures before a target object is identified still return CAP_ERR_INVOKE_FAILED; these remain transport errors.
  • Malformed SQE metadata, bad params/result buffers, unsupported opcodes, and malformed transfer descriptors remain transport errors.
  • Endpoint delivery/receive/return rollback failures that arise while restoring queues, committing sideband transfers, posting to completion queues, or writing endpoint payloads still use CAP_ERR_INVOKE_FAILED, CAP_ERR_TRANSFER_ABORTED, or CAP_ERR_INVALID_RESULT_BUFFER. Result-buffer validation and endpoint payload copy failures are transport errors because no safe payload destination exists.
  • Existing QEMU coverage proves Disconnected for revocation and one ordinary local Unimplemented runtime path. The endpoint-roundtrip QEMU demo proves local live-endpoint Overloaded serialization for endpoint queue saturation. Cross-process Disconnected is covered for revoked endpoint use, and make run-spawn now proves cross-process endpoint RETURN propagation for Failed, Overloaded, and Unimplemented application exceptions. The same focused spawn proof runs ring-reserved-opcodes, which checks that the RETURN exception flag is rejected outside its valid shape and that an endpoint caller with no result buffer receives CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED.

Target Contract

For this milestone, a kernel path should produce a typed CapException when all of the following are true:

  1. A capability invocation target was identified, or an endpoint operation is acting on an already accepted call/receive relationship.
  2. The failure is attributable to invocation semantics rather than malformed ring transport metadata.
  3. The affected caller supplied a result buffer that can hold a serialized exception.

If the same invocation-level failure occurs with no result buffer or an insufficient result buffer, the CQE result is CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED. If no target capability or accepted IPC relationship exists, the failure stays in the transport namespace. Result buffer validation failure also stays transport-level because no safe payload destination exists.

The exception serialization path respects two per-process resource-profile limits wired from the manifest ResourceProfile fields (both defaulting to 65 536 bytes, the kernel ceiling):

  • ringScratchLimitBytes – bounds the ring input and output scratch buffers. Any CALL with params_len exceeding the effective input limit is rejected with CAP_ERR_INVALID_REQUEST at the transport layer before capability dispatch.
  • replyScratchLimitBytes – bounds the reply scratch used by serialize_application_exception_to_user and serialize_disconnected_exception_to_process. The effective reply limit is min(replyScratchLimitBytes, ringScratchLimitBytes); if the serialized exception exceeds this limit, the caller receives CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED instead. Prior to this wiring, reply scratch was unconstrained at the global 64 KiB ceiling regardless of the process’s ringScratchLimitBytes, which caused spurious TRUNCATED results for tightly constrained processes. Both limits are enforced as of commit 4fc0466d (replyScratchLimitBytes) and commit 1bcfbad4 (ringScratchLimitBytes).

The exception types keep their Cap’n Proto client-response meaning; InvalidArgument is the capOS-only addition introduced with Scheduler Phase D Task 1 (commit cb8c58b1, 2026-05-07). The canonical worked example is SchedulingPolicyCap.setWeight in schema/capos.capnp, whose schema comment states the cap rejects out-of-range or zero values with a CapException of type invalidArgument and does NOT silently clamp:

  • Failed: deterministic invocation failure, deserialization error, or a target-side invariant failure. New caps that validate parameters at the cap boundary should return InvalidArgument instead of Failed for caller bugs; Failed is for “the cap tried and could not”.
  • Overloaded: temporary resource exhaustion after a valid target invocation has begun.
  • Disconnected: target object, endpoint facet, or peer relationship is gone.
  • Unimplemented: target object is live but does not implement the requested method.
  • InvalidArgument: the cap accepted the call (target lives, message parsed) but a parameter value violates the documented contract. Distinct from Failed because the caller is expected to correct its input and retry, not back off or treat the cap as broken. Carried on the wire today through INVALID_ARGUMENT_SENTINEL in kernel/src/cap/ring.rs; userspace decode in capos-rt::client::ApplicationException returns ExceptionType::InvalidArgument.

Exception messages are diagnostic only. They must not include kernel pointers, secret payload bytes, or other process-private data.

Schema Style Guide

Use the three error layers consistently:

LayerUse forDo not use for
CQE statusRing, transport, kernel dispatch, malformed SQE, missing target, invalid buffer, unsupported ABI/version, and other failures where no safe capability-level payload exists.Normal service/domain outcomes.
CapExceptionCapability-level infrastructure failure after a target or accepted endpoint relationship exists: decode failure, unknown method, target gone, temporary overload after dispatch, or target invariant failure.Expected application/domain rejection.
Schema result unionOrdinary application or domain outcome: not found, permission denied by service policy, invalid business object, quota denied as a declared operation result, or accepted conditional failure.Ring/transport failure or generic catch-all exceptions.

Generated clients and future capos-service helpers should preserve this split: CQE status is transport failure, decoded CapException is capability infrastructure failure, and method result unions are the normal application error surface.

Use CQE status for ring transport errors, invalid SQE layout, invalid cap slot, kernel dispatch failure, buffer access failure, unsupported ring ABI/SQE version, malformed transfer descriptors, and other transport-level failures where no safe typed payload boundary exists.

Use CapException for capability infrastructure failure: unknown method, revoked capability, stale endpoint/session, permission or authority failure, resource exhaustion at a capability boundary, service unavailable, and unimplemented method.

Use schema result unions for normal domain/application outcomes: notFound, permissionDenied as a domain decision, invalidInput with domain meaning, alreadyExists, conflict, validation failure, and accepted/rejected business results.

Anti-rules:

  • Do not encode ordinary application outcomes as CapException.
  • Do not expose internal traces, filesystem paths, kernel pointers, or service-local details in cross-service exceptions by default.
  • Do not use generic Text errors where a stable union variant is possible.
  • Do not overload CapException::failed for every domain-level failure.

Preferred schema shape for ordinary domain outcomes:

struct OpenResult {
  union {
    file @0 :File;
    notFound @1 :Void;
    permissionDenied @2 :Void;
    invalidPath @3 :Void;
    unsupported @4 :Void;
  }
}
  • CAP_ERR_TRANSFER_NOT_SUPPORTED is used for transfer-bearing SQEs that the kernel currently dispatches but does not yet process (xfer_cap_count != 0 on kernels where sideband transfer is off).
  • CAP_ERR_INVALID_TRANSFER_DESCRIPTOR is used for structurally validly dispatched transfer SQEs where transfer metadata is malformed:
    • descriptor transfer_mode is not exactly CAP_TRANSFER_MODE_COPY or CAP_TRANSFER_MODE_MOVE;
    • any descriptor reserved bits are set;
    • any descriptor _reserved0 field is non-zero;
    • descriptor region placement (addr + len) is misaligned;
    • descriptor range overflows or cannot be safely bounded.
  • CAP_ERR_TRANSFER_ABORTED is reserved for transaction failure after partial transfer side effects are prepared and must not be observed (all-or-nothing rollback boundary).
  • CAP_ERR_INVALID_REQUEST remains for non-transfer transport malformation (unsupported opcodes for today, unsupported SQE fields not part of the transfer path, and malformed result/payload buffer pairs).

Historical: Pre-Ring cap_call Design

The sections from “Problem Statement” through “Migration Path” describe the original cap_call synchronous syscall that preceded the capability ring. They are preserved for design context; see the “Current CQE Error Namespace” and “CapException Schema” sections above for current behavior.

Problem Statement

Currently, cap_call returns u64::MAX on any error and prints the details to the kernel serial console. The userspace process receives no information about what went wrong – it cannot distinguish “invalid capability ID” from “method not implemented” from “out of memory inside the service.”

Every other capability system separates transport-level errors (bad handle, message validation failure) from application-level errors (the service processed the request and returned a meaningful error). capOS needs both.


Background: How Other Systems Do This

Cap’n Proto RPC Protocol

The Cap’n Proto RPC specification defines an Exception type in rpc.capnp:

struct Exception {
  reason @0 :Text;
  type @3 :Type;
  enum Type {
    failed @0;        # deterministic failure, retrying won't help
    overloaded @1;    # temporary resource exhaustion, retry with backoff
    disconnected @2;  # connection to a required capability was lost
    unimplemented @3; # method not supported by this server
  }
  trace @4 :Text;
}

These four types describe client response strategy, not error semantics. The capnp Rust crate maps them to capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented}.

Cap’n Proto’s official philosophy (from KJ library and Kenton Varda’s writings): exceptions are for infrastructure failures, not application semantics. Application-level errors should be modeled as unions in method return types.

Cloudflare Workers RPC and Spritely/OCapN CapTP reinforce the network-boundary rule: remote promise breakage and error values are diagnostic material, not authority inputs, and debug details such as traces or internal paths can leak sensitive information. Future Workers RPC, Cap’n Web, CapTP, or OCapN-style adapters must deliberately map remote errors into CapException or schema result unions and strip or seal debug detail at the boundary. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP.

Capability OS Error Models

SystemTransport errorsApplication errors
seL4seL4_Error enum (11 values) from syscall returnIn-band via IPC message payload (user-defined)
Zirconzx_status_t (signed i32, ~30 values) from syscallFIDL per-method error type (union in return)
EROS/CoyotosKernel-generated invocation exceptionsOPR0.ex flag + exception code in reply payload
Plan 9 (9P)Connection loss (no in-band transport error)Rerror message with UTF-8 error string
GenodeIpc_error exceptionDeclared C++ exceptions via GENODE_RPC_THROW

Common pattern: a small kernel error code set for transport failures, combined with service-specific typed errors for application failures.

POSIX errno: Why Not

POSIX errno is a global flat namespace of ~100 integers that conflates transport errors (EBADF) with application errors (ENOENT). In a capability system:

  • EACCES/EPERM don’t apply – if you have the capability, you have permission; if you don’t, you can’t even name the resource.
  • A global error namespace conflicts with typed interfaces where errors should be scoped to the interface.
  • No room for structured information (which argument was invalid, how much memory was needed).
  • Not composable across trust boundaries – a callee’s errno has no meaning in the caller’s address space without explicit serialization.

Design

Principle: Two Levels, One Wire Format

Level 1 – Transport errors are returned in the syscall return value. These indicate that the capability invocation mechanism itself failed before the target CapObject was reached. No result buffer is written.

Level 2 – Application errors are returned as capnp-serialized messages in the result buffer. The capability was found and dispatched; the implementation returned a structured error. The syscall return value distinguishes this from a successful result.

Both levels use Cap’n Proto serialization for the error payload (level 2 always, level 1 when there’s a result buffer available). This keeps one parsing path in userspace.

Syscall Return Convention

The cap_call syscall (number=2) currently returns:

  • 0..N – success, N bytes written to result buffer
  • u64::MAX – error (undifferentiated)

New convention:

Return valueMeaning
0..=(u64::MAX - 256)Success. Value = number of bytes written to result buffer.
u64::MAXTransport error: invalid capability ID or stale generation.
u64::MAX - 1Transport error: invalid user buffer (bad pointer, unmapped, not writable).
u64::MAX - 2Transport error: params too large (exceeds MAX_CAP_CALL_PARAMS).
u64::MAX - 3Application error: the capability returned an error. A CapException message has been written to the result buffer. The message length is encoded in the low 32 bits of the value at result_ptr (the capnp message itself).
u64::MAX - 4Application error, but the result buffer was too small or NULL. The error detail is lost; the caller should retry with a larger buffer or treat it as an opaque failure.

The transport error codes are a small closed set (like seL4’s 11 values). New transport errors can be added, but the set should remain small and stable.

CapException Schema

Added to schema/capos.capnp:

enum ExceptionType {
    failed @0;
    overloaded @1;
    disconnected @2;
    unimplemented @3;
    invalidArgument @4;
}

struct CapException {
    type @0 :ExceptionType;
    message @1 :Text;
}

This mirrors Cap’n Proto RPC’s Exception struct, plus a capOS-only invalidArgument variant added with the Scheduler Phase D Task 1 schema slice (commit cb8c58b1, 2026-05-07). Capnp’s upstream Exception.Type remains a closed four-value set; capOS extends CapException because a capability boundary that validates arguments needs a typed signal distinct from failed. The five types describe client response strategy:

  • failed – deterministic failure on the callee side, retrying won’t help. Covers invariant violations, deserialization errors, and any capnp::ErrorKind variant not in the other categories. As of the Phase D Task 1 slice, callee-side argument rejection no longer maps here – new caps that validate inputs at the cap boundary should return invalidArgument instead.
  • overloaded – temporary resource exhaustion (out of frames, table full). Client may retry with backoff.
  • invalidArgument – the request was syntactically a well-formed capnp message but a parameter value violated the cap’s documented contract (e.g. SchedulingPolicyCap.setWeight rejecting weight = 0 or values outside [MIN_WEIGHT, MAX_WEIGHT]). The kernel does not silently clamp; the caller is expected to fix its input and retry, not back off. Today this is signalled by kernel cap modules through a small sentinel-prefix channel in kernel/src/cap/ring.rs (INVALID_ARGUMENT_SENTINEL) because capnp 0.25 has no ErrorKind::InvalidArgument and the enum is #[non_exhaustive]. The dispatcher strips the sentinel before serializing the CapException so the wire form is identical to the four upstream-aligned variants.
  • disconnected – the capability’s backing resource is gone (device removed, process exited). Client should re-acquire the capability.
  • unimplemented – unknown method ID for this interface. Client should not retry.

The message field is a human-readable string for diagnostics/logging. It must not contain security-sensitive information (internal pointers, kernel addresses) since it crosses the kernel-user boundary.

Application-Level Errors in Interface Schemas

Following Cap’n Proto’s philosophy, expected error conditions that a caller should handle programmatically belong in the method return type, not in the exception mechanism.

Example – FrameAllocator can legitimately run out of memory:

struct AllocResult {
    union {
        ok @0 :UInt16;       # result-cap handle index for a MemoryObject
        outOfMemory @1 :Void;
    }
}

interface FrameAllocator {
    allocFrame @0 () -> (result :AllocResult);
    allocContiguous @1 (count :UInt32) -> (result :AllocResult);
}

The caller can pattern-match on the result union without parsing an exception. This is the Zircon/FIDL model: transport errors at the syscall layer, application errors as typed return values.

When to use each:

SituationMechanism
Bad cap ID, stale generation, bad bufferTransport error (syscall return code)
Deserialization failure, unknown methodCapException with failed/unimplemented
Temporary resource exhaustion in dispatchCapException with overloaded
Expected domain-specific errorUnion in method return type
Bug in capability implementationCapException with failed

Kernel Implementation

CapObject trait change

The ring SQE does not carry a caller-supplied interface ID. The trait shape below keeps interface selection out of capability implementations because each capability entry owns one public interface:

#![allow(unused)]
fn main() {
pub trait CapObject: Send + Sync {
    fn interface_id(&self) -> u64;
    fn label(&self) -> &str;
    fn call(
        &self,
        method_id: u16,
        params: &[u8],
        result: &mut [u8],
        reply_scratch: &mut dyn ReplyScratch,
    ) -> capnp::Result<CapInvokeResult>;
}
}

Implementations serialize directly into the caller’s result buffer and return a completion containing the number of bytes written, or Pending for async endpoint calls. Dispatch uses the interface assigned to the target capability entry; normal CALL SQEs do not need to repeat that interface ID. capnp::Error carries ErrorKind with the four RPC exception types. The kernel’s dispatch handler converts Err(capnp::Error) into a serialized CapException message and writes it to the result buffer.

Syscall handler changes

In cap_call(), the error path changes from:

#![allow(unused)]
fn main() {
Err(e) => {
    kprintln!("cap_call: ... error: {}", e);
    u64::MAX
}
}

to:

#![allow(unused)]
fn main() {
Err(CapError::NotFound) => ECAP_NOT_FOUND,
Err(CapError::StaleGeneration) => ECAP_NOT_FOUND,
Err(CapError::InvokeError(e)) => {
    // Serialize CapException to result buffer
    let exception_bytes = serialize_cap_exception(&e);
    if result_ptr != 0 && result_capacity >= exception_bytes.len() {
        copy_to_user(result_ptr, &exception_bytes);
        ECAP_APPLICATION_ERROR
    } else {
        ECAP_APPLICATION_ERROR_NO_BUFFER
    }
}
}

The serialize_cap_exception function maps capnp::ErrorKind to ExceptionType:

capnp::ErrorKindExceptionType
Failedfailed
Overloadedoverloaded
Disconnecteddisconnected
Unimplementedunimplemented
All other variants (deserialization, validation)failed

This matches how capnp-rpc maps exceptions to the wire format.

Userspace API

The init crate (and future userspace libraries) wraps cap_call in a helper that interprets the return value:

#![allow(unused)]
fn main() {
pub enum CapCallResult {
    Ok(Vec<u8>),
    Exception(ExceptionType, String),
    TransportError(TransportError),
}

pub enum TransportError {
    InvalidCapability,
    InvalidBuffer,
    ParamsTooLarge,
}

pub fn cap_call(
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> CapCallResult {
    let ret = sys_cap_call(cap_id, method_id, params, result_buf);
    match ret {
        ECAP_NOT_FOUND => CapCallResult::TransportError(TransportError::InvalidCapability),
        ECAP_BAD_BUFFER => CapCallResult::TransportError(TransportError::InvalidBuffer),
        ECAP_PARAMS_TOO_LARGE => CapCallResult::TransportError(TransportError::ParamsTooLarge),
        ECAP_APPLICATION_ERROR => {
            let (typ, msg) = deserialize_cap_exception(result_buf);
            CapCallResult::Exception(typ, msg)
        }
        ECAP_APPLICATION_ERROR_NO_BUFFER => {
            CapCallResult::Exception(ExceptionType::Failed, String::new())
        }
        n => CapCallResult::Ok(result_buf[..n as usize].to_vec()),
    }
}
}

Future: Batched Calls

When capOS adds batched capability invocations (async rings, pipelining), each request in the batch gets its own result status. The same two-level model applies per-request:

  • Transport error for the batch envelope (invalid ring descriptor, bad capability table) fails the whole batch.
  • Per-request transport errors (individual bad cap_id) fail that request.
  • Application errors are per-request, written to each request’s result slot.

This matches how NFS compound operations and JSON-RPC batch requests work: a transport error on the batch vs per-operation results.


What This Does NOT Cover

  • Error logging/tracing infrastructure. How errors get collected, aggregated, or displayed is a separate concern, owned by docs/proposals/system-monitoring-proposal.md. The kernel currently prints to serial; a future ErrorLog / audit-log capability captures structured error streams there.
  • Retry policy. The ExceptionType hints at retry strategy (overloaded -> retry, failed -> don’t, invalidArgument -> fix input and retry), but the retry logic itself belongs in userspace libraries, not the kernel.
  • Error propagation across capability chains. When capability A calls capability B which calls capability C, and C fails – how does the error propagate back through A? The single-hop transport-vs-application split is defined here; the cross-process spawn and endpoint-return surface that exercises it end-to-end is owned by docs/proposals/service-architecture-proposal.md together with the CAP_SQE_RETURN_APPLICATION_EXCEPTION shape in capos-config/src/ring.rs.
  • Result-buffer sizing. Truncation of serialized CapException payloads when callers under-size their result buffer is tracked as R15 in docs/design-risks-register.md. The per-process ringScratchLimitBytes and replyScratchLimitBytes resource-profile fields now bound the reply scratch used at both serialization call sites, eliminating spurious TRUNCATED results for constrained processes. Each cap contract should still document its expected result-buffer capacity rather than relying on truncation behavior.
  • Deferred release vs revocation. Owned-handle Drop in capos-rt enqueues CAP_OP_RELEASE rather than running synchronously; resource- pressure or revocation-sensitive flows that depend on a Disconnected surface must follow R6 in docs/design-risks-register.md and prefer CapabilityManager.revoke or epoch revocation rather than relying on Drop ordering.
  • Transactional semantics. Whether a failed operation has side effects (partial writes, allocated-but-not-returned frames) is per-capability semantics, not a kernel-level concern. The transfer-rollback boundary carried by CAP_ERR_TRANSFER_ABORTED is the only transport-level all-or-nothing guarantee.

Migration Path

Phase 1: Transport error codes (minimal, no schema changes)

Change cap_call to return distinct error codes instead of u64::MAX for all failures. Update the init crate to interpret them. No new schema types needed – application errors still use u64::MAX - 3 but without a structured payload (treated as opaque failure).

This is backward-compatible: existing userspace code that checks == u64::MAX sees different values for different errors, but any >= u64::MAX - 255 check catches all errors.

Phase 2: CapException serialization

Add ExceptionType and CapException to the schema. Implement serialize_cap_exception in the kernel. Update init to deserialize and display errors. Now userspace gets the exception type and message string.

Phase 3: Per-interface application errors

As interfaces mature, add typed error unions to method return types for expected error conditions. FrameAllocator::allocFrame returns AllocResult instead of bare UInt64. The exception mechanism remains for unexpected failures.


Design Rationale

Why mirror capnp RPC’s Exception type instead of inventing our own? Cap’n Proto already defines a well-thought-out exception taxonomy. The four types (failed, overloaded, disconnected, unimplemented) map directly to capnp::ErrorKind in Rust. Using the same vocabulary means capOS capabilities can eventually participate in capnp RPC networks without translation. It also means the Rust compiler enforces exhaustive matching on ErrorKind variants that matter.

Why not put error codes in the syscall return value only (like seL4)? seL4’s 11 error codes work because seL4 kernel objects are simple and fixed-function. capOS capabilities are arbitrary typed interfaces – a file system, a network stack, a GPU driver. The error vocabulary is open-ended. Encoding all possible errors as syscall return values would either require an ever-growing enum (fragile) or lose information (back to errno’s problems). The capnp-serialized CapException in the result buffer gives unbounded expressiveness without changing the syscall ABI.

Why not use capnp exceptions for everything (skip the transport error codes)? Because transport errors happen before the capability is reached. There’s no CapObject to serialize an exception. The kernel would have to synthesize a capnp message on behalf of a non-existent capability, which is wasteful and semantically wrong. A small integer return code is cheaper and more honest about what happened.

Why not define a generic Result(Ok) wrapper in the schema? Cap’n Proto generics only bind to pointer types (Text, Data, structs, lists, interfaces), not to primitives (UInt32, Bool). A Result(UInt64) for allocFrame wouldn’t work. Per-method result structs with unions are more flexible and don’t hit this limitation. The cost is a bit more schema boilerplate, which is acceptable given that capOS has a small number of interfaces.

Why string-based messages (like Plan 9) instead of structured error fields? String messages are adequate for diagnostics and logging. Structured error data belongs in the typed return unions (Phase 3), where the schema enforces what fields exist. Putting structured data in CapException would duplicate the schema’s job and encourage using exceptions for flow control, which Cap’n Proto explicitly warns against.

Security Review and Formal Verification Proposal

How to reason about the correctness and security of the capOS kernel and its trust boundaries in a way that fits a research OS – pragmatic tooling now, targeted verification where it pays off, no aspirational seL4-style full- kernel proofs. The docs/research/sel4.md survey already concluded that Isabelle/HOL-over-C verification does not transfer to Rust and that the design constraints matter more than the proof artefact. This proposal codifies that conclusion into a concrete tooling and process plan.

This proposal uses CWE for concrete vulnerability classes, CAPEC for attacker patterns, Rust language rules / unsafe-code guidance for low-level coding rules, Common Criteria protection-profile concepts for OS security functions, ITU-T X.800/X.805 security-services taxonomy as a completeness checklist, and capability-kernel practice (seL4/EROS-style invariants) for authority, IPC, object lifetime, and scheduler properties. Web-application checklists are not the baseline for OS design review.

Grounding sources:

  • MITRE CWE for root-cause weakness labels: CWE-20 explicitly covers raw data, metadata, sizes, indexes, offsets, syntax, type, consistency, and domain rules; CWE also marks broad classes such as CWE-20 and CWE-400 as discouraged for final vulnerability mapping when a more precise child fits.
  • MITRE CAPEC for attacker behavior, especially input manipulation (CAPEC-153), command injection (CAPEC-248), race exploitation (CAPEC-26 / CAPEC-29), and flooding/resource pressure (CAPEC-125).
  • Rust Reference and Rust 2024 Edition Guide for unsafe-block and unsafe_op_in_unsafe_fn obligations.
  • seL4 MCS and the existing capOS research notes for capability-authorized access to kernel objects and CPU time.
  • Common Criteria General Purpose Operating System Protection Profile for OS access-control, security-function, trusted-channel/path, and user-data protection concepts. capOS is not trying to certify against it; the PP is a vocabulary check for what an OS security review should not omit.
  • ITU-T Rec. X.800 (03/91) Security architecture for OSI and X.805 (10/03) Security architecture for systems providing end-to-end communications for the layered security-services taxonomy: authentication, access control, non-repudiation, data confidentiality, data integrity, availability, privacy × infrastructure/services/ applications planes × end-user/control/management planes. Used as a completeness matrix: if a proposal claims to cover security but leaves one cell unaddressed (e.g. “we have confidentiality but no non-repudiation story for the management plane”), review should flag the gap. Also ITU-T X.810-X.816 for the individual framework breakdowns — authentication (X.811), access control (X.812), non-repudiation (X.813), confidentiality (X.814), integrity (X.815), audit and alarms (X.816).

1. Philosophy and Scope

capOS is explicitly a research OS whose design principle is “schema-first typed capabilities, minimal kernel, reuse the Rust ecosystem.” Three consequences shape this proposal:

  1. The schema is part of the TCB. A bug in the .capnp schema, or in the way generated code is patched for no_std, is exactly as dangerous as a bug in the kernel. The schema, the capnpc build pipeline, and the generated code all need review attention – not only hand-written kernel code.
  2. The kernel should stay small. “Everything else is a capability” means the TCB is naturally bounded. Verification effort scales with TCB size, so resisting kernel bloat is itself a security property.
  3. The interface is the permission. Access control lives in capnp method definitions and in userspace cap wrappers (a narrow cap is a different CapObject), not in kernel rights bitmasks. Review must confirm that the kernel never short-circuits this: no ambient authority, no method that bypasses CapObject::call, no syscall that exposes an object without a capability handle.

Non-goals:

  • Full functional-correctness proof of the kernel à la seL4. Infeasible in Rust today, and the payoff is low for a research system whose surface area is still changing.
  • Proving information-flow / confidentiality properties end-to-end.
  • Certifying a specific configuration for external deployment.

2. Trust Boundaries and Threat Model

Enumerating the boundaries forces every future review to ask “which boundary does this change touch?” and picks out the code paths that matter.

TCB Statement

Current demo/proof TCB is broader than the target production TCB. Security claims must name which one they rely on.

Current demo/proof TCB:

  • kernel, including scheduler, memory management, capability dispatch, endpoint IPC, in-kernel networking, smoltcp runtime, line discipline, Telnet IAC filtering, PCI/virtio-net smoke code, and kernel-owned DMA buffers;
  • capos-config, schema/codegen output, manifest validator, and checked-in generated bindings;
  • capos-rt runtime transport, userspace entry/panic/allocator glue, and typed handle release behavior;
  • standalone init, AuthorityBroker, SessionManager, CredentialStore, shell launcher, restricted launcher, and demo services used by the active manifest;
  • focused QEMU manifests, host harnesses, and build tools used to construct and validate each proof image;
  • QEMU virtio devices and host-local loopback forwarding for networking proofs.

Target production TCB:

  • kernel primitives that enforce address-space isolation, capability tables, generation/epoch checks, ring transport validation, scheduler/thread safety, interrupt/timer correctness, and explicit DMA/IOMMU policy;
  • schema definitions, generated-code owner, shared ABI constants, and the build/signature path for production boot images;
  • minimal init/supervisor authority needed to assemble the service graph, grant narrowed caps, restart services, and expose scoped status/audit;
  • credential, session, broker, key-vault, audit, and remote-ingress services that directly decide authentication, authorization, disclosure, and key use;
  • production device managers, network stack, and storage services only to the extent they hold the corresponding device, network, or persistence authority.

Target non-TCB components should include ordinary applications, untrusted service binaries, domain libraries without privileged caps, shell children, and network peers. The target is not reached while default networking runs in the kernel TCB, the focused Telnet terminal-hosting fixture still relies on kernel TCP terminal handoff, SSH uses fixture/dev key material, or remote shells share pre-auth and post-auth process authority.

Current boundaries

BoundaryWho trusts whomCode that enforces it
Ring 0 ↔ Ring 3kernel trusts nothing from userkernel/src/mem/paging.rs, kernel/src/mem/validate.rs, arch/x86_64/syscall.rs; exercised by init/ and demos/*
Kernel ↔ user pointerkernel validates address + PTE perms under the process VM lockAddressSpace::validate_user_buffer, copy_from_user, copy_to_user, and legacy validate_user_buffer for current-CR3 diagnostics
Manifest ↔ kernelkernel parses capnp manifest at bootcapos-config::manifest, called from kmain
Build inputs ↔ TCBkernel trusts schema/codegen/build artifactsschema/capos.capnp, build.rs, Cargo.lock, Makefile
Host tools ↔ filesystem/processtools must not let manifest/config input escape intended host boundariestools/mkmanifest, generators, CI scripts
ELF bytes ↔ kernelkernel parses user ELF to map segmentscapos-lib::elf
User ring ↔ kernel dispatchkernel trusts no SQ statekernel/src/cap/ring.rs
CapObject::call wire formatkernel trusts no params bytesgenerated capnp decoders + impls
Process ↔ process IPCkernel routes calls between mutually isolated address spaces and trusts neither side’s bufferskernel/src/cap/endpoint.rs, kernel/src/cap/ring.rs, kernel/src/sched.rs
Device DMA ↔ physical memorykernel and device-manager trust no userspace driver-supplied device address, stale DMA handle, or stale interrupt routekernel/src/dma_backend.rs, kernel/src/device_dma.rs, kernel/src/device_manager/, and the DDF cap objects select a DMA backend at boot, expose manager-owned bounce-buffer handles when no trusted remapping domain exists, hide host physical addresses/IOVAs from userspace providers, and bind DeviceMmio/DMAPool/Interrupt lifecycle to generation-checked ownership ledgers. The QEMU Intel path has bounded per-device remapping evidence; current no-IOMMU cloud/GCE paths are brokered bounce-buffer authority and still do not claim hostile bus-master isolation.
WASI host adapter sandboxuserspace wasm-host runs untrusted Preview 1 payloads inside the vendored wasmi interpreter; capOS trusts no wasm import beyond the explicit grant set on HostStatecapos-wasm/src/wasi/preview1.rs translates wasm calls into typed Console/Timer/BootPackage/EntropySource invocations; per-instance argv text grants and random_get against the kernel EntropySource cap honor manifest-declared scope. Ungranted Preview 1 calls return ERRNO_NOSYS rather than fabricating authority. The boundary surface today covers W.1-W.4 (substrate, stdout-only stubs, argv grant, random_get production wiring); wasi_args and entropy fills are bounded by WASI_ARGS_MAX_* and RANDOM_GET_MAX_BYTES. Filesystem, environment beyond argv, full clocks, and remaining Preview 1 surface remain un-implemented refusals.
POSIX adapter v0 substratelibcapos-posix exposes a narrow fork-for-exec / pipe / socket / clock surface to C code; capOS trusts that the recording-shim window stays scoped to the synthetic child branch and that explicit grants pass through ProcessSpawner.spawnlibcapos-posix per-process static fd table, single-thread errno cell, kernel UdpSocket/Timer/Pipe clients, and the recording-shim Move-grant stdio_<N> path. The pseudo-child branch never calls _exit() on execve() failure; surface remains research/v0, not a full POSIX TCB.
Persistent config overlay ↔ initinit trusts no bytes from the on-disk system/config/overlay.bin; it validates the overlay version, SHA-256 content hash, the base manifest’s declared extension points (allowed service caps, max additional services, minOverlayEpoch, settings allowances), and base-pin non-collision before composing, and rejects the whole overlay (booting the base manifest floor) on any violationcapos-config::manifest (SystemConfigOverlay::from_capnp_bytes + compose_onto), init/src/main.rs apply_config_overlay; proof make run-installable-overlay
Hardware cap teardown auditkernel must record every acquire, release, rollback-detach, Drop-detach-failure, explicit driver-crash trigger, reset/disable trigger, interrupt-waiter trigger, and explicit bounded proof-buffer free for the DDF caps so post-mortem review can correlate device-manager state with cap lifecyclekernel/src/cap/hardware_audit.rs emit helper invoked from device_mmio.rs, interrupt.rs, dma_pool.rs, dma_buffer.rs, plus the devicemmio_grant_source.rs, dmapool_grant_source.rs, and interrupt_grant_source.rs userspace-grant rollback paths. The bounded DMAPool grant source emits DmaPool acquire for its manifest grant; DMAPool.allocateBuffer can mint one manager-attached proof DMABuffer result cap with its own acquire/free-buffer/release-after-free audit, while duplicate proof-buffer allocation and real DMA allocation remain blocked. Parent-first DMAPool release records a pending parent detach and completes after typed DMABuffer.freeBuffer frees the proof page, after cap release frees the proof page, or after successful DMABuffer driver-crash/reset-disable cleanup frees that page, preserving the one final DmaPool release audit. The real driver-crash teardown trigger entry points on DeviceMmioCap, InterruptCap, DmaPoolCap, and DmaBufferCap (device_manager::trigger_driver_crash_for_* plus each cap’s on_driver_crash) emit event=driver-crash exactly once per successful detach; stale rerun stays silent. The reset/disable trigger entry points on all four cap types (trigger_reset_disable_for_* plus each cap’s on_reset_disable) mirror that single-emit policy with event=reset-disable. The first cap-specific interrupt-waiter trigger on InterruptCap (trigger_interrupt_waiter_for_interrupt plus InterruptCap::on_interrupt_waiter) mirrors the same policy with exactly one event=interrupt-waiter audit record for the first successful detach. DMABuffer.freeBuffer emits exactly one event=free-buffer record on the successful explicit proof-buffer free, invalidates later DMABuffer.info, and leaves the later cap release as a no-op detach. The DMA pool reset path keeps the zero-live/quiesced/scrubbed evidence precondition, and the DMA buffer reset path reuses the bounded FreeBuffer page cleanup path before evidence-gated parent-pool cleanup. All explicit-trigger impls use the load-bearing exhaustive match outcome.detach_label() { "ok" => emit event, "noop" => silent, label => kprintln + DropDetachFailed } shape, so any future non-"ok"/non-"noop" outcome label still surfaces as a DropDetachFailed audit rather than being silently dropped. Each event is a cap-audit: key=value line on COM1 carrying the cap tag (interface id), event class, BDF, owner, and the relevant generation fields, and emit_cap_audit also appends it to a bounded volatile ring. HardwareAuditLog.snapshot exposes the latest retained records to userspace with drop-oldest retention while reporting volatile-only persistence, unsigned signatures, manifest-granted read-only snapshot access, production subscriber admission policy not implemented, and the volatile snapshot truncation contract; a QEMU-only local-ring proof asserts all four truncation labels without mutating the live ring. Durable storage, signing, and production subscriber admission remain future work. The legacy hardware-cap-release: line is retained alongside the audit line.

Attacker model

  • Untrusted service binaries. Today’s services are checked into the repo, but the manifest pipeline is meant to load arbitrary binaries eventually. Assume every byte of a service’s SQEs, params buffers, result buffer pointers, and return addresses is attacker-controlled.
  • Untrusted manifest. Once manifests are produced outside the repo (e.g. generated from CUE fragments, passed in as a Limine module), the manifest parser must reject every malformed input without panicking.
  • Resource exhaustion. Once multiple mutually-untrusting services run, a service can attack by filling rings, endpoint queues, capability tables, frame pools, scratch arenas, logs, or CPU time. Boundedness and accounting are security properties, not performance polish.
  • Build input drift. The schema/codegen path is already part of the TCB. External build inputs such as the bootloader checkout, Rust dependencies, capnp code generation, and generated-code patching must be reproducible enough that review can tell what changed.
  • Host tooling input. Build tools and generators run with developer/CI filesystem access. Treat manifest/config-derived paths and command arguments as untrusted until bounded to the intended directory and execution context.
  • Residual state and disclosure. Kernel logs, returned buffers, recycled frames, endpoint scratch space, and generated artifacts must not expose kernel pointers, stale bytes from another process, secrets, or build-system paths that increase attacker leverage.
  • Hostile interrupts / preemption. The scheduler preempts at arbitrary points. Any kernel invariant that is only transiently true must be held under the right lock or with interrupts disabled.
  • Out of scope (for now): physical attacks, speculative-execution side channels, malicious hardware, IOMMU bypass from DMA devices. These become in-scope once the driver stack lands; revisit the threat model then.

Threat Actor Matrix

ActorCurrent scopeCurrent treatmentProduction gate
Local physical attackerOut of scope.The prototype does not claim protection against physical memory access, bus probing, evil-maid boot replacement, cold boot, firmware compromise, or direct console access.Secure/measured boot, sealed storage keys, physical console policy, and hardware-rooted attestation before production claims.
Malicious DMA deviceOut of scope for hostile hardware; in scope only as confused userspace around cooperative QEMU virtio.The virtio-net smoke assumes QEMU-provided cooperative virtio hardware and kernel-owned bounce buffers. Without an IOMMU, a bus-mastering device can DMA arbitrary RAM.IOMMU-backed DMA domains or a documented hardware policy that forbids untrusted bus-mastering devices before userspace drivers or production hardware claims.
Malicious boot manifestPartially in scope.Manifest decoding/validation must fail closed and not panic. A manifest accepted by the kernel/init is still trusted to define the initial service graph and bootstrap grants.Signed/authorized manifest policy, boot-package integrity, and review-visible payload hashes before accepting manifests from outside the repo or operator-controlled build path.
Compromised init/supervisorPartially out of scope for current proofs.Current demo TCB includes init and manifest-declared trusted services. If init is compromised, it can misgrant authority within the bootstrap service graph.Minimize init, split supervisors, require narrow grant construction, audit graph changes, and make restart/update authority explicit.
Compromised service with narrow capsIn scope.Address-space isolation, cap-table lookup, generation checks, ring validation, transfer checks, and resource ledgers should constrain it to granted authority.Complete hostile smokes for transfer modes, resource exhaustion, panic surfaces, and revoke/epoch behavior per service class.
Hostile network peerIn scope only for loopback demo robustness, not production remote access.Telnet is plaintext loopback-only. SSH gateway work is fixture/prototype status without complete encrypted transport, durable key/account storage, full OpenSSH userauth/channel handling, or complete audit gates.Non-loopback remote shells stay blocked until SSH transport/auth/key/audit/storage gates pass and pre-auth/post-auth authority is isolated or otherwise proven constrained.
Hostile local web client of the remote-session-ui bridgeIn scope.Today’s bridge shares one upstream capOS session across all loopback HTTP clients with no per-browser session, missing-Origin short-circuit, and non-constant-time secret comparison.Per-browser BrowserSession cookie, CSRF/Host/Content-Type guards, strict CSP with the matching inline-script/style refactor, constant-time comparators, rate limiting, and the carry-over Tauri capability-allowlist minimization, all per remote-session-ui-security-proposal.md.
Malicious build dependency or toolPartially in scope.Lockfiles, generated-code checks, pinned Cap’n Proto/Limine/docs tools, and dependency-policy checks make drift review-visible, but Rust nightly, QEMU/xorriso/OVMF, and final image hashes are not fully pinned.Date/hash-pinned toolchains, recorded host tool versions, image/payload hashes, and reproducible production build path.

ITU-T X.800 security-services completeness matrix

X.800 enumerates five security services; X.805 extends the list with availability and privacy. Each review of a proposal or kernel change should be able to say which service it touches, or that it touches none. The point is not to implement every cell — capOS explicitly defers some (end-to-end non-repudiation, for example) — it is to make gaps explicit.

X.800/X.805 servicecapOS surface that provides it
Authentication (peer entity, data origin)user-identity-and-policy-proposal.md X.1254 LoA tiers; passkey + password credentials in boot-to-shell-proposal.md; certificate-based peer auth in certificates-and-tls-proposal.md (mTLS); future attestation in cryptography-and-key-management-proposal.md AttestationKeySource.
Access controlStructural: the capability model itself. The interface is the permission; wrapper caps attenuate; CapTable cannot be bypassed. Policy layer: AuthorityBroker (X.812 ADF) over CapObject::call (X.812 AEF).
Data confidentialityTransport: certificates-and-tls-proposal.md TlsSocket. At rest: volume-encryption-proposal.md. In memory: address-space isolation + SMAP + SMEP.
Data integrityTransport: TLS AEAD. At rest: authenticated block encryption (SymmetricAlgorithm.aes256GcmSiv etc.). Manifest/boot: signed manifests (storage-and-naming-proposal.md Open Q #5). In-transit schema: Cap’n Proto wire format + bounds-checked decoders.
Non-repudiation (origin, delivery)Partial. Signed audit records (system-monitoring-proposal.md + cryptography-and-key-management-proposal.md audit key purpose). End-to-end non-repudiation for user actions is deferred until signed sessions exist.
Availability (X.805)Resource ledgers, bounded rings, CAP_OP_RELEASE, supervisor restart policy, rate limiters on monitoring ingestion. DoS resistance is a review dimension, not a separate subsystem.
Privacy (X.805)Principal pseudonymity (user-identity-and-policy-proposal.md pseudonymous profile), audit-record redaction, monitoring “payload capture is exceptional” default.

The matrix is a checklist, not a claim of completeness: individual proposals remain authoritative about what they do and don’t provide.

3. Tiered Approach

Four tiers, cheapest first. Each tier is independently useful, and later tiers assume earlier ones are in place.

Tier 1 – Hygiene and CI (cheap, high value)

These are the controls that make every other tier work. The only checked-in GitHub Actions workflow is .github/workflows/ci.yml; it runs formatting, host tests, cargo build --features qemu, make capos-rt-check, make generated-code-check, make dependency-policy-check, and make workflow-check. The QEMU smoke job installs its own boot tools and runs make plus make run, but remains non-blocking, so it is not yet a required boot assertion. No separate clippy, miri, fuzz, or Kani workflow files exist yet – those are scheduled per the track table below.

  • Continuous integration via GitHub Actions (or equivalent). Current baseline: make fmt-check, cargo test-config, cargo test-ring-loom, cargo test-lib, cargo test-mkmanifest, cargo build --features qemu, make capos-rt-check, make generated-code-check, make dependency-policy-check, and make workflow-check. Remaining CI work: treat QEMU boot as a required CI gate once runtime flakiness is acceptable, then add the security policy jobs below.
  • cargo clippy --all-targets -- -D warnings across workspace members, with a curated set of clippy::pedantic / clippy::nursery lints that pay off for kernel code (clippy::undocumented_unsafe_blocks, clippy::missing_safety_doc, clippy::cast_possible_truncation, etc.). Do NOT enable all of pedantic blindly – review each lint and either enable it or add a rationale comment.
  • cargo-deny for license and advisory gating; cargo-audit for the RustSec advisory DB against Cargo.lock. Dependencies include capnp, spin, x86_64, limine, linked_list_allocator – all externally maintained.
  • cargo-geiger report of unsafe surface area per crate, checked in as a snapshot and diffed in CI so growth is visible in PRs.
  • Deny unsafe_op_in_unsafe_fn (already required by edition 2024; make sure it stays on) and missing_docs on public kernel items where it is not already the case.
  • Dependency review discipline: every new dep needs a one-line rationale in the commit message and a check that it is no_std-capable, maintained, and does not pull in a surprise async runtime or heavy transitive graph.
  • No-std dependency rubric: kernel/no_std additions require an explicit compatibility check that core/alloc paths do not regress to std through default feature drift, and class ownership is recorded against docs/trusted-build-inputs.md.
  • Boot/build input pinning: pin external bootloader/tool downloads to an auditable revision or checksum. Branch names are not enough for TCB inputs. CI should fail when generated capnp bindings or no-std patching change outside an intentional schema/codegen update.
  • Untrusted-path panic audit: panic!, assert!, .unwrap(), and .expect() are acceptable during bring-up, but every path reachable from manifest bytes, ELF bytes, SQEs, params buffers, result buffers, and future IPC messages needs either a fail-closed error or a documented halt policy.
  • Hardware protection smoke tests: boot under QEMU with SMEP/SMAP-capable CPU flags and assert CR4.SMEP/CR4.SMAP once paging is initialized. Every explicit user-memory dereference must be wrapped in a short STAC/CLAC window once SMAP is enabled.

Tier 2 – Targeted dynamic analysis

Aimed at the host-testable pure-logic crates (capos-lib, capos-config) where the Rust toolchain just works. No kernel changes required.

  • Miri on the cargo test-lib and cargo test-config suites. Catches UB in pure-logic code: invalid pointer arithmetic, uninitialized reads, bad provenance, unsound unsafe. The FrameBitmap and CapTable tests in particular push against slot indexing, generation counters, and raw &mut [u8] handling – exactly what miri is good at.
  • proptest (or quickcheck) on:
    • capos-lib::elf::parse – random bytes / random perturbations of a valid header must never panic and must refuse anything that isn’t a correctly formed user-half ELF64.
    • capos-lib::frame_bitmap – interleaved sequences of alloc, alloc_contiguous, free, mark_used preserve the invariant free_count == popcount(bitmap == 0) and never double-free.
    • capos-lib::cap_table – insert/remove/lookup sequences preserve “every returned id resolves to its insertion-time object, and stale ids are rejected.”
    • capos-config::manifest encode/decode round trip on arbitrary manifests.
    • Schema round-trip tests in capos-config/tests/: today remote_capnp_rpc_dto_roundtrip.rs pins the remote capnp-rpc DTO wire shape, and remote_paperclips_dto_roundtrip.rs (10 tests) pins the Remote Session Paperclips DTO wire shape ahead of the future gateway/worker/browser bridge that will marshal traffic through them. New shared-DTO families should land alongside similar round-trip coverage so schema drift is review-visible.
  • cargo fuzz harnesses (libFuzzer). The current fuzz/fuzz_targets/ set is seven targets: elf_parse.rs, manifest_capnp.rs, mkmanifest_json.rs, sqe_validation.rs (ring SQE wire validator via capos_config::ring::sqe_wire_validation_error), telnet_filter.rs, telnet_filter_roundtrip.rs, and line_discipline.rs. The Telnet round-trip oracle exists alongside the structural Telnet filter target because the round-trip variant found a real EXOPL parsing bug (docs/changelog.md). These run outside CI (they never terminate) but have seed corpora under fuzz/corpus/ and can be exercised in fixed budgets via make fuzz-build and make fuzz-smoke.
  • Sanitizers on host tests: make sanitizer-host-tests runs AddressSanitizer over the capos-lib and capos-config host suites under the repo-pinned nightly (zero findings to date). ASan is indeed cheap – it needs no -Zbuild-std. ThreadSanitizer (make sanitizer-host-tests-tsan) is wired but currently blocked by an upstream cargo -Zbuild-std + build-script limitation when the sanitizer target equals the host triple; see Track S.17 for the recorded reproduction.

Tier 3 – Concurrency model checking

The capability ring is a lock-free single-producer / single-consumer protocol using volatile reads, release/acquire fences, and a shared head/ tail pair. It is the most likely source of subtle memory-ordering bugs and is also the most isolated – a perfect fit for model checking.

  • Loom on a host-buildable wrapper of the ring protocol. Extract the producer/consumer state machine from capos-config::ring into a form where atomics can be swapped for loom::sync::atomic, and write Loom tests that enumerate all interleavings of producer/consumer for small ring sizes (2–4 slots). Properties to check:
    • No CQE is lost.
    • No CQE is double-delivered.
    • The sq_head/sq_tail and cq_head/cq_tail pointers never observe a state that implies tail - head > SQ_ENTRIES.
    • The userspace ring “corrupted producer state” fail-closed policy from prior review-finding task records holds under adversarial interleavings.
  • Shuttle as a lighter alternative for regression-style tests once the specific bugs are known; cheaper per run, randomised rather than exhaustive. Good for long-running overnight jobs.

Loom coverage here is disproportionately valuable: it substitutes for the SMP-hardness work the project has explicitly deferred, and it exercises exactly the ordering that TOCTOU-style bugs hide in.

Tier 4 – Bounded verification of specific invariants

Not a full-kernel proof. Targeted, property-specific, one-module-at-a-time.

  • Kani (bounded model checking for Rust, via CBMC). Good fit for small, heap-free, arithmetic-heavy functions. Candidate modules:
    • capos-lib::cap_table – prove that for all insert; remove; insert' sequences under a u8 generation counter, a stale CapId never resolves. Bound: table size ≤ 4, generation window ≤ 256.
    • capos-lib::frame_bitmap – prove that for all bitmap sizes up to N bytes, alloc_frame followed by free_frame of the same frame restores the original bitmap and free_count.
    • capos-lib::elf::parse bounds checks: prove that every index into the program header table is < len, given the validated phentsize and phnum.
  • Verus (SMT-based Rust verifier, active development at MSR) for invariants that Kani can’t handle ergonomically, particularly those involving loops and ghost state. Worth tracking but don’t commit to it yet – the proof-engineering cost is real, and the tool is still young. Revisit once IPC lands and the kernel has stable public APIs.
  • Creusot / Prusti are alternatives in the same space. Do not invest in more than one SMT-based verifier; pick whichever has the best story for no_std + alloc code when Tier 4 starts.

Deliberately out of scope: Isabelle/HOL, Coq proofs, Frama-C. They would require re-encoding Rust in a foreign semantic framework with no established Rust front-end mature enough for kernel code.

4. Security Review Process

REVIEW.md is the rules document and docs/tasks/** is the open remediation and review-finding ledger. REVIEW.md contains the common security checklist that applies across kernel, userspace, host tooling, generators, and CI. The per-boundary prompts below are an expansion of that common checklist for OS-specific code paths.

CWE/CAPEC tagging policy

Security findings should carry CWE metadata when the mapping is specific enough to help a reviewer or future audit. Do not force a CWE into every title.

  • Prefer Base/Variant CWE IDs when the root cause is known: CWE-770 for unbounded allocation, CWE-88 for argument injection, CWE-367 for a concrete validation-to-use race, CWE-416 for a real use-after-free.
  • Use Class IDs as temporary or umbrella labels: CWE-20 for “input was not validated enough” before the missing property is known; CWE-400 for general resource exhaustion only when the enabling mistake is not more precise.
  • Use capability-kernel invariants instead of weak CWE mappings for design properties such as “no ambient authority”, “cap transfer happens exactly once”, “revocation cannot leave stale authority”, and “scheduling context donation cannot fabricate CPU authority”. Cite CWE-862/CWE-863 only when the issue is actually a missing or incorrect authorization check.
  • Use CAPEC for the attacker pattern when useful: input manipulation, command injection, race exploitation, flooding, or path/file manipulation. CAPEC is not a substitute for the CWE root-cause tag.

Current checklist coverage:

AreaPrimary tagsReview intent
Structured input validationCWE-20, CWE-1284–CWE-1289 when preciseValidate syntax, type, range, length, indexes, offsets, and cross-field consistency before privileged use
Filesystem pathsCWE-22, CWE-23, CWE-59Keep host-tool paths inside intended roots across absolute paths, traversal, symlinks, and file-type confusion
Commands/processesCWE-78, CWE-88Avoid shell interpolation; constrain binaries and arguments
Numeric/buffer boundsCWE-190, CWE-125, CWE-787Check arithmetic before pointer, slice, copy, ELF segment, and page-table use
Resource exhaustionCWE-770 preferred; CWE-400 broadBound queues, allocations, retries, spin loops, frames, scratch arenas, cap slots, and CPU budget
Exceptional pathsCWE-703, CWE-754, CWE-755; CWE-248 only for uncaught exceptionsFail closed on malformed or adversarial input; avoid trust-boundary panic/abort
Authorization/cap authorityCWE-862, CWE-863 plus capOS invariantsVerify capability ownership, generation, object identity, address-space ownership, and transfer policy
Concurrency/TOCTOUCWE-362, CWE-367, CWE-667Preserve lock ordering, interrupt masking, page-table stability, and validation-to-use assumptions
Lifetime/reuseCWE-416, CWE-664, CWE-672Prevent stale caps, stale kernel stacks, stale frames, and expired IPC state from being used
Disclosure/residual dataCWE-200, CWE-226Prevent logs, result buffers, frames, scratch arenas, and generated artifacts from leaking stale or sensitive data
Supply chain / generated TCBcapOS TCB invariant; use CWE only for concrete bugPin or review-visible drift for bootloader, dependencies, schema/codegen, generated code, and patching

Per-boundary review checklist

  • Syscall surface change (arch/x86_64/syscall.rs):
    • Every register-passed argument is treated as attacker-controlled.
    • No user pointer is dereferenced without an AddressSpace-locked copy/read helper or an explicitly documented equivalent stability guarantee.
    • Numeric conversions, copy lengths, and pointer arithmetic are checked before constructing slices or entering any direct user-access scope.
    • Kernel stack pointer and TSS.RSP0 invariants are preserved.
    • The syscall count stays bounded; a new syscall has an SQE-opcode alternative considered and explicitly rejected with rationale.
  • Ring dispatch change (kernel/src/cap/ring.rs):
    • SQ bounds check and per-dispatch SQE limit still enforced.
    • Corrupted SQ state fails closed (never re-processes the same bad state on the next tick).
    • No allocation in the interrupt-driven path beyond what the owning task record or panic-surface inventory explicitly accepts.
    • Result buffers and endpoint scratch buffers cannot leak stale bytes beyond the returned completion length.
  • User buffer validation change (kernel/src/mem/paging.rs, kernel/src/mem/validate.rs):
    • Address range check precedes PTE walk.
    • PTE flags checked: present, user, and write (if the buffer is written).
    • For process-owned buffers, validation and copy/read hold the process AddressSpace mutex. Any current-CR3 validator caller must document its own page-table stability guarantee.
  • ELF loader change (capos-lib::elf):
    • Every field bounded before use (phentsize, phnum, p_offset, p_filesz, p_memsz, p_vaddr).
    • Segments confined to the user half.
    • Overlap check preserved.
    • Integer arithmetic uses checked add/subtract before deriving mapped addresses, file slices, or zero-fill ranges.
  • Manifest change (capos-config::manifest):
    • Every optional field is either present or the service is rejected.
    • Name / binary / cap source strings are length-bounded.
    • Unknown / unsupported numbers in CUE input fail-closed with a path- specific error.
    • Capability grants are checked as an authority graph before any rejected graph can start a service.
  • Schema change (schema/capos.capnp):
    • Backward-compatible with existing wire format, or migration documented.
    • Every new method has an explicit capability-granting story (who mints the cap that lets this method be called?).
    • Generated code no_std patching still applies.
  • Host tool or generator change (tools/*, build.rs, CI scripts):
    • Manifest/config-derived paths cannot escape intended directories through absolute paths, traversal, symlinks, or file-type confusion.
    • External command execution uses explicit binaries and argument APIs, not shell interpolation of untrusted strings.
    • Generated outputs are review-visible and fail closed on malformed inputs.
    • Generated files and diagnostics do not disclose secrets, absolute paths, or stale build outputs beyond what the developer intentionally requested.
  • Unsafe block added or expanded: Tier 1 clippy lints plus REVIEW.md §“Unsafe Usage” checklist already cover this; the review should cite the specific invariant being maintained in the commit message.

Threat-model refresh

On every stage completion (Stage 6 IPC, Stage 7 SMP, first driver landing, first time a manifest comes from outside the repo), re-run §2 of this document and update it. The list of trust boundaries grows over time; the proposal decays if it doesn’t grow with the code.

Periodic full audit

Once per stage, schedule a focused audit pass:

  1. Re-verify every boundary’s code is still enforced at its documented entry point (no new bypass path).
  2. Re-run all Tier 2/3 jobs with the latest toolchain (catches tool-upgrade regressions).
  3. Walk through open review-finding task records and confirm each is still correctly classified (still open, fixed, explicitly accepted, blocked, or on-hold).
  4. Record the audit date and outcome in the relevant task records or a focused closeout task, matching the repository timestamp convention.

5. Concrete Verification Targets

Ordered by value and feasibility. Each one is a specific, bounded piece of work a contributor can pick up without needing to redesign the kernel.

#TargetTierPropertyBlocker
1capos-lib::cap_table4 (Kani)Stale CapId never resolves after slot reuse within the generation windowNone
2capos-lib::frame_bitmap4 (Kani)alloc/free preserve free_count invariant; no double-allocNone
3capos-lib::elf::parse2 (proptest + fuzz)No panic on arbitrary input; only well-formed user-half ELF64 acceptedNone
4capos-config::manifest2 (proptest + fuzz)Decode/encode round-trip; malformed input rejected without panicNone
5Ring SPSC protocol3 (Loom)No lost/doubled CQEs; fail-closed on corruption under all interleavingsExtract protocol into Loom-testable wrapper
6AddressSpace user-buffer helpers4 (Kani)Every accepted buffer lies entirely in user half with correct PTE flags, and validation/use happens under the address-space lockFormalise PTE and locking model
7Ring dispatch path3 (Loom + proptest)SQE poll is bounded per tick; no allocation on the dispatch pathInitial alloc-free synchronous path landed; async transfer/release paths still need coverage
8IPC routing3Capabilities transferred exactly once; no duplication under direct-switchCapability transfer
9Direct-switch IPC handoff2 + 3Scheduler invariants preserved when a blocked receiver bypasses normal run-queue orderLoom-testable scheduler/ring model
10SMEP/SMAP + user access windows1 + QEMU integrationKernel cannot execute user pages; direct user-memory touches either use audited access windows or the AddressSpace/HHDM copy pathWire existing x86_64 helper into init path
11Manifest authority graph2 (property tests)Every granted cap source resolves, every export is unique, and no service starts after a rejected graphManifest executor path
12Resource accounting2 + 3Rings, endpoints, cap tables, scratch arenas, frames, and CPU budget fail closed under exhaustionSecurity Verification Track S.9 design complete; implementation hooks pending
13Build/codegen TCB1Bootloader/deps/codegen inputs are pinned and generated output changes are review-visibleCI bootstrap
14Device DMA boundary (future)1 + design reviewNo driver or device can DMA outside explicitly granted buffersPCI/device work; IOMMU or bounce-buffer decision

Targets 1–4 are feasible today and should be the first batch of work. Target 10 is the security gate before treating Stage 6 services as untrusted. Targets 11–12 should be designed before capability transfer lands, otherwise the first IPC implementation will bake in ambient resource authority. Target 14 gates user-mode or semi-trusted drivers.

Current status as of 2026-05-16:

  • Targets 1–2 are part of the completed Verified Core visible milestone: commit d43b691 at 2026-04-23 22:09 UTC made make kani-lib the bounded local/GitHub proof gate for cap-table and frame-bitmap invariants, and commit c5968ee at 2026-04-23 22:12 UTC recorded the high-memory make kani-lib-full Cloud Build gate.
  • Target 3 has arbitrary-input proptest coverage and a cargo-fuzz target for ELF bytes. The current Kani harness still only proves the short-input early-reject path because fully symbolic ELF parsing reaches allocator and sort internals before there is a sharper proof obligation.
  • Target 4 has cargo-fuzz coverage for manifest decoding/roundtrip and mkmanifest exported-JSON conversion.
  • Target 5 has a feature-gated Loom model for the shared ring protocol.
  • Target 13 has an initial CI baseline plus generated-code drift checking, dependency audit/deny gates, and required QEMU boot still open. Remaining supply-chain provenance work is tracked by docs/tasks/trusted-build-inputs-pr-blocking-provenance.md; panic-surface hardening remains tracked by its owning task records across IPC/scheduler guarded unwraps, rollback restoration, stale queues, blocking waits, process/thread exit, endpoint cancellation, TLB shootdown send failures, and scheduler hot-path expects. Scheduler hot-path panic surface fully closed (2026-05-17, REVIEW_FINDINGS commit 1b295cb3): all .expect() / .unwrap() in block_current_on_cap_enter, next_start_context, schedule, exit_current, exit_current_thread, capos_block_current_syscall, and retain_endpoint_queue hardened per the established let-else + log + drop-lock + hcf() / return None / break pattern (per-function closures at 7f86796f / 777e0b3a / 0af439d4 / 7d93aea4 / b04d6d65 / 2bea189c).
  • Out-of-band scheduler/runtime hazards tracked in review-finding task records but not yet expressed as Concrete Verification Targets above: current post-AP kernel upper-half page-table mutation through the MMIO/firmware helper path is closed by kernel-wide TLB shootdown plus preseed/fail-closed PML4-slot handling (../tasks/done/2026-06-07/kernel-upper-half-pml4-propagation-hardening.md); future helper windows or allocator-growth paths that need a new kernel-half PML4 slot still require boot preseed or synchronized live-root propagation. ParkSpace unmap/reuse cleanup still owes shared park-word cleanup and address-space generation cleanup; resource quota fields for scratch bytes, outstanding calls, endpoint queues, and in-flight calls need real wiring or removal. Each is owned by its respective subsystem proposal; the consolidated routing index lives in docs/design-risks-register.md.

6. Security Verification Track Registry

The S.x labels are registry identifiers for this proposal’s security-verification track. They are not product stages and should be expanded as “Security Verification Track S.x” when cited outside this proposal.

TrackNameStatusPrimary document or evidence
S.1CI bootstrapLanded 2026-04-21.github/workflows/ci.yml
S.2Miri + proptest on capos-libLanded 2026-04-21cargo test-lib, cargo miri-lib
S.3Manifest + mkmanifest fuzzingLanded 2026-04-21fuzz/ manifest and mkmanifest targets
S.4Ring Loom harnessLanded 2026-04-21capos-config/tests/ring_loom.rs
S.5Kani on capos-libInitial landed 2026-04-21, expanded bounded gate landed 2026-04-23make kani-lib
S.6Security review docs stay alignedOngoingREVIEW.md, CLAUDE.md
S.7Stage-6-aware refreshPlanned/ongoingTrust-boundary inventory after Stage 6 changes
S.8Untrusted-service hardening gatePlannedSMEP/SMAP, user access windows, hostile-userspace tests
S.9Authority graph and resource accountingLanded 2026-04-21docs/authority-accounting-transfer-design.md
S.10Supply-chain and generated-code TCBPartially landeddocs/trusted-build-inputs.md
S.11Device/DMA isolation gateDesign accepted; brokered-bounce DDF production authority gates landed for the current local/GCE path, while direct-remapping and hostile-hardware claims remain futuredocs/dma-isolation-design.md
S.12Kani harness bounds refreshPlannedFuture transfer/accounting/user-buffer proof obligations
S.13ELF parser arbitrary-input coverageLandedcapos-lib::elf::parse, fuzz/fuzz_targets/elf_parse.rs
S.14Telnet IAC filter fuzz coverageLanded 2026-04-27 16:33 EESTcapos-lib::telnet, fuzz/fuzz_targets/telnet_filter.rs
S.15Telnet differential round-trip + line-discipline extractionLanded 2026-04-27 17:18 EESTcapos-lib::line_discipline, Telnet round-trip fuzz target
S.16Ring SQE wire-validation extraction + fuzz targetLanded 2026-04-27 19:42 EESTcapos_config::ring::sqe_wire_validation_error, fuzz/fuzz_targets/sqe_validation.rs
S.17Sanitizers on host testsASan landed (zero findings); TSan blocked upstreammake sanitizer-host-tests / make sanitizer-host-tests-tsan

Track Details

This slots into docs/tasks/README.md as a cross-cutting track rather than a phase – items are independent of Stage 6 IPC and can proceed in parallel.

Subtracks are scoped identifiers under their parent track:

SubtrackParentNamePrimary document or evidence
S.10.0S.10Trusted build input inventorydocs/trusted-build-inputs.md
S.10.2S.10Generated-code drift checkmake generated-code-check
S.10.3S.10Dependency policy and no_std review gatemake dependency-policy-check, deny.toml
S.11.1S.11DMA capability invariantsdocs/dma-isolation-design.md
S.11.2S.11Userspace-driver ownership-transition gatedocs/dma-isolation-design.md

Security Verification Track S.11.2 defines checklist rows S.11.2.0 through S.11.2.9 in docs/dma-isolation-design.md; those row labels are local acceptance criteria for the userspace-driver transition, not independent registry tracks.

  • Track S.1 – CI bootstrap – landed 2026-04-21
    • .github/workflows/ci.yml: fmt-check, test-config, test-ring-loom, test-lib, test-mkmanifest, cargo build --features qemu, make capos-rt-check, generated-code drift checking, and dependency policy checking.
    • QEMU smoke installs build-essential, capnproto, qemu-system-x86, xorriso, and cue v0.16.0 before running make and make run; it remains optional/non-blocking until boot runtime is stable enough to make it a required gate.
    • Clippy-with-deny and cargo-geiger remain future hardening jobs.
  • Track S.2 – Miri + proptest on capos-lib – landed 2026-04-21
    • Add proptest dev-dependency to capos-lib.
    • Host properties for capos-lib::cap_table and capos-lib::frame_bitmap; ELF arbitrary-input coverage is tracked separately under landed Security Verification Track S.13.
    • cargo test-lib runs the native host suite; cargo miri-lib runs the same crate under Miri.
  • Track S.3 – Manifest + mkmanifest fuzzing – landed 2026-04-21
    • fuzz/ crate with harnesses for manifest::decode and tools/mkmanifest CUE → capnp pipeline. Seed corpus checked in.
  • Track S.4 – Ring Loom harness – landed 2026-04-21
    • Extract the SPSC protocol from capos-config::ring into a test-only wrapper where atomics are swappable.
    • Loom tests covering corruption, overflow, and ordering.
    • Doubles as regression coverage for Phase 1.5 in docs/tasks/README.md.
  • Track S.5 – Kani on capos-lib – initial harnesses landed 2026-04-21, expanded bounded gate landed 2026-04-23
    • CapTable generation/index/stale-reference invariants.
    • FrameBitmap fail-closed free-error behavior plus a concrete bounded contiguous-allocation proof.
    • Transfer/resource-accounting fail-closed invariants for cap-slot preflight, frame-grant reservation, invalid transfer-origin rejection, move-reservation rollback after revocation, source visibility/accounting after the real prepare_copy_transfer path, and provisional destination cap-slot/frame-grant ledger restoration.
    • Propagation of real prepared transfer metadata into a provisional destination slot is reserved for make kani-lib-full; Google Cloud Build run 95b49620-06a5-49f4-85e6-782adb82d11c passed this high-memory gate on 2026-04-23.
    • ELF parser short-input early-reject panic-freedom exists as a targeted Kani harness but is not part of the mandatory bounded gate.
    • The current bounds are intentionally conservative so make kani-lib remains a practical local/GitHub CI gate; broader symbolic ELF and contiguous-allocation proofs should wait for more specific invariants or high-memory runners.
  • Track S.6 – Security review docs stay aligned
    • Keep REVIEW.md’s common security checklist aligned with §4’s boundary prompts as new boundaries land.
    • Add a “threat model refresh” step to the stage-completion workflow in CLAUDE.md.
  • Track S.7 – Stage-6-aware refresh
    • Re-run §2 trust-boundary inventory after capability transfer/release semantics land.
    • Plan Loom coverage for cross-process routing and direct-switch IPC.
    • Carry the inventory through the active scheduler-evolution phases (Phase D WFQ, Phase E SchedulingContext, Phase F one-SQ-consumer and nohz telemetry) and the WASI host-adapter surface (Phase W.4 entropy production wiring + per-instance argv text grant) so each new boundary is reflected in §2 before it can be relied on. The WASI host adapter is a userspace trust boundary – wasmi sandbox around untrusted Preview 1 payloads with per-instance EntropySource / argv grants – that the Tier 2/3 plan should explicitly cover as new harness targets emerge (see docs/proposals/wasi-host-adapter-proposal.md).
    • The Phase 1 monitoring log surface (LogSink/LogReader, kernel/src/cap/log.rs) is a new kernel boundary: a LogSink accepts bounded userspace-supplied records (decoded, length-truncated, severity- filtered against SystemConfig.logLevel) into a bounded drop-oldest ring, and a scoped LogReader serves cursor/filtered snapshots. It confers no transfer/grant authority beyond the scoped sink/reader and adds no ambient log namespace. Carry it in §2 before downstream services rely on it; per-process log token-bucket backpressure remains future work (docs/proposals/system-monitoring-proposal.md).
  • Track S.8 – Untrusted-service hardening gate
    • Wire SMEP/SMAP enablement into x86_64 init after paging is live.
    • Replace raw user-slice construction in syscall/ring paths with checked copy/access helpers that bracket the actual access with STAC/CLAC.
    • Add QEMU hostile-userspace tests for bad pointers, kernel-half pointers, invalid caps, corrupted rings, and services without Console authority.
    • Audit untrusted-input paths for panics before Stage 6 endpoints run mutually-untrusting processes.
  • Track S.9 – Authority graph and resource accounting – landed 2026-04-21
    • Concrete design is captured in docs/authority-accounting-transfer-design.md.
    • Defines authority graph invariants, per-process quota ledger (cap slots, endpoint queue, outstanding calls, scratch, frame grants, log volume, CPU budget), diagnostic aggregation, and exactly-once transfer/rollback semantics.
    • Establishes acceptance criteria that gate capability transfer and ProcessSpawner implementation. Current follow-up items live in docs/backlog/stage-6-capability-semantics.md.
  • Track S.10 – Supply-chain and generated-code TCB
    • Pin Limine and other external build inputs by revision/checksum rather than branch name.
    • Make capnp generated-code changes review-visible in CI, including the no-std patching step.
    • Consider cargo-vet only after cargo-deny/cargo-audit are in place; vetting too early is process theater.
    • Security Verification Track S.10.3 adds a concrete dependency policy: no_std additions are accepted only with class attribution, cargo deny + cargo audit, and explicit lockfile intent.
    • Security Verification Track S.10.3 enforcement is make dependency-policy-check, backed by deny.toml and pinned CI installs of cargo-deny 0.19.4 and cargo-audit 0.22.1.
  • Track S.11 – Device/DMA isolation gate
    • The DMA isolation story is now runtime-selected and fail-closed: guest-programmable remapping only when capOS can discover, program, and validate it; otherwise labeled brokered bounce buffers or unsupported.
    • DMAPool, DeviceMmio, and Interrupt invariants are represented by done task evidence for bounded physical/device-visible ranges, explicit interrupt ownership, reset/release teardown, generation checks, and no raw host-physical grants to untrusted drivers.
    • The current GCP/no-IOMMU userspace-provider path is brokered bounce-buffer authority. It supports the proved virtio-net and NVMe provider chains without claiming direct DMA, IOVA export, hostile bus-master isolation, or device-autonomous MSI-X delivery.
    • The DDF production-authority closeout closes the retained review finding for the current brokered-bounce provider path. Security Verification Track S.11.2 remains the canonical matrix for future direct-remapping/vIOMMU, hostile-hardware isolation, and broader device-owner claims.
  • Track S.12 – Kani harness bounds refresh
    • Revisit Kani bounds and harness shape once capability transfer, resource-accounting, or AddressSpace user-buffer helpers expose concrete proof obligations.
    • Prefer actionably narrow properties over arbitrary symbolic parser exploration that spends verifier time in allocator or sort internals.
  • Track S.13 – ELF parser arbitrary-input coverage – landed
    • capos-lib::elf::parse has proptest coverage for arbitrary bytes and valid-header perturbations.
    • fuzz/fuzz_targets/elf_parse.rs exercises ELF bytes through cargo-fuzz.
  • Track S.14 – Telnet IAC filter fuzz coverage – landed 2026-04-27 16:33 EEST
    • Extract the kernel’s TelnetFilter byte-stream parser into capos-lib::telnet so it is host-fuzzable and survives the Phase C move of Telnet framing into userspace per docs/proposals/networking-proposal.md.
    • Add fuzz/fuzz_targets/telnet_filter.rs with structural assertions (Normal must pass non-IAC bytes through unchanged; AfterIac is the only state allowed to emit a 0xFF; emitted byte count never exceeds input length).
    • Wired into make fuzz-build and make fuzz-smoke.
  • Track S.15 – Telnet differential round-trip + line-discipline extraction – landed 2026-04-27 17:18 EEST
    • Add fuzz/fuzz_targets/telnet_filter_roundtrip.rs: synthesize arbitrary RFC 854 event streams from fuzzer bytes, encode to wire, run through TelnetFilter, assert output equals the concatenation of Data(_) payloads. Found a real EXOPL handling bug – the option byte right after IAC SB was being mis-parsed as the start of an IAC IAC escape when its value was 0xFF, leaving the filter stuck in subnegotiation and silently dropping all subsequent data. Fixed via a new AfterSb state that consumes the option byte unconditionally; pinned by a regression test in capos-lib::telnet.
    • Extract the cooked-mode line discipline from kernel::cap::network into capos_lib::line_discipline::LineDiscipline, returning LineStep { outcome, echo } so all socket I/O stays at the caller. Add fuzz/fuzz_targets/line_discipline.rs with structural invariants (line_len <= max_bytes; ±1 line_len delta per Pending step; Cancelled clears; Echo::Byte/Backspace iff buffer grew/shrank by exactly one).
    • Future follow-up: differential against an external Telnet library (libtelnet C or Rust port) to catch RFC conformance bugs the structural targets cannot express.
  • Track S.16 – Ring SQE wire-validation extraction + fuzz target – landed 2026-04-27 19:42 EEST
    • Closes the original three-parser fuzz plan (elf::parse, manifest::decode, ring SQE decoder). Lifts the per-opcode *_sqe_has_unsupported_fields predicates from kernel/src/cap/ring.rs into capos_config::ring, exposes a unified sqe_wire_validation_error(&CapSqe) -> Result<(), i32> entry point, and reroutes the kernel through the shared functions so the kernel-host pair has one source of truth for ABI rules.
    • Add fuzz/fuzz_targets/sqe_validation.rs: cast arbitrary 64 bytes to CapSqe, run sqe_wire_validation_error and the matching per-opcode predicate, assert determinism, opcode-classification consistency (CAP_OP_FINISH -> CAP_ERR_UNSUPPORTED_OPCODE, unknown opcodes -> CAP_ERR_INVALID_REQUEST), and that the unified validator never disagrees with the predicate it dispatches to. Wired into make fuzz-build / fuzz-smoke.
    • Add 12 host unit tests in capos_config::ring covering the classification rules each opcode imposes (THREAD_OWNED + call_id pairing on CALL/PARK, RETURN’s APPLICATION_EXCEPTION flag, CANCEL’s required pipeline_dep target, NOP’s reserved-fields-zero rule, PARK_BENCH’s required addr).
    • The structural fuzz target pins arbitrary-byte behavior. The follow-up well-formed SQE generator oracle landed on 2026-06-06: the test/fuzz-only sqe-validation-oracle feature exposes capos_config::ring::sqe_oracle, which generates validator-accepted SQEs for each accepted opcode and one-field rejecting mutations, and fuzz/fuzz_targets/sqe_validation.rs runs that oracle on each input. This is a shared wire-validator oracle only; it does not claim cap-table lookup, userspace pointer mapping, transfer-descriptor loading, or full kernel ring semantic coverage. A future differential against an independent reference predicate remains a possible stronger disagreement oracle.
  • Track S.17 – Sanitizers on host tests – ASan landed; TSan blocked upstream
    • make sanitizer-host-tests runs RUSTFLAGS=-Zsanitizer=address over the capos-lib and capos-config host suites (crate set / features mirror the test-lib / test-config aliases) on the repo-pinned nightly + host target. It is a focused gate, not part of make check, mirroring dependency-policy-check / sdk-publish-dry-run. Outcome so far: zero findings; both suites pass clean, including the named unsafe suspects (FrameBitmap slot indexing, CapTable generation counters, lazy_buffer raw &mut [u8]). The §Tier 2 “cheap to add” claim holds for ASan, which needs no -Zbuild-std.
    • make sanitizer-host-tests-tsan is wired but currently blocked by an upstream cargo limitation, not a capOS defect. TSan changes the crate ABI, so rustc refuses to link sanitized code against the uninstrumented precompiled std; instrumenting std needs -Zbuild-std, which fails with duplicate core lang items for build-script-bearing dependencies (typenum / libc / cfg-if / subtle) when the sanitizer target equals the host triple. The exact reproduction (four attempted workarounds) is recorded in docs/backlog/security-verification.md Track S.17. Concurrency invariants are meanwhile covered by the dedicated Loom model (cargo test-ring-loom).
    • Done means: the ASan gate exists, runs under nightly, and any findings either land as fixes or get a documented disposition; the TSan target starts passing once the upstream -Zbuild-std + build-script issue is fixed.

Security Verification Tracks S.1 through S.5 have initial coverage. Track S.6 is ongoing doc hygiene and should move with review-process changes. Track S.8 must land before Stage 6 runs mutually-untrusting services. Track S.9 design is complete and now gates concrete implementation work in 3.6/5.2. Track S.11 gates device-driver work. Track S.12 should not expand bounds for their own sake; it is a refresh point when new kernel invariants make better proof targets available. Track S.13 closes the remaining target-3 gap from the table above.

7. What This Proposal Does Not Promise

  • No claim that capOS will be “secure” at the end. It will be harder to write a silently wrong change to the code paths the tooling covers, and it will be easier to find the ones that are still wrong.
  • No proof obligation on every PR. Kani and Loom are expensive to run on every push; CI runs them on a reduced schedule (e.g. nightly, or on PRs that touch the covered crates).
  • Userspace and host-tool bugs are in scope, but their impact is classified by boundary. A userspace bug should not compromise kernel isolation; a host-tool bug can still compromise the build TCB or developer/CI filesystem.
  • No claim that confidentiality is handled beyond architectural isolation. Timing channels, cache side channels, device side channels, and covert channels through shared services remain explicit research topics, not current implementation goals.

8. Relation to Other Docs

  • docs/research/sel4.md §1 and §6.1 already make the case that full verification is not the right goal. This proposal is the operational answer.
  • REVIEW.md is the reviewer’s rulebook. This proposal explains the security and verification rationale behind its common checklist and per-boundary prompts.
  • docs/tasks/** is the open-issue ledger. This proposal feeds it – every bug found by Tier 2/3/4 tooling gets a task record unless fixed in the same change.
  • docs/roadmap.md owns the stages; this proposal does not add stages, only a cross-cutting track that runs alongside them.
  • Task records under docs/tasks/ own concrete ordering; Security Verification Tracks S.1–S.17 above are mirrored there when they are actionable slices.
  • docs/design-risks-register.md is the consolidated index of long-horizon design risks and open architectural questions; consult it when this proposal’s open gaps reference a hazard whose primary owner lives in a subsystem proposal, backlog, or design file rather than here.

DMA Assurance Model

Current DMA authority and isolation design authority lives in DMA Isolation. This proposal defines the accepted evidence model and is retained as the grounding record for DMA proof obligations.

The DMA assurance model is the evidence scaffold for moving capOS from bounded QEMU-local provider proofs toward cloud and production device-driver claims. It does not select a cloud DMA backend. It defines the claims that a backend must prove, the model objects those claims refer to, and the tools that should check each claim before a driver slice can cite it.

The immediate use is the cloud DMA backend decision: direct DMA through a reviewed remapping domain, labeled bounce buffers, or unsupported. The binding choice and any per-VM-shape safety claim remain attended decisions.

Claim Boundary

The model is about DMA authority, not whole-kernel correctness.

In scope:

  • ownership of Device, DMAPool, DMABuffer, Page, IommuDomain, Iova, descriptor, completion, and interrupt-route state;
  • lifecycle transitions from allocation through mapping, publication, completion, revocation, invalidation, scrub, and reuse;
  • stale handle, stale completion, revoke/reset race, teardown-under-DMA, no-host-physical-exposure, and cross-domain aliasing claims;
  • the evidence split between IOMMU-backed direct DMA and labeled bounce-buffer fallback.

Out of scope:

  • proving all kernel behavior;
  • proving cloud-provider hardware facts without attended evidence;
  • treating QEMU Intel VT-d evidence as general hardware evidence;
  • creating a new prover or proof kernel.

The capOS-specific layer may become a DSL later, but it must emit to mature checkers or proof assistants. A self-authenticating capOS prover would increase the trusted base and is not part of this plan.

Model Objects

The abstract model uses these terms consistently across docs, model files, and future proof harnesses:

ObjectMeaning
DevicePCI function or provider device that can issue DMA.
IommuDomainDevice-manager-owned translation context or trusted sharing group.
DMAPoolCapability-scoped allocation authority for DMA buffers.
DMABufferLive buffer handle with owner, slot, generation, and mapping state.
PagePhysical backing page owned by the device manager or held fail-closed.
IovaDevice-visible address meaningful only inside one domain.
DescriptorDevice-visible command referencing a live buffer generation.
CompletionDevice or software observation that a descriptor finished.
IrqRouteInterrupt source, route generation, waiter, mask, and ack state.

The first model files live under models/dma/. They are small by design: reviewers should be able to read the whole state machine and tell whether it matches the DMA design before any checker is involved.

Required Invariants

InvariantRequired meaning
No host-physical exposureResult caps, diagnostics, audit, and cloud evidence never expose a host physical address to a driver. IOMMU-backed paths may expose only a domain-scoped IOVA labeled with its domain.
Mapping before publicationA descriptor cannot become device-visible until the backing buffer is live, owned by the device manager, and either mapped in the selected IOMMU domain or copied through the selected bounce-buffer path.
No page reuse before teardownA DMA page cannot return to the general free pool until submissions are stopped, in-flight descriptors are drained or invalidated, mappings are removed, required invalidations complete, and the page is scrubbed.
Stale handles fail closedA stale pool, buffer, slot, page, source, route, or generation cannot create a new side effect.
Stale completions fail closedA completion whose descriptor, buffer, slot, page, owner, or generation no longer matches cannot publish CQ state, ack IRQ state, free pages, or reuse buffers.
Domain-scoped aliasing onlyThe same IOVA may be reused in different domains, but one domain cannot map the same IOVA to two pages unless an explicit trusted sharing group model permits it.
Fail-closed leaks are boundedIf teardown cannot prove that hardware can no longer reach a page, the page or pool may be held, but that hold must be accounted, bounded, and surfaced as a remediation item.
Backend evidence is explicitDirect DMA requires remapping-domain evidence. Bounce-buffer fallback must stay labeled as not hostile-hardware isolation. Unsupported devices stay disabled.

Tool Mapping

The assurance model intentionally uses several narrow tools instead of one large proof.

ToolcapOS use
TLA+ / TLCModel lifecycle ordering and races: allocate, map, publish, complete, revoke, flush, scrub, reuse, reset, and fail-closed hold. The v0 skeleton is models/dma/dma_authority.tla.
AlloyModel the relational authority graph: device, domain, IOVA, page, owner, and alias constraints. The v0 skeleton is models/dma/dma_authority.als.
KaniProve pure Rust validators and accounting helpers once they are extracted into host-checkable code: generation matching, budget arithmetic, stale rejection, and fail-closed transitions.
LoomCover concurrency-sensitive state that depends on atomics, queues, or multi-CPU ordering. The first target was the DeferredCompletionQueue / TLB-shootdown model gap now recorded in docs/tasks/done/2026-06-04/dma-assurance-model-deferred-completion-loom.md.
VerusCandidate later tool for small critical Rust cores that need unbounded functional contracts and are stable enough to justify annotation cost.
HAMR / MicrokitReference architecture for static component contracts and traceability, not a replacement runtime for capOS. Useful for comparing device-manager and driver partitioning assumptions.

Do not claim a checked model result merely because the files exist. A checked claim requires recording the exact tool, version, configuration, model bounds, and command output in the task evidence.

V0 Gate

dma-assurance-model-v0 is complete when:

  • this proposal defines the model objects, invariants, tool mapping, and claim boundaries;
  • models/dma/ contains inspectable TLA+ and Alloy skeletons for the lifecycle and authority graph;
  • the cloud DMA backend draft task depends on this model before it can be promoted beyond proposal text;
  • the verification workflow names these model files as planned design evidence, while making clear that no required checker gate exists yet;
  • docs workflow and diff hygiene pass.

Future slices should add actual checker commands only after the repo has pinned tool installation and run targets. Suggested future targets are make model-dma-tla, make model-dma-alloy, make kani-dma-authority, and a focused Loom target for DeferredCompletionQueue.

V1 Operationalization

dma-assurance-model-operationalization (2026-06-04) reconciles the v0 skeletons with the DMA authority code that landed after them and emits the checker tracks as concrete task records, so the work cannot be silently parked again. The reconciliation gap table — which invariants the skeletons already capture and which landed-since invariants are MISSING — is recorded in models/dma/README.md and grounded against named symbols in kernel/src/device_dma.rs, kernel/src/cap/dma_buffer.rs, kernel/src/device_manager/stub.rs, kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs, and kernel/src/arch/x86_64/tlb.rs.

Landed-since invariants MISSING from the v0 skeletons: ownership-generation bump on recycle, map-record-before-PTE-install ordering, drive-pin/quarantine, the queue-enable epoch fence, and the deferred-EOI / completion-queue concurrency. Each is owned by an emitted checker slice (each names its make target, pinned tool + version, model bounds, and the exact invariant it checks, and each must record checked output per the anti-overclaim rule above):

Trackmake targetToolSlice
Lifecycle ordering + generation + stale-completionmake model-dma-tlaTLA+/TLC (pinned; TLC-pin owner shared with the scheduler/IRQ model tracks)dma-assurance-model-tla-checked-gate (done 2026-06-04, checked clean at 2/2/2/2, gen 0..1)
Device/domain/IOVA/page/alias authority graph + generationmake model-dma-alloyAlloy (pinned 6.2.0; Alloy-pin owner)dma-assurance-model-alloy-checked-gate (done 2026-06-04, checked for 4)
Extracted pure ownership-generation / stale-handle / no-re-expose coremake kani-dma-authorityKani (pinned 0.67.0, kani-lib style)dma-assurance-model-kani-authority-core (done 2026-06-04, 3 harnesses checked over capos_lib::dma_authority)
Deferred-EOI / completion-queue concurrencynew Loom targetLoom (test-ring-loom sibling)dma-assurance-model-deferred-completion-loom
CI wiring (make check / GitHub gate) + cite checked evidence(wiring only)dma-assurance-model-ci-wiring (done 2026-06-05)

Cloud Backend Use

The cloud backend draft must cite this model and fill an evidence matrix for each backend candidate:

CandidateRequired evidence before sign-off
Direct remapping domainCloud VM shape exposes guest-programmable remapping hardware; capOS can discover and program it; descriptor publication is ordered after mapping; teardown removes mappings and observes required invalidations before page reuse; hostile stale-DMA and stale-completion smokes cover the selected path.
Labeled bounce-buffer fallbackDirect DMA remains blocked; all device-visible addresses are manager-owned bounce pages; no host physical address is exposed; stale handle/completion/teardown evidence covers the selected fallback; documentation states that hostile bus-mastering hardware isolation is not claimed.
UnsupportedDevice remains disabled or unbound; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made.

The matrix must distinguish provider-side isolation facts from guest-controlled isolation facts. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support is evidence that a VM exposes DMA-capable device surfaces, but it is not direct remapping evidence unless the guest also exposes an IOMMU or equivalent translation authority that capOS can program. Each VM-shape row should record the provider, region or zone, instance type, image and kernel, provider API or documentation source and date, live guest probe output, visible PCI/device drivers, visible IOMMU tables or groups, maintenance/revocation behavior, and the resulting backend classification.

The matrix is a support-policy input, not a hardcoded boot oracle. capOS should infer the safest available backend at runtime from the device inventory, remapping authority it can actually program, driver self-tests, and fail-closed probe results. Unknown or contradictory observations select Unsupported, not direct DMA. Provider evidence remains necessary for VM shapes the project wants to advertise as supported, because a guest probe cannot fully prove host-side provider isolation or maintenance behavior.

The matrix is an input to attended sign-off. It is not itself the sign-off.

Design Grounding

Proposal: Device Manager Refactor

Before the current module split, kernel/src/device_manager.rs was the convergence point for production device authority, transitional userspace cap surfaces, QEMU proof harnesses, audit labels, DMA/MMIO/IRQ policy, and serialization checks. That shape was useful while the Device Driver Foundation gate was evolving, but it hid the target userspace-driver model behind a single large file.

The refactor should keep the kernel device manager as the authoritative ownership ledger for claimed devices while separating proof scaffolding and domain-specific record logic into clearer modules. It must not weaken the single ownership transaction across DMAPool, DMABuffer, DeviceMmio, and Interrupt.

Implementation Status

The first mechanical proof split landed at 99c37592 (refactor(kernel): split device-manager proof scaffolding). Current main keeps public proof wrapper functions in kernel/src/device_manager/mod.rs for existing virtio.rs call sites, with the moved proof scaffolding in kernel/src/device_manager/proofs.rs.

Later mechanical slices split handles/errors and domain record helpers into the current kernel/src/device_manager/ module tree. The transaction-helper cleanup also landed at 98dddb72 (device_manager: share authority admission helpers), with the aggregate PciDeviceRecord still serving as the single claimed-device ledger and existing proof/audit labels preserved.

Design Grounding

  • DMA Isolation Design requires one device-manager ledger of record for each claimed device before userspace NIC or block drivers receive hardware authority.
  • Service Architecture makes init the holder of DeviceManager and ProcessSpawner, with child hardware drivers receiving only scoped device caps.
  • Networking defines the target NIC split: a userspace NIC driver holds DeviceMmio, Interrupt, and DMAPool, then exports a Nic cap to a separate network stack.
  • Device Driver Foundation is the active implementation track for the hardware authority gates that make this refactor useful. The plan explicitly schedules this refactor as high-priority DDF risk reduction subordinate to behavior-moving authority slices, and any further split slice must be mechanical, behavior-preserving, and reduce review risk for upcoming DeviceMmio, Interrupt, or DMAPool authority work.
  • Pass AttachedDmaPoolRecord by reference is the ready DDF prerequisite that converts the device-manager ledger record from by-value to by-reference threading through the proof emission paths. The current by-value layout exhausted the BSP boot stack when the inline AttachedDmaPoolRecord::proof_buffers slot count was grown past three; the by-reference conversion unlocks further provider-TX descriptor concurrency without expanding the per-frame footprint of nested proof emissions.

No external prior-art report is required for the initial split: this is a repo-local maintainability refactor that preserves the existing accepted authority model rather than selecting a new OS design.

Module Shape

The accepted current shape has converted kernel/src/device_manager.rs into a kernel/src/device_manager/ module tree:

kernel/src/device_manager/
  mod.rs          public API, re-exports, lock-order notes
  handles.rs      BDF, owner/state, and handle structs
  error.rs        production DeviceManagerError and display helpers
  mmio.rs         DeviceMmio records and map/unmap/read/write admission
  dma_pool.rs     DMAPool records, accounting, budget, teardown evidence
  dma_buffer.rs   DMABuffer records, map/free/submit/complete admission
  interrupt.rs    interrupt records and route/wait/ack/mask/unmask admission
  proofs.rs       transitional QEMU proof entry points and proof logs

Future cleanup is limited to optional registry, ledger, or proof-internal splits if they reduce review risk for upcoming DDF work. The current accepted proof split is proofs.rs, not a proofs/ directory.

PciDeviceRecord should remain the aggregate owner of a claimed device’s ledger. The split should move record-specific logic behind modules, not create independent managers that can diverge during teardown.

Follow-Up Risk Reduction

Two adjacent tracks reduce review risk for further DDF proof growth without disturbing the accepted module shape:

  • The by-reference ledger-record threading prerequisite tracked in docs/tasks/ddf-attached-dmapool-record-by-ref.md converts AttachedDmaPoolRecord from by-value to by-reference through the affected proof emission paths so that growing inline proof slot counts no longer multiplies cumulative stack frames across nested proof calls.
  • The scheduler off-stack release work that landed under d322a78f (sched: make thread stack drops off-stack explicit) and 9b94ea7f (sched: release qemu proof stacks off-stack) already pulls QEMU release-proof process kernel stacks off the dropping thread, which removes one stack-pressure axis on the BSP boot path that previously interacted with the device-manager proof emissions.

Refactor Strategy

  1. Split proof code first. Landed at 99c37592. prove_qemu_*, proof log structs, proof-only error enums, and bounded proof helper functions moved into a proof module. Wrapper functions remain in mod.rs so current virtio.rs call sites do not churn.

  2. Split handles and errors. Landed at 734383f9. PciBdf, owner/state enums, handle structs, DeviceMmioRegion, and DeviceManagerError moved into dedicated modules.

  3. Split record domains. Landed at af539f6c. MMIO, DMA pool, DMA buffer, and interrupt attached-record logic moved into domain modules while PciDeviceRecord remains the aggregate ledger owner.

  4. Preserve one authoritative ledger. Every operation that creates, consumes, or releases device-visible authority must still update the claimed-device ledger as part of the same ownership transaction that changes device-manager state.

  5. Improve internal APIs after the split. Landed at 98dddb72. Narrow transaction helpers and typed admission contexts now remove repeated stale-handle, owner, generation, state, and attached-record checks while preserving the single aggregate ledger.

Constraints

  • Preserve the existing lock order: PCI_DEVICE_MANAGER before DEVICE_INTERRUPT_ROUTES.
  • Preserve cap semantics, audit labels, proof labels, and QEMU smoke output during the initial split.
  • Keep userspace-driver authority blocked until the Device Driver Foundation gates still marked open are closed.
  • Avoid broad call-site churn. Compatibility wrappers are acceptable during the mechanical phase.
  • Do not move authority decisions into userspace. Userspace drivers receive scoped caps, but the kernel remains the ledger and enforcement point.
  • Keep proof code available until userspace-driver production gates have equivalent coverage.

Validation

For mechanical file movement, run:

  • make fmt-check
  • cargo build --features qemu
  • make workflow-check

When a slice moves code that emits or validates device proof labels, also run the affected QEMU gates:

  • make run-net
  • make run-ddf-provider-consumer
  • make run-devicemmio-grant
  • make run-dmapool-grant
  • make run-interrupt-grant
  • relevant make run-hardware-audit* targets when audit or proof labels move

Choose those gates from the moved authority surface, not from the file move alone:

  • proof-log or proof-label movement needs the QEMU target that asserts those exact proof lines;
  • grant-source or cap-object movement needs the matching run-*-grant target plus any parent lifecycle target it depends on;
  • audit emission, snapshot decode, or audit-label movement needs the matching run-hardware-audit* target;
  • DMABuffer, DeviceMmio, Interrupt, selected provider TX, proof labels, or schema-comment movement for those surfaces needs make run-ddf-provider-consumer;
  • MMIO, DMA, IRQ, or teardown transaction movement needs the focused grant target and the broader device proof target that exercises stale handle, revoke/reset, or release behavior;
  • pure type, handle, or error-module movement may stop at make fmt-check, cargo build --features qemu, and make workflow-check only when the public diff leaves proof labels, grant behavior, and authority transactions unchanged.

Success Criteria

  • device_manager is a module tree rather than one monolithic source file.
  • Production authority paths are visibly separated from QEMU proof scaffolding.
  • Public behavior and existing proof/audit labels are unchanged by the initial split.
  • The module boundaries match the target userspace-driver design: kernel code owns claim, revoke, teardown, MMIO, DMA, and IRQ authority; userspace drivers consume only scoped capabilities.

Cloud Driver Foundation: Gap Analysis

Premise Correction

A prior framing held that “capOS has no userspace device-driver foundation.” That is wrong. The userspace virtio driver foundation exists and is proven in QEMU across a month of landed DDF work. This document establishes precisely what the foundation covers and reduces each blocked cloud-driver task to its narrow real remaining gap, so no one re-implements a foundation that already exists.

What The Foundation Already Provides (proven, in docs/tasks/done/)

  • Device-agnostic virtio DMA/notify seam + relocated queue/discovery (ddf-virtio-driver-foundation-boundary, 2026-05-25). The split-ring Virtqueue and discover_modern_transport live in kernel/src/virtio.rs mod transport, driven through the VirtqueueDma seam (preflight/register/allocate/free/record-submission/record-completion over the device_dma ledger). virtio-net is one caller of the seam, not the only possible caller – a non-net virtio device (e.g. virtio-blk) can drive the same bounded ledger semantics. Proofs: make run-net, make run-ddf-provider-consumer.
  • Userspace provider owns the selected virtio-net TX queue end-to-end (ddf-provider-virtio-net-driver-closeout, 2026-05-23). A userspace process publishes real selected-queue TX descriptors, rings the doorbell through a DeviceMmio notify-write claim, consumes the TX used-ring completion, and exposes CQ identity – all through user-mode DMAPool/DMABuffer/DeviceMmio/ Interrupt authority, with no silent fallback to the in-kernel virtio-net TX helper while the provider owns TX. RX is bounded synthetic-token CQ identity (kernel RX cohabitation explicit). DMA backend is manager-owned bounce buffers.
  • Manager-granted provider/consumer authority lifecycle (ddf-userspace-driver-provider-consumer, 2026-05-11). A userspace provider consumes manifest-granted DMAPool/DeviceMmio/Interrupt authority; stale-authority rejection, revoke, and release/reset/driver-death teardown are proven.
  • GCP virtio-net function bound through the gate locally in QEMU (cloud-gcp-virtio-net-local-qemu-binding, 2026-05-26). The enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net surface (vendor 0x1af4), the resolved DMA backend is the labeled bounce-buffer path, proven by make run-net and make run-ddf-provider-consumer.
  • DMA backend selection (cloud-dma-backend-selection, 2026-05-24): boot probe -> fail-closed select -> manifest override; GCE resolves to bounce-buffer.
  • Production IOMMU remapping closeout (ddf-iommu-remapping-production-closeout, 2026-05-23): the direct-remapping domain path for IOMMU shapes (make run-iommu-remapping).
  • First BlockDevice CapObject (ddf-blockdevice-boundary-virtio-blk-smoke, 2026-05-25): a bounded sector write/read-back over virtio-blk (make run-virtio-blk). Note: this BlockDevice is kernel-side, over manager-owned bounce buffers – it is not a userspace storage provider.

Boundary Of The Foundation (where userspace ownership stops today)

  • NIC: userspace owns virtio-net TX; RX is synthetic/cohabited. No live hardware RX used-ring ownership, no direct DMA/IOMMU on the provider path, no cloud enumeration.
  • Storage: there is no userspace storage provider of any device class. The BlockDevice cap is kernel-side; NVMe is metadata-only (kernel/src/pci.rs enumerates the controller and emits a no-authority/ no-driver ... controller_init=not-started line, no register/queue/IDENTIFY/IO code). The NIC userspace driver does not transfer to storage: NVMe is a different device class (admin/IO submission+completion queue pairs, doorbells, PRP/SGL), and even userspace virtio-blk/virtio-scsi has no provider driver – the foundation seam makes it possible, but no slice has built it.
  • Production grant sources stage an arbitrary function through one device-agnostic entry point (done 2026-05-30). The non-qemu {dmapool,devicemmio,interrupt}_grant_source_prod statics previously inferred their candidate function from a hardcoded selection rule narrowed by #[cfg(feature = "cloud_*")] blocks scattered through each pick_candidate body. cloud-prod-grant-source-despecialization replaced that with one stage_with_class entry point per source that takes an explicit ProdGrantClass device-class descriptor (cap::prod_grant_source_class): AnyFunction (plain BAR / first usable function), DmaCapable (virtio or NVMe), or NvmeController (NVMe only); the DeviceMmio source additionally takes the explicit mapped-window length (one page for the plain/virtio-net notify family, two pages for the NVMe CC/admin-register selected-write region). The no-arg init() wrappers select the build’s descriptor and delegate, so a non-virtio-net function is staged by passing the matching descriptor rather than by reaching virtio-net-specific code. The transitional in-kernel qemu-path grant sources still carry the per-function init_*_for_device / init_provider_* variants; those follow the virtio transport into userspace under Phase C of the networking proposal rather than through this slice.

Per-Task Gap (the narrow real Y)

cloud-gcp-virtio-net-nic-driver -> runnable-now claim is superseded

The 2026-05-27 version of this document concluded that the GCP virtio-net live driver task was runnable as a cloud-evidence slice. That conclusion is now stale. The local production cloudboot bind markers have landed, but cloud-prod-provider-nic-bound-local-proof deliberately settled its completion boundary with a kernel-side dispatch-slot proxy because the production userspace-provider grant/waiter surface is still not available in the non-qemu cloudboot build.

The current local production chain is therefore still implementation work, not just billable evidence capture: The cloud-prod-provider-devicemmio-grant-source-local-proof, cloud-prod-provider-dmapool-grant-source-local-proof, and cloud-prod-provider-interrupt-grant-source-local-proof children are done (2026-05-28): the non-qemu cloudboot kernel can deliver DeviceMmio, DMAPool, and Interrupt grants to small userspace provider services through manifest/process-spawner delivery, each with its own local-QEMU proof and bounded caveats. The aggregate docs-status closeout cloud-prod-provider-grant-surface-local-proof is also done (2026-05-28): it records those landed children as one provider grant-surface boundary without adding new behavior. The remaining local production work is cloud-prod-provider-cap-waiter-local-proof, then cloud-prod-virtio-net-userspace-provider-local-proof (and the brokered NVMe sibling). Only after those local production userspace-provider tasks land does the live-GCE NIC task reduce to a cloud evidence/harness run.

The access and spend corrections still stand: GCE access is provisioned and the operator authorized billable runs on 2026-05-27. The blocker is local production userspace-provider authority, not cloud access.

Storage tasks -> gap is a userspace NVMe-class storage provider

cloud-gcp-storage-driver, cloud-gcp-storage-local-qemu-binding, cloud-aws-nvme-storage-driver, cloud-azure-disk-storage-driver all reduce to the same genuine missing piece: a userspace storage provider driver. virtio-net TX ownership does not carry to storage. Two real sub-gaps:

  1. No userspace storage provider driver. Either (a) a userspace virtio-blk/ virtio-scsi provider over the existing virtio seam (the kernel BlockDevice is kernel-side and does not satisfy the “no hidden kernel DMA ownership” acceptance), or (b) a userspace NVMe-class driver (controller bring-up + admin/ IO queue pairs + doorbells + PRP DMA) over the bounce-buffer/IOMMU backend. NVMe is the strategic target: GCP 3rd-gen+, AWS Nitro EBS, and Azure Boost are all NVMe, so one NVMe foundation unblocks all three providers’ storage legs.

  2. The no-IOMMU run-pci-nvme proof gate and the DMA-address ownership model. A real provider-driven NVMe completion + “no hidden kernel DMA ownership” + “no host-physical exposure” must all hold under the no-IOMMU bounce-buffer shape. The 2026-05-27 Model B override (provider writes queue-base/PRP addresses, kernel validates on notify) does not satisfy those constraints on the current no-IOMMU gate: device-visible equals host physical, and reviewed IOVA export discipline intentionally returns no usable device address to userspace.

    The correction is to split the lanes. Model B remains valid for a verified direct-remapping/vIOMMU gate, or a future synthetic address namespace translated by trusted code. The GCP/no-IOMMU lane must use brokered bounce: the provider owns NVMe protocol state and buffer/command capabilities, while the kernel or device manager materializes ASQ/ACQ, I/O queue-base, and PRP/SGL device-visible fields from the live DMAPool ledger. That is the only current path that preserves no-host-physical-exposure on GCP.

The ordered NVMe work therefore splits into:

  • no-IOMMU brokered lane: nvme-no-iommu-brokered-controller-enable (landed 2026-05-27 21:38 UTC, commit 11b86568) -> nvme-admin-queue-identify (landed 2026-05-27 22:34 UTC, commit cede5257) -> nvme-admin-interrupt-delivery (landed 2026-05-27 23:07 UTC, commit 18fd25c7) -> nvme-io-queue-and-read (ready brokered I/O/read);
  • direct-remapping lane: nvme-doorbell-dma-validator (landed mechanism) -> provider-written enable/admin/I/O slices on a verified IOMMU/vIOMMU gate.

Those are the real storage Y for the NVMe path; the virtio-scsi path is an alternative userspace provider of comparable size. None of this is “build a foundation” – it is “build a storage device-class provider on the existing foundation.”

AWS / Azure storage -> consume the GCP NVMe foundation + provider delta

cloud-aws-nvme-storage-driver and cloud-azure-disk-storage-driver already re-scope themselves to a small provider delta once the shared NVMe foundation lands. No new driver decomposition; their blocked-until is the GCP NVMe child chain. Their AWS/Azure NIC siblings (ENA, MANA) are vendor-custom and out of GCP-first scope.

What This Document Changes

  1. Supersedes the cloud-gcp-virtio-net-nic-driver runnable-now claim. The QEMU userspace virtio foundation remains useful grounding, but the live GCP NIC task stays blocked until the local production userspace-provider grant-source, waiter, and userspace virtio-net provider chain lands.
  2. Decomposes the storage gap GCP-first into a no-IOMMU brokered-bounce userspace NVMe lane for GCP and a separate direct-remapping Model B lane for IOMMU/vIOMMU proofs.
  3. Re-points AWS/Azure storage at the GCP NVMe child chain.

Design Grounding

  • docs/tasks/done/2026-05-25/ddf-virtio-driver-foundation-boundary.md
  • docs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md
  • docs/tasks/done/2026-05-11/ddf-userspace-driver-provider-consumer.md
  • docs/tasks/done/2026-05-26/cloud-gcp-virtio-net-local-qemu-binding.md
  • docs/tasks/done/2026-05-25/ddf-blockdevice-boundary-virtio-blk-smoke.md
  • docs/tasks/done/2026-05-24/cloud-dma-backend-selection.md
  • docs/tasks/done/2026-05-23/ddf-iommu-remapping-production-closeout.md
  • docs/proposals/nvme-model-b-doorbell-dma-validator.md (conditional Model B validator for direct-remapping/synthetic-address lanes)
  • docs/research/dma-userspace-driver-isolation.md
  • docs/dma-isolation-design.md (Cloud DMA Backend; IOVA export discipline)
  • kernel/src/virtio.rs (transport::VirtqueueDma, transport::Virtqueue), kernel/src/cap/{dma_pool,dma_buffer,device_mmio,interrupt,block_device}.rs, kernel/src/device_dma.rs, kernel/src/device_interrupt.rs, kernel/src/pci.rs
  • docs/proposals/cloud-deployment-proposal.md, docs/backlog/hardware-boot-storage.md#cloud-device-tracks

NVMe Userspace Provider: Conditional Model B Doorbell/Notify DMA Validator

Operator Decision (2026-05-27)

The userspace NVMe-class storage provider (docs/proposals/cloud-driver-foundation-gap-analysis.md, NVMe child chain) selected Model B: provider-writes-everything, kernel-validates-on-notify for the direct-remapping userspace-driver lane. This was intended to override the kernel-mints-the-address model (Model A) that the gap analysis originally recommended for the storage chain and that the landed virtio-net TX provider uses.

The operator’s stated reason: capOS wants the genuine userspace-driver model, where the driver process — not the kernel — owns and writes the device-visible addresses it programs into the controller. Model A keeps device-address minting inside the kernel, which is safe but is not a real userspace driver: the provider only places a value the kernel already chose. Model B makes the provider a first-class driver and moves the kernel from address-author to address-validator.

Correction recorded later on 2026-05-27: Model B cannot be used on the current no-IOMMU run-pci-nvme or probed GCP bounce-buffer path without exporting host physical addresses to userspace. It remains valid for a verified direct-remapping/vIOMMU lane, or for a future synthetic device-address namespace that the manager translates before hardware sees it. The GCP/no-IOMMU path must use brokered bounce address publication instead.

This is a design-and-task slice only. The landed nvme-doorbell-dma-validator mechanism remains the direct-remapping/synthetic-address validator component; the no-IOMMU controller-enable work is re-planned as a brokered-bounce slice.

Model A vs Model B

DimensionBrokered address publication (kernel/device-manager materializes)Model B (provider-writes, kernel-validates)
Who writes the device-visible addressKernel or device manager writes queue-base/PRP/SGL values from live buffer authority. Provider submits typed requests or places opaque kernel-authored values only when that is safe.Provider writes the device-visible address itself into ASQ/ACQ/SQ/CQ bases and PRP/SGL entries.
Kernel roleAuthor of every device address; trivially correct by construction; no scan needed.Validator: on each doorbell/notify, scan the submitted descriptors/queue-base registers and reject any address outside the owner’s granted DMA window.
New kernel componentNone.A ring/queue-scan on-notify DMA validator (this proposal).
Driver authenticityProvider owns protocol choices but not raw device-address authorship. This is required when device-visible equals host physical.Provider is a real driver that owns its addresses.
Where it appliesNo-IOMMU brokered-bounce paths, including probed GCP shapes and the current no-IOMMU run-pci-nvme gate.Verified direct-remapping/vIOMMU paths, or a future synthetic address namespace.

The two models coexist. The existing virtio-net TX path keeps brokered/kernel- authored device addresses. The NVMe validator is retained for lanes where provider-written addresses are not host physical addresses. A DeviceMmio doorbell claim must declare which model is active; no-IOMMU claims must not accept provider-authored raw device addresses.

What Model B Requires: the On-Notify DMA Validator

The validator is a kernel component invoked on the doorbell/notify path of the NVMe provider’s DeviceMmio selected-write claim. Before the doorbell write reaches the device (i.e. before the controller can fetch the just-submitted descriptors or act on a just-programmed queue base), the kernel scans the device-visible addresses the provider wrote and fails closed if any address is not inside that owner’s granted DMA window(s).

Scan targets (what the validator reads)

  1. Queue-base registers, scanned when the doorbell/notify that arms a queue is rung (or on the controller-enable / CC.EN write that activates the admin queue): ASQ, ACQ, and the I/O SQ/CQ base addresses the provider programmed through its selected-write DeviceMmio claim.
  2. Submission-queue entries newly made visible by an SQ tail doorbell: the PRP1/PRP2 entries (and, where used, the PRP list pages and SGL descriptors) of each NVMe command between the last validated tail and the new tail. The validator follows one level of PRP-list indirection; deeper SGL/PRP-chain shapes are out of scope for the bounded proof and are rejected, not silently accepted.

The validator scans only on notify — not on every provider memory write. The provider may freely write into its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, and that is the single choke point the kernel guards. This bounds the validation cost to the descriptors a single doorbell newly publishes (one queue entry for a depth-1 admin proof, a small bounded batch otherwise), not to the whole address space.

Invariants (fail-closed on any violation)

  • Bounds. Every scanned device-visible address, and the full extent of the region it names (queue size × entry size for a queue base; transfer length for a PRP/SGL data pointer), must lie wholly within a DMA window granted to the owning provider. An address at the window edge whose region runs past the window end fails closed. Unaligned queue-base or PRP addresses (NVMe requires page-aligned PRP1 for the first entry, dword-aligned queue bases) fail closed.
  • Owner-scoping. The window set checked is exactly the set granted to the provider that owns the DeviceMmio doorbell claim being rung. An address that is valid for another owner’s window is rejected for this owner: no aliasing into a different owner’s DMA region, no host-physical address, no out-of-any-window address. The validator resolves “owner” from the doorbell claim’s grant identity, not from the address value.
  • No host-physical / no out-of-window. The provider-written value must be a domain-scoped IOVA or synthetic device address, never a host physical address. On the current no-IOMMU bounce path this invariant cannot be satisfied by provider-authored queue-base/PRP values, because device-visible equals host physical and userspace export is disabled.
  • Stale-completion / generation. The validator binds its accept decision to the live grant generation of the owner’s DMA window and doorbell claim. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even if the byte value would have been in-window for the prior grant. Completions are accepted only against the issue/generation that was live at submission scan time, matching the existing stale-completion gate on the virtio-net path; a completion whose submission was never validated (or was validated under a now-retired generation) does not wake a waiter.
  • On-notify timing. The scan completes and either accepts or rejects before the doorbell write is allowed to take effect on the device. A rejected scan does not write the doorbell, returns a fail-closed error to the provider’s DeviceMmio write, and records the rejection; the device never sees the descriptor batch. There is no window in which the controller can fetch an unvalidated descriptor.
  • Quiesce/teardown. On release/reset/driver-death, in-flight doorbell scans are quiesced, the owner’s windows are removed from the validator’s accepted set, backing pages are scrubbed before frame reuse, and any subsequently rung doorbell against the retired grant fails closed.

Where it hooks

The validator hooks the NVMe provider’s selected-write DeviceMmio doorbell claim in the kernel capability layer — the same selected-write claim the bring-up slice scopes to the NVMe enable/admin-queue-base/doorbell registers (mirroring the virtio-net notify-write claim). Concretely:

  • The doorbell/queue-base DeviceMmio.write* path (kernel/src/cap/device_mmio.rs) gains a pre-write validation step for the NVMe doorbell/queue-arm register subset.
  • The scan reads the provider’s mapped SQ pages and queue-base register shadow through the manager-owned DMA window records (kernel/src/device_dma.rs), checking containment against the owner’s granted window descriptors. It does not gain a generic memory-read authority over the provider; it reads only the descriptor/queue-base bytes the doorbell newly publishes, via the manager’s record of the owner’s DMA pages.
  • Generation/owner identity comes from the grant ledger (kernel/src/device_dma.rs / the *_grant_source records), not from provider-supplied metadata.

This is a kernel-side, capability-scoped, on-notify check — not a new ambient syscall and not a per-write trap on all provider memory.

Performance note

The validator runs only on the notify/doorbell path, not on the data path and not on every provider write. Its cost is O(descriptors newly published by this doorbell) — one entry for the depth-1 admin/IDENTIFY proof, a small bounded batch for the I/O queue. Steady-state provider memory writes between doorbells are uninstrumented. This keeps the genuine-driver model without a per-access trap and without copying the data path through the kernel.

No-IOMMU Correction And Brokered Bounce Path

On GCE shapes without a usable guest IOMMU, and on the current no-IOMMU make run-pci-nvme gate, the labeled bounce-buffer backend does not provide a provider-visible IOVA namespace. The device-visible value a real NVMe controller consumes is the host physical or bus address of a manager-owned page. Publishing that value to userspace would violate the reviewed no-host-physical-exposure invariant.

Therefore the no-IOMMU storage path must be brokered:

  • The provider receives buffer capabilities, queue ownership handles, and typed NVMe command intent, not raw queue-base or PRP addresses.
  • The kernel or device manager allocates/pins the bounce pages and writes AQA/ASQ/ACQ, I/O queue-base, and PRP/SGL fields from the live ledger.
  • The selected DeviceMmio claim gates CC.EN, queue-arm, and doorbell writes on the brokered ledger state, not on provider-supplied numeric addresses.
  • Teardown still quiesces outstanding DMA, blocks stale completions, scrubs pages before reuse, and keeps hostile_hardware_isolation=not-claimed.

Model B can be reintroduced for NVMe when the proof gate is a verified direct-remapping/vIOMMU shape where the provider-visible value is a domain-scoped IOVA, or after capOS implements a synthetic address namespace that is translated by trusted code before the controller observes it.

Brokered Alternative For No-IOMMU

The brokered model is no longer a rejected storage alternative for no-IOMMU targets. It is the required GCP/no-IOMMU design until a safe non-host-physical device-address namespace exists. Its tradeoff is narrower driver authenticity: userspace owns NVMe protocol state and command construction, but trusted kernel or manager code remains the author of raw device addresses.

Implementing Slices

  • nvme-doorbell-dma-validator (landed 2026-05-27 08:56 UTC): the kernel on-notify DMA validator mechanism (kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan / completion_wakes_waiter) and its invariants, proven by the bounded cfg(qemu) hostile-scan self-test (prove_qemu_on_notify_scan_contract) that make run-pci-nvme asserts: out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deeper-PRP-chain, and stale-generation all fail closed with no doorbell write and no waiter wake. Synthetic owner windows stand in for the live grant ledger; the live DeviceMmio doorbell-path wiring is the bring-up slice below. This is the kernel component Model B requires; the controller bring-up slice depends on it. Provenance map: NVMe.
  • nvme-no-iommu-brokered-controller-enable (landed 2026-05-27 21:38 UTC, commit 11b86568): no-IOMMU replacement for the blocked provider-written enable task; brokered admin queue-base materialization with no host-physical export.
  • nvme-userspace-bind-and-controller-bringup: remains blocked unless re-scoped to an IOMMU/vIOMMU proof lane or replaced by the brokered no-IOMMU slice above.
  • nvme-admin-queue-identify (landed 2026-05-27 22:34 UTC, commit cede5257) closes the no-IOMMU admin command.
  • nvme-admin-interrupt-delivery (landed 2026-05-27 23:07 UTC, commit 18fd25c7) closes the admin completion wake.
  • nvme-io-queue-and-read is the ready brokered I/O/read continuation. It inherits the same split: provider-written PRPs require direct remapping or a synthetic namespace; no-IOMMU GCP planning requires brokered PRP materialization.

Design Grounding

  • docs/proposals/cloud-driver-foundation-gap-analysis.md (the foundation map and the original Model A recommendation this overrides for storage)
  • docs/dma-isolation-design.md (Cloud DMA Backend; bounce-buffer fallback; IOVA/window discipline; teardown/scrub ordering)
  • docs/proposals/dma-assurance-model-proposal.md
  • docs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md (the Model A virtio-net TX provider that this leaves unchanged)
  • kernel/src/cap/device_mmio.rs (the selected-write claim the validator hooks), kernel/src/device_dma.rs (owner DMA window records / grant generation), kernel/src/cap/{dma_pool,dma_buffer,interrupt}_grant_source.rs, kernel/src/pci.rs (NVMe enumeration today)

Remote Session UI Security Proposal

The current Linux remote-session-ui bridge in tools/remote-session-client/src/bin/remote_session_ui.rs is a trusted local web bridge: a loopback HTTP listener whose Rust backend owns the TCP connection to the capOS gateway and the upstream session, while browser JavaScript receives only DTOs (view models, call results, denial diagnostics, and redacted transcript rows). This document describes the web-security posture required before that bridge ships beyond research use, and how the Tauri desktop wrapper inherits the controls. It also records which browser-facing controls carry over to the capOS-served remote-session-web-ui service and which public-origin controls belong to the selected GCE provider-terminated HTTPS policy, without authorizing public exposure. It is cross-linked from docs/proposals/security-and-verification-proposal.md, docs/proposals/remote-session-capset-client-proposal.md (the parent proposal that defines the remote session CapSet wire and host-client shape this bridge instantiates), and the design risks register entry R17 – Remote-session UI bridge and Tauri wrapper are research-only, which routes long-horizon residual risk (distributable packaging, desktop automation, non-loopback exposure) back to this proposal.

Threat Model

The bridge holds the operator’s authority to drive the capOS gateway. Anything that can issue HTTP requests to the loopback listener inherits that authority. The original bridge shape had:

  • A single shared Arc<Mutex<AppState>> constructed once in run() (around line 1606) and cloned to every accepted connection.
  • No per-browser session cookie, no per-tab token, no per-origin isolation, no proof-of-possession of the original operator login.
  • An origin allow-list that returns true when the Origin header is absent (origin_allowed, line 2163-2169), which lets non-browser POSTs bypass the only state-change guard.
  • Plain http://127.0.0.1:<port>/ transport.

Already closed:

  • The previous non-constant-time != comparison on the automation token has been replaced with constant_time_eq in automation_report and set_automation_report (see tools/remote-session-client/src/bin/remote_session_ui.rs:1378 and :1392). Future secret comparisons must use the same comparator.
  • The loopback bridge now mints per-browser BrowserSession cookies, requires CSRF tokens on state-changing /api/* routes, validates Host / Origin / JSON content type before route work, and enforces first-wins bridge ownership through an atomic tentative reservation.
  • The local HTTP parser now bounds request-line length, header-line length, header count, aggregate header bytes, body size, slow reads, and concurrent handler threads before gateway or authentication work.

Gateway-host redirect scope. POST /api/config is intentionally operator-controlled: it allows an authenticated operator to point the bridge at a different gateway_host. This is bounded by the operator-console trust boundary — only a caller who has already passed the BrowserSession cookie guard and the CSRF double-submit check (i.e., the bridge-owning operator session) can invoke it. The capability model provides the deeper guarantee: the bridge holds a single capOS gateway connection at a time; redirecting to an arbitrary host replaces that connection but does not grant new capability authority that wasn’t already present in an authenticated operator session. No arbitrary-host proxy to untrusted endpoints is possible without an authenticated operator action.

Treating 127.0.0.1 as a trust boundary repeats the failure pattern of historic Docker, Jupyter, and Electron loopback CVEs: any local user, another OS account, a malicious browser extension, a locally-running package install script, or any other process that can connect(2) to the listener can drive the upstream capOS gateway with the operator’s authority. Two browsers today silently share one upstream session; there is no way for an operator or audit log to distinguish them.

Required Posture

Per-browser BrowserSession

Mint a high-entropy opaque session id at the first browser hit and store it server-side as a BrowserSession record distinct from the upstream capOS session. The cookie is the only thing the browser holds; everything else stays in AppState. Two browsers must end up with two BrowserSession records.

Cookie attribute target:

  • HttpOnly
  • SameSite=Strict (the loopback bridge has no cross-site sign-in redirect, so Strict is unconditional here; the capOS-served remote-session-web-ui behind public ingress selects the posture from the boot manifest instead – Strict by default, Lax only when an IAP-fronted deployment manifest grants the iap_fronted_ingress marker, per the selected policy in cloud-deployment-proposal.md – and applies it uniformly to the session, CSRF, and clear-cookie headers)
  • Path=/
  • Host-only: no Domain attribute.
  • __Host- name prefix when transport allows it (requires Secure and Path=/ and forbids Domain).
  • Max-Age=... plus an absolute upper bound enforced server-side.

Secure cookie attribute over plaintext loopback is browser- and version-specific. Modern browsers do treat 127.0.0.1 and ::1 as potentially trustworthy origins for some Secure-Context APIs, but acceptance and sending of Secure-flagged cookies over plaintext loopback is not uniform across vendors and versions. Two acceptable deployment paths:

  1. Move the bridge to HTTPS or to the Tauri custom-scheme secure origin before requiring Secure and __Host-.
  2. Run on plaintext loopback as an interim with HttpOnly; SameSite=Strict; Path=/; Max-Age=... and no Secure / __Host-, with a documented support matrix and a test that proves browsers retain and resend the cookie across the supported range.

Decision: option 2 (plaintext loopback, no Secure, no __Host-). This matches the current research-stage operator-bridge deployment, which only listens on 127.0.0.1 and is not reachable from the network. Cookie attributes are therefore HttpOnly; SameSite=Strict; Path=/; Max-Age=<absolute-timeout-secs> exactly – no Secure and no __Host- prefix. The follow-on Tauri / HTTPS track will switch to option 1 (with Secure and __Host-) before shipping beyond research use; the cookie-emit code carves out one place to flip both attributes when transport changes.

Browser support matrix verified for option 2 (cookies retained and resent across loopback HTTP without Secure):

BrowserMin versionNotes
Chromium96+127.0.0.1 is a potentially-trustworthy origin
Firefox96+same; SameSite=Strict enforced for loopback
Safari15.4+macOS 12.3+ / iOS 15.4+
Edge96+matches Chromium

The verification host test in iter7 round-trips a cookie through a synthetic loopback request to assert browsers within this matrix retain and resend it. Older browsers (pre-Same-Site-Strict enforcement on loopback) are not supported.

The design must not silently rely on a Secure flag that some target browsers drop.

Server-side requirements:

  • High-entropy opaque ids; never derived from user-controlled input.
  • Server-side rotation: regenerate the BrowserSession id on successful login and on privilege transitions, and invalidate the prior anonymous/pre-auth record. (Session-fixation defense.)
  • Server-side invalidation on logout, idle timeout, absolute timeout, and explicit revoke; wipe the record from AppState.
  • Cookie value must never be logged, never written to the transcript, and never included in any DTO returned to the browser.

Multi-browser policy

Pick one and document it here:

  • (a) Independent logins. Each BrowserSession carries its own upstream capOS session; logging in from a second browser opens a second upstream session.
  • (b) First-wins exclusivity. The first authenticated BrowserSession owns the upstream session; subsequent browsers see an explicit “session already in use” denial DTO rather than silent piggy-backing.

Either is acceptable if explicit and audit-logged. Silent shared state is not.

Decision: option (b), first-wins exclusivity. The bridge today holds exactly one upstream capOS session per process, and the research-stage operator boot does not have a clean way to multiplex two operator-authority sessions through a single capOS gateway connection. First-wins is also the auditor-friendlier path: every denied “session already in use” carries the active BrowserSession’s timestamped lineage, so the operator and audit log see the rejection rather than silently sharing state. Concretely:

  • The first browser that starts /api/login/password, /api/login/anonymous, or /api/login/guest after passing local request guards reserves the owner slot before upstream gateway authentication. Successful login rotates the BrowserSession id, marks that slot authenticated, and keeps the upstream capOS session handle in AppState.
  • Failed local login validation, bad credentials, and gateway denials release the tentative reservation. An already authenticated owner is not released by a later bad retry from the same browser session.
  • Subsequent BrowserSessions authenticating against the same bridge get a typed sessionAlreadyInUse denial DTO rather than an upstream login attempt, including while the first session is still authenticating upstream. The denial includes the owner’s claim or authentication timestamp so the second operator sees when the bridge was claimed.
  • Logout / idle-timeout / absolute-timeout on the owner releases the upstream session and clears owner_session_id; the next authenticator wins.
  • Every transition (claim / denial / release) emits a structured audit event into the same stream as upstream capOS session events so an operator looking back can see the bridge contention pattern.

The Tauri wrapper inherits this rule per-window unless the wrapper introduces an explicit multi-window upstream-fanout authority the loopback bridge does not have.

CSRF and origin discipline

  • Require a valid BrowserSession cookie on every /api/* route, not only state-changing routes. Today’s GET /api/state, GET /api/transcript/redacted, and GET /api/automation/report expose state, transcript, and automation surfaces and must not rely on SOP/loopback assumptions alone.
  • Reject state-changing requests when Origin is missing. The current origin_allowed short-circuit on missing Origin (line 2164) must be removed for state-changing methods. Validate Origin against the listener’s expected loopback origin set, and validate Referer as a fallback only when Origin is absent on legacy paths.
  • Add a double-submit CSRF token bound to the BrowserSession cookie and required on every state-changing POST. SameSite=Strict is not sufficient defense in depth on its own.
  • Defense-in-depth via Fetch Metadata: reject browser POSTs whose Sec-Fetch-Site is cross-site or whose Sec-Fetch-Mode is not in the expected set for the route. This is not a replacement for CSRF/Origin, but adds another layer.

DNS-rebinding hardening

Validate the Host header against the loopback set {127.0.0.1:<port>, localhost:<port>, [::1]:<port>}. Without this, DNS-rebinding from a malicious public site can use the victim’s browser as a proxy into the loopback bridge.

Content-Type enforcement

Reject POSTs whose Content-Type is not application/json (or the specific expected type for the route). This blocks text/plain / form-urlencoded cross-origin form submits that bypass preflight.

Implemented on both surfaces. The capOS-served remote-session-web-ui normalizes the header (casing and ;-parameters stripped) and requires the application/json media type on every state-changing /api/* POST class – login-family and authenticated – before route work, with a typed 415 denial (missingContentType / unsupportedContentType). The more specific Host/Origin denials keep precedence, and the fixed non-JSON routes (/healthz, bundle assets, the scoped ACME http-01 challenge path) are unaffected. make run-cloud-prod-remote-session-web-ui-l4 proves the negative matrix (missing, text/plain, form-encoded, multipart, malformed, mixed-case parameterized non-JSON) and the parameterized/mixed-case JSON positives over the real ingress path. This is local request-shape hardening only; it is not public ingress or TLS readiness.

Local HTTP request and handler bounds

The host bridge remains a trusted local development bridge. These bounds reduce local resource-exhaustion and confused-client failure modes; they do not make the UI a public network service.

The HTTP parser must reject overlong request lines, overlong header lines, too many headers, excessive aggregate header bytes, and overlarge bodies before route dispatch, JSON parsing, authentication, or gateway I/O. Incomplete or slow request lines, headers, and bodies must time out under a fixed read deadline. The accept loop must also cap concurrent request handler threads and fail closed with a typed local denial rather than spawning one thread per accepted connection without bound.

CORS stance

Emit no Access-Control-Allow-Origin by default. If a future route ever needs CORS, allow only the exact same-origin echo of the listener URL. Refuse wildcards. Refuse Access-Control-Allow-Credentials: true combined with permissive origins. Document the rule in code so future contributors do not accidentally widen it.

Security response headers

Implemented in the in-guest remote-session-web-ui service (SECURITY_RESPONSE_HEADERS / CONTENT_SECURITY_POLICY in demos/remote-session-web-ui/src/main.rs), emitted on every response class – HTML, static assets, JSON API, /healthz, the ACME http-01 challenge route, and every denial:

  • X-Frame-Options: DENY (anti-clickjacking).
  • X-Content-Type-Options: nosniff.
  • Referrer-Policy: no-referrer.
  • Cross-Origin-Opener-Policy: same-origin.
  • Cross-Origin-Embedder-Policy: require-corp.
  • Cross-Origin-Resource-Policy: same-origin.
  • Cache-Control: no-store.

The implemented shape applies Cross-Origin-Resource-Policy: same-origin and Cache-Control: no-store to every response, not only API responses: every asset is consumed same-origin by the operator app, non-browser consumers (provider health checkers, ACME validators) ignore browser embedding policy, and serving the fixed boot-resource bundle uncached is acceptable for the operator UI. Relaxing caching for static assets would be a deliberate future change, not a default.

The implemented Content-Security-Policy meets the no-unsafe-inline target for both script-src and style-src:

default-src 'none'; script-src 'self'; style-src 'self';
img-src 'self' data:; connect-src 'self'; base-uri 'none';
form-action 'self'; frame-ancestors 'none'

img-src allows data: in addition to 'self' because the committed stylesheet’s hacker-theme dashed border is a data:image/svg+xml background image; a data: image cannot execute script under this policy, and folding it into the pinned bundle as a file asset would be a separate reviewed bundle change. The earlier inline feature-flag script and inline style="..." attributes in tools/remote-session-client/ui/index.html were moved into static bundle assets (/feature-flags.js, the stylesheet) before the CSP landed, so the strict policy serves the fixed bundle without nonces or hashes. The local QEMU proof (make run-cloud-prod-remote-session-web-ui-l4) asserts the header set and CSP on every response class over the real ingress, boots the served root document in a real browser under the strict CSP with zero securitypolicyviolation events, and asserts no Access-Control-* header is emitted on any probed route.

Constant-time secret comparison

The automation-token check has been migrated to constant_time_eq (automation_report and set_automation_report in tools/remote-session-client/src/bin/remote_session_ui.rs). Apply the same comparator to the future BrowserSession cookie value lookup, the CSRF token check, and any future bearer/HMAC validations.

Auth-endpoint rate limiting and lockout

Add per-BrowserSession and per-listener rate limits to /api/login/password and any future credential-handling routes. Exponential backoff on failure. Audit-logged lockout. Wire into the same audit stream as upstream session events so the operator sees failed attempts.

Idle and absolute timeouts

Independent of the upstream capOS session expiry, expire BrowserSession cookies on idle and on absolute lifetime. Force re-auth on resume. Rotate the cookie id on re-auth.

Log injection / transcript safety

Sanitize browser-supplied strings routed into the transcript or stderr for CRLF, ANSI escape sequences, and control bytes so a hostile client cannot forge transcript rows or terminal control on operator stderr.

DTO-only-to-webview discipline

Keep the existing *Vm DTO boundary in tools/remote-session-client/src/bin/remote_session_ui.rs (lines ~199-382). The browser must never receive raw cap handles, raw interface ids, or unredacted session ids. The CapVm.interface_id field is already #[serde(skip_serializing)]; preserve that pattern for any new fields.

Self-Served And Public-Origin Carry-Over

The host-local remote-session-ui bridge and the capOS-served remote-session-web-ui service are different deployment surfaces. The host bridge is a trusted Linux loopback development tool whose backend owns the TCP gateway connection. The self-served service is a capOS userspace HTTP service that owns its TcpListenAuthority, session-manager login flow, authority-broker bundle, and remote CapSet/proxy state inside the guest. The host bridge is not the self-served service moved into the guest.

The authority boundary is the shared rule. Browser JavaScript receives only view models, typed commands, typed results, denials, redacted transcript/status rows, and fixed UI assets. It must not receive raw capOS capabilities, raw cap ids, endpoint-owner authority, ProcessSpawner, socket factories, NetworkManager, TcpListenAuthority, TcpListener, TcpSocket, key material, remote CapSet handles, result-cap slots, process handles, host usernames, host paths, host environment markers, or QEMU-forwarding identity hints. These exclusions match the self-served Gate 1B boundary in Remote Session CapSet Client and the implementation proof records under remote-session-self-served-full-ui-bundle.

Forbidden browser-visible surface matrix:

Forbidden browser-visible classTrusted owner or denial boundaryProof / denial expectation
Raw capOS capabilities, raw cap handles, raw interface ids, and local cap idsHeld only by the remote-session-web-ui backend, its server-side proxy state, or the upstream gateway connection.Browser envelopes, DOM state, diagnostics, transcripts, and JSON contain only DTO names and redacted labels; any browser request that tries to name a cap id fails before backend dispatch.
Endpoint-owner authority and arbitrary endpoint creationOwned by the backend service runner and AuthorityBroker policy, not by browser state.Browser launch forms name only approved service descriptors; denied launches return typed denial DTOs without endpoint-owner tokens or creation handles.
Process handles, raw ProcessSpawner, and shell launcher authorityKept behind AuthorityBroker-approved remote-client bundle policy.Status and transcript rows expose only redacted process/service state; process handles and spawner markers are absent from browser-visible data.
NetworkManager and TcpListenAuthorityremote-session-web-ui owns only the manifest-scoped UI listener for the selected proof target; the open cloudboot L4 task must source that listener through the Phase C userspace network path rather than browser or raw manager authority.Listener/source metadata is service-derived from the accepted socket plus a service event id; browser requests cannot supply trusted source, route, or listener authority.
TcpListener, TcpSocket, and socket factoriesThe HTTP accept loop owns accepted sockets and per-connection state server-side.Browser JavaScript uses ordinary same-origin HTTP commands only; socket factory names, accepted-socket handles, and backend connection handles never appear in DTOs.
Key material, TLS private keys, certificates, public IPs, and firewall rulesPublic-origin TLS and ingress remain in the on-hold provider-terminated HTTPS task; local and private proofs do not hold these secrets in the browser or capOS Web UI.Local self-served and cloudboot proofs must not emit TLS key/certificate material, provider resource ids, public addresses, or firewall rule names as browser-readable state.
Remote CapSet handles, backend cap holders, session-global ids, and result-cap slotsStored in server-side remote-session proxy tables and invalidated through backend logout/stale-call rules.Browser commands reference typed route/request ids only; stale calls and unauthorized result access fail closed without leaking slot numbers or remote handles.
Host paths, host usernames, host environment markers, and QEMU-forwarding identity hintsLimited to development harness/operator context and not part of the capOS-served browser contract.DOM state, JSON responses, diagnostics, and transcripts use redacted service labels; source metadata is backend-derived and cannot be replayed from browser-supplied fields.

The matrix is a review checklist, not the enforcement mechanism. The browser boundary is acceptable only when the backend also rejects stale, unauthorized, or client-supplied authority selectors before any capability dispatch.

The carry-over controls are backend-held session state, server-side BrowserSession records, CSRF tokens on state-changing JSON routes, Host/Origin/Referer/content-type validation, no wildcard CORS, security response headers, request and handler bounds, per-session rate and resource limits, idle and absolute lifetime enforcement, logout that drops server-side authority, transcript sanitization, constant-time comparisons for secrets, and audit-visible denials. Those controls are required for the capOS-served service as well as the loopback bridge, but their concrete transport assumptions differ.

On the capOS-served remote-session-web-ui, the browser-boundary baseline is implemented and locally proven on make run-cloud-prod-remote-session-web-ui-l4: server-side session hardening (unpredictable rotated session ids, a domain-separated double-submit CSRF token, Host/Origin validation, and idle/absolute lifetime enforcement), GFE-range-pinned forwarded-scheme trust, the manifest-selected single public origin, the IAP-aware SameSite cookie posture, JSON content-type rejection on state-changing /api/* POSTs, the uniform security response headers with the strict no-unsafe-inline CSP, in-guest login peer-gating with failure backoff, and the public /healthz health-check contract. All of that evidence is local QEMU/cloudboot proof only; none of it claims private GCE reachability, public ingress, TLS custody, or operator exposure.

Two browser-boundary local proofs remain open as dispatchable task records under docs/tasks/, not done: a public-deployment loopback gate that rejects loopback Host/Origin/Referer acceptance and loopback-shaped source hints when the public-origin load-balancer posture is configured (the landed local proofs intentionally preserve the QEMU loopback posture), and a consolidated browser-visible forbidden-marker matrix proof that scans every response class – success, denial, health, manual, and error bodies – for the forbidden surface above and proves hostile browser-supplied authority fields fail closed before backend-held capability dispatch.

Loopback-only decisions do not carry to a public origin. The plaintext http://127.0.0.1 cookie exception above is only for the trusted local bridge. A public operator endpoint must use the selected policy in Cloud Deployment: one HTTPS origin at a GCP external Application Load Balancer, no wildcard CORS or cross-origin credentialed requests, provider-terminated TLS with no capOS or harness private-key custody for the bootstrap proof, capOS serving only plain HTTP/1.1 on the backend port, no public IP on the VM, and firewall-bounded trust in the load balancer’s forwarded-scheme headers. Public sessions use Secure/HttpOnly/SameSite cookies, HSTS at the HTTPS edge, CSRF Origin/Referer checks against the known public origin, bounded idle and absolute lifetimes, and server-side logout.

The forwarded-scheme half of that trust boundary is already implemented and locally proven on the capOS-served service: remote-session-web-ui honors X-Forwarded-Proto only from the recorded GCP front-end source ranges (130.211.0.0/22, 35.191.0.0/16) and treats the header from any other peer – or any unknown peer-address format – as absent, so a direct client cannot forge secure-context cookie posture. make run-cloud-prod-remote-session-web-ui-l4 drives both the forged-header negative over the real ingress path and the trusted-forwarder fixture positive.

The single-public-origin half is also implemented and locally proven: remote-session-web-ui reads exactly one public_origin.<host> manifest marker cap (fail-closed on a second marker, a malformed, loopback-named, or IP-literal-shaped host, or any unrecognized extra grant) and accepts the configured https://<host> origin in its Host/Origin/Referer gates only for requests arriving through the trusted forwarded-scheme HTTPS path. Cross-origin, mixed-scheme, wildcard, and missing-origin state changes fail closed before backend-held capability dispatch, browser-supplied principal/source hint headers are rejected on the public-origin path, no CORS headers are ever emitted, and the loopback proof posture is unchanged. The same proof drives a direct-client forged public Host/Origin negative over the real ingress and the trusted-forwarder fixture positive in-process. This is local public-origin readiness only – no DNS name, load balancer, TLS endpoint, or live public exposure is claimed.

Keep the proof classes separate. The landed local/QEMU self-served UI bundle proof does not prove local cloudboot L4 over the Phase C userspace network stack. The local cloudboot L4 proof does not prove private GCE reachability. The private GCE proof does not authorize public IPs, firewall exposure, DNS, TLS certificates, or operator browser exposure from the internet. The later cloud-gce-public-self-hosted-webui-ingress-tls task remains on hold for explicit public-ingress/TLS authorization and must build against the selected provider-terminated HTTPS policy rather than raw public HTTP.

Tauri Wrapper

The repository now contains a check/dev Tauri wrapper scaffold under tools/remote-session-client/src-tauri/. It does not introduce a new remote-session authority boundary: make remote-session-tauri checks the wrapper and host Tauri prerequisites by default, and CAPOS_REMOTE_SESSION_TAURI_MODE=dev make remote-session-tauri launches cargo tauri dev. The webview loads http://127.0.0.1:3337/ from the existing remote-session-ui Rust backend, so the backend still owns the gateway TCP connection, remote session state, remote caps, and worker proxies. Webview JavaScript receives only the same view models, user events, typed results, denials, and redacted transcript rows as the trusted local web bridge.

The wrapper command also has a policy-only preflight: CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh. That preflight runs before Tauri dependency/build checks in the normal check path and does not require Tauri Linux packages or a desktop session. It fails closed if the reviewed scaffold drifts: bundling must stay disabled, both the Tauri devUrl and the single main window URL must remain http://127.0.0.1:3337, the default capability must grant only core:default to the main window, and the wrapper must not add app-specific Tauri commands, invoke handlers, generate handlers, or tauri-plugin-* dependencies/uses. This is a guardrail over the current check/dev scaffold only; it is not evidence that distributable packaging or desktop automation is reviewed.

The current check/dev wrapper therefore inherits the loopback HTTP bridge threat model:

  • Loopback HTTP controls apply. Host validation, Origin checks, CSRF tokens, per-BrowserSession cookies, request bounds, first-wins ownership, rate limiting, transcript sanitization, and DTO-only-to-webview discipline apply to the Tauri webview path unchanged because the webview talks to the same loopback backend.
  • No custom Tauri invoke authority. The current scaffold has no app-specific invoke commands for remote-session actions. Do not add Tauri commands that expose raw caps, cap ids, process handles, endpoint owner caps, result slots, host usernames, host paths, or gateway connection internals to the webview.
  • Distributable packaging is still residual. Bundling is disabled until the backend lifecycle is reviewed. A future packaged wrapper may keep a reviewed loopback sidecar or migrate to Tauri command IPC / custom-protocol assets, but that change must update this proposal and re-evaluate which loopback controls still apply. The wrapper’s package mode is intentionally blocked until that review is done.
  • Webview content is the attacker. If any non-trusted asset can ever load (remote frame, broken integrity check, mis-scoped asset protocol), webview JavaScript becomes the attacker. CSP, asset scope discipline, no remote frames, no eval-style hatches still apply.
  • Capability/allowlist minimization. Lock the Tauri capability manifest tightly. Every invoke command and every core API (fs, shell, http, dialog, process, window, clipboard, …) the frontend may call must be enumerated and minimized before distributable packaging is enabled. Misconfigured Tauri allowlists are the dominant Tauri CVE pattern; prefer per-window capability scoping over global allow.
  • Per-window BrowserSession isolation. If multiple windows are spawned over a shared Rust state, keep per-window BrowserSession isolation matching the loopback design.
  • Carry-over controls. Constant-time secret comparison, rate-limiting, idle/absolute timeouts, transcript-injection sanitization, DTO-only-to-webview discipline, and audit logging apply to the Tauri wrapper unchanged.
  • Desktop automation remains unreviewed. The wrapper’s automation mode is intentionally blocked until screenshot/input authority, automation-token handling, UI-smoke oracle scope, desktop session isolation, and fail-closed teardown have a reviewed design.

Verification

Before the corresponding review-finding task is closed:

  • Host tests cover each control above (cookie attributes, CSRF guard, Origin/Host validation, Content-Type rejection, CSP surface, header set, constant-time compare, rate limit, timeouts, log injection sanitization).
  • The CSP refactor of tools/remote-session-client/ui/index.html ships in the same change set as the CSP header.
  • The cookie-transport choice (HTTPS/secure-origin vs. interim plaintext-loopback no-Secure) is recorded in this proposal and the matching browser support matrix is documented.
  • The multi-browser policy choice is recorded in this proposal and reflected in audit logs and DTO denial diagnostics.
  • The Tauri wrapper check/dev scaffold keeps the existing loopback bridge controls in force, has no app-specific remote-session invoke commands, leaves distributable packaging disabled until the sidecar/custom-protocol/backend lifecycle is reviewed, and keeps the policy preflight passing as a narrow guardrail over that scaffold.

Proposal: capOS Repository Harness Engineering

This proposal applies OpenAI-style harness engineering to the capOS repository itself. The goal is not to add agent features to the operating system. The goal is to make this repository a better, safer work environment for long-running agents and human reviewers.

The related capOS-Hosted Agent Swarms proposal describes capOS as a future host for OpenClaw-like agent services. This proposal describes the repository infrastructure needed so agents can work on capOS without repeatedly rediscovering project state, extending superseded designs, choosing the wrong QEMU proof, or silently drifting documentation.

Why This Proposal Exists

The capOS repo is already heavily agent-shaped:

  • AGENTS.md and CLAUDE.md define workflow rules.
  • docs/tasks/state.toml selects the current milestone, and task records under docs/tasks/ define immediate gates.
  • docs/tasks/** records open remediation and review-finding work.
  • docs/proposals/, docs/backlog/, and docs/research/ hold design context.
  • docs/topics.md, docs/SUMMARY.md, and proposal indexes make docs navigable.
  • Make targets and QEMU harnesses prove behavior.
  • CUE manifests define focused system configurations.

That is enough for a careful agent to work, but it is not yet a complete harness. Too much project state still requires fragile human-style inference: which document is authoritative, which proposal is stale, which run target proves which behavior, which open finding blocks a task, and which design pivot explains why old text should not be extended.

OpenAI’s harness engineering lesson is direct: what an agent cannot inspect in its working context effectively does not exist. capOS should therefore compile its project state into repo-local, versioned, mechanically checked artifacts.

Two existing tracker documents already shape the harness contract this proposal builds on, and the artifacts below must stay consistent with them rather than re-derive their state:

  • Trusted Build Inputs inventories the toolchain, generated bindings, dependency policy, Limine pin, QEMU/OVMF observation, and host-tool surface the repo currently trusts. Any run-target, proof, or generated-code claim the harness exposes to agents must point back to that inventory rather than restate pinning or drift status independently.
  • Design Risks and Open Questions Register is the consolidated index of long-horizon design risks (including the supply-chain risk R13, the harness-coverage gaps, and the open-question pointers for proposal/backlog/design ownership). Harness artifacts that claim a risk is “tracked” should cite the register row, and new risks surfaced by harness checks should be filed there rather than buried in this proposal.

Scope

In scope:

  • agent-facing repository map;
  • task-selection and milestone state;
  • proposal/research/status consistency checks;
  • run-target and QEMU proof inventory;
  • machine-readable design relationships;
  • agent-maintained but reviewed knowledge compilation;
  • deterministic evals for future coding agents;
  • active-work and shared-resource visibility;
  • review and security handoff artifacts.

Out of scope:

  • capOS-hosted agent runtime implementation;
  • model provider selection;
  • browser, MCP, or A2A runtime integration;
  • replacing human review;
  • changing the current mandatory worktree workflow.

Design Principles

  1. Repository-local context wins. Important design and workflow state should live in tracked files, not in chat history or operator memory.

  2. Indexes are harness inputs. docs/topics.md, docs/SUMMARY.md, proposal indexes, backlog pointers, and run-target tables are not cosmetic; they are how agents find the right context.

  3. Status must be checkable. Proposal status, supersession, implementation status, selected milestone, and review findings should fail checks when they drift.

  4. Proofs need names and ownership. A QEMU harness target should say what it proves, which manifest it uses, which proposal/backlog owns it, and what transcript shape is expected.

  5. Compiled knowledge is non-authoritative until reviewed. Agent-generated wiki pages can help navigation, but proposals, architecture docs, schemas, code, and review findings remain authoritative.

  6. Prefer generation over duplicate hand-maintained state. When possible, sidecars and indexes should be generated from front matter, Makefile metadata, manifests, or explicit source files.

  7. Expose replacement paths. If a proposal is superseded, an agent should see the replacement before acting on stale text.

  8. Make unsafe shortcuts hard. The harness should steer agents away from main-worktree edits, stale branches, missing review, unverified QEMU claims, and undocumented design pivots.

  9. Agents must know when they are not alone. Shared resources such as git branches, worktrees, docs indexes, task lists, generated files, and review queues need visible ownership, lease, and version state before agents mutate them.

Proposed Artifacts

docs/agent-harness.md

A concise entry point for future agents. It should answer:

  • where current project state lives;
  • how to choose a task;
  • how to create a compliant worktree;
  • how to find relevant proposals, backlog, research, and review findings;
  • how to choose checks;
  • how to handle docs/status updates;
  • how to hand off verification and review.

This file should link to authoritative docs rather than duplicate them. It is a map, not a new policy source.

docs/run-targets.md

Generated or maintained inventory of run/check targets:

TargetManifestPurposeExpected proofOwner
make run-session-contextsystem-session-context.cueone immutable session context proofhostile second-session attempts fail closedsession-bound invocation context
make run-chatsystem-chat.cueresident chat service proofsession-scoped chat transcriptchat/shared-service proposal

The table should cover make run-*, make qemu-*, docs checks, generated-code checks, and security checks. Agents should not infer target meaning from target names alone.

Active Work Registry

Add a small generated or reviewed active-work registry for concurrent agents. It should be derived from git worktrees where possible and supplemented by task metadata:

TaskBranchWorktreeClaimed resourcesModeExpiresStatus
example-session-modelfeat/session-model-proof<worktree-root>/session-model-proofsrc/capos/service.rs, docs/proposals/session-context.mdexclusive source, shared docs2026-05-01checking

The registry is not a replacement for git or human review. It is a harness surface for “another agent is already touching this shared resource.” The row above is synthetic sample data, not live project state.

The same registry should also feed the daily development-performance report defined in capOS Agentic Development Experiment. Git can explain what merged, but the registry explains live ownership, intended role, claimed resource surface, and whether a task was implementation, review, verification, recovery, or metrics processing.

Minimum fields:

  • task or issue id;
  • owner identity or runner id;
  • actor class when known: claude, codex, human/manual, mixed, or unknown;
  • role: implementation, review, planning/design, verification, recovery/integration, or recap/metrics processing;
  • attribution confidence: direct, corroborated, inferred, or unknown;
  • branch and worktree path;
  • claimed paths, subsystems, generated outputs, todo items, or review queues;
  • exclusive/shared mode;
  • observed base revision;
  • lease expiry and renewal time;
  • status: planning, editing, checking, review, merge, blocked, abandoned.

Rows should keep attribution confidence explicit. A direct session id, commit trailer, or operator-created row is higher-confidence than timestamp overlap. Low-confidence rows should stay unknown or mixed rather than assigning work to a specific tool.

For the current repo workflow, this would make the existing worktree policy queryable. For a future capOS-hosted swarm, the same shape becomes a SharedResource/ResourceLease service: git repos, shared todo items, wiki pages, generated docs, and merge queues all get visible claims and versioned writes.

Proposal Relationship Metadata

Add or standardize front matter fields:

status: "Future design. No implementation."
last_reviewed: "2026-04-28 00:00 UTC"
supersedes:
  - old-proposal.md
superseded_by: new-proposal.md
implemented_by:
  - commit-or-target
owned_backlog: docs/backlog/example.md
proof_targets:
  - make run-example

The exact schema can be narrower at first. The important requirement is that replacement and proof relationships become queryable.

Design Pivot Records

Add short ADR-style files under docs/decisions/ for high-impact pivots:

  • endpoint badges as service identity rejected;
  • service-object capabilities superseded by session-bound invocation context;
  • SSH work paused behind session-bound invocation context;
  • hosted agents split from shell agent mode.

Each record should state context, decision, consequences, superseded docs, and current replacement docs.

docs/agent-wiki/

A generated or agent-maintained compiled knowledge tree:

  • index.md: current topic map;
  • capability-model.md: current “interface is permission” model;
  • session-model.md: implemented session-bound invocation context summary;
  • shell-and-remote-access.md: shell, Telnet, SSH, WebShellGateway status;
  • qemu-proofs.md: proof target summaries;
  • open-findings.md: current review findings summarized with links.

This tree must be clearly labeled as compiled navigation, not authority. It can be hidden from public docs until reviewed.

Agent Evals

Add deterministic repository-workflow evals:

  • identify selected milestone from docs/tasks/state.toml;
  • find the relevant backlog and proposal;
  • reject editing the main worktree;
  • detect another active worker claiming the same exclusive path or generated output;
  • choose a non-overlapping task or wait when a shared resource is already leased;
  • identify required checks for a doc-only proposal change;
  • detect a superseded proposal and follow replacement;
  • update proposal index and summary when adding a proposal;
  • avoid claiming full tests passed when only docs built;
  • surface open review-finding task records before unrelated feature work.

These evals can start as scripted fixtures. They do not need live model calls.

Mechanical Checks

Extend existing documentation tooling to check:

  • every proposal in docs/proposals/ is present in docs/proposals/index.md or an explicit archive section;
  • every proposal linked in docs/SUMMARY.md exists;
  • every proposal with topics appears in docs/topics.md after generation;
  • superseded_by points to an existing file;
  • superseded proposals display a replacement link near the top;
  • selected milestone in docs/tasks/state.toml has matching docs/tasks/README.md / backlog orientation;
  • run-target inventory entries point to existing Make targets and manifests;
  • research-backed proposals link at least one docs/research/*.md note;
  • external source snapshots in research notes include a review date;
  • QEMU proof claims name a target;
  • active-work registry entries point to existing branches/worktrees when local;
  • no two active registry entries claim the same exclusive resource unless one is marked blocked, abandoned, or waiting for merge;
  • daily metrics rows that cite an active-work entry use a known actor class, role, and confidence label.

These checks should start warning-only if needed, then become required once the metadata is in place.

The harness checks above stop at proposal/index/run-target/active-work hygiene. They are deliberately not a substitute for the security review process. Trust-boundary review, threat-model refresh, per-boundary CWE/CAPEC tagging, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry live in Security Review and Formal Verification. When a harness check (for example “proof claim names a target” or “active-work registry attributes a generated output”) touches trust-boundary or supply-chain authority, it must route the finding to the matching security verification track or design-risks register row rather than absorb the authority claim into agent-facing harness metadata.

Workflow Impact

For agents:

  • start at docs/agent-harness.md;
  • read selected milestone state through stable headings or generated sidecar;
  • inspect active-work/resource claims before choosing or mutating shared files;
  • follow proposal relationship metadata to avoid stale design;
  • choose checks from run-target inventory;
  • update docs/status through mechanically checked indexes;
  • hand off with proof target names and transcript artifacts.

For humans:

  • less repeated explanation of repo rules;
  • easier review of whether an agent chose the right context;
  • clearer detection of stale docs;
  • explicit locations for “why did we change direction?” records.

Implementation Phases

Phase 1 - Map and Inventory

  • Add docs/agent-harness.md.
  • Add initial docs/run-targets.md by hand for major run targets.
  • Link both from docs/SUMMARY.md, docs/topics.md, and README.md.
  • Add a short section in docs/tasks/README.md pointing future agents to the harness map.

Phase 2 - Metadata and Checks

  • Standardize front matter for proposals and research notes.
  • Extend mdBook metadata tooling to validate proposal index, topic membership, summary links, status fields, and supersession links.
  • Add run-target inventory validation against Makefile and manifest paths.

Phase 3 - Decision Records

  • Add docs/decisions/ and initial pivot records for the session-bound invocation context change and hosted-agent split.
  • Link decisions from affected proposals and backlog files.

Phase 4 - Compiled Agent Wiki

  • Create a reviewed docs/agent-wiki/ seed for the current selected milestone.
  • Add lint for stale links, missing citations, and “compiled, not authority” labels.
  • Decide whether generated wiki pages are published in mdBook or kept as repo-internal harness files.

Phase 5 - Agent Workflow Evals

  • Add fixtures and scripts for repository-workflow evals.
  • Run them in a docs/check target.
  • Use failures to improve docs/agent-harness.md, metadata, and run-target inventory.

Open Questions

  • Should proposal relationship metadata live only in front matter, or should there be a generated JSON sidecar for fast agent/tool consumption?
  • Should docs/agent-wiki/ be generated on demand or checked in after review?
  • How much QEMU transcript output should be retained as proof artifacts without bloating the repository?
  • Should run-target metadata live in Makefile comments, a CUE file, or docs/run-targets.md front matter blocks?
  • How strict should the first status linter be, given existing historical docs?
  • Should agent evals be part of make docs, a separate make agent-harness-check, or a broader make check?

Relationship to Existing Documents

  • Hosted agent harnesses research records the external harness research and the initial checklist.
  • capOS-Hosted Agent Swarms uses this repo harness as precedent for future capOS-hosted agents.
  • mdBook Documentation Site owns public docs structure and status vocabulary; this proposal adds agent-legibility and mechanical checks on top.
  • Trusted Build Inputs is the source of truth for toolchain pinning, generated-code drift, dependency policy, Limine binary pinning, observed-only QEMU/OVMF surface, and host-tool inventory. The harness run-target inventory, proof-target metadata, and generated-output active-work claims in this proposal must cite the relevant row there rather than re-derive trust status.
  • Security Review and Formal Verification owns the trust-boundary model, per-boundary CWE/CAPEC checklist, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry. Harness mechanical checks must hand security- bearing findings to that proposal’s tracks rather than redefine review authority.
  • Design Risks and Open Questions Register is the consolidated index of long-horizon design risks and open architectural questions. New harness-surfaced risks should be filed against existing rows there (for example R13 for supply-chain pinning gaps) or added as new rows, not buried in harness artifacts.
  • CLAUDE.md, AGENTS.md, docs/tasks/README.md, and the task ledger remain authoritative workflow inputs. docs/agent-harness.md should route to them, not replace them.

Proposal: capOS Agentic Development Experiment

This proposal treats capOS development as a longitudinal field experiment in agentic software engineering. The experiment studies whether persistent coding agents, subagents, review agents, recovery routines, and session-recap tooling can make sustained progress on a nontrivial operating-system project while preserving engineering quality, reviewability, and coordination safety.

The core question is not whether an AI can produce isolated code changes. The stronger question is whether an agentic workflow can maintain a coherent project over many sessions, interruptions, branches, reviews, and handoffs, and which process controls keep that workflow reliable.

This proposal studies the development-time workflow that produced capOS, not the in-system agent runtime that capOS itself targets. The capability-served language-model, embedder, and agent-runner surface lives in Language Models and Agent Runtime; that proposal is the authority on tool-use loops, per-tool permission modes, and how a future agentic capOS user surface holds model authority. The experiment described here uses external Claude and Codex sessions running against the repo, and records observations about their behaviour for later analysis.

Motivation

capOS is a useful setting because it is systems software with real correctness constraints: kernel behavior, capability discipline, QEMU evidence, generated schemas, docs, reviews, and integration rules all matter. It is a stronger testbed than toy programming tasks because the work has long dependency chains and observable integration gates.

The immediate practical need is session memory. Raw ~/.codex and ~/.claude logs contain the evidence, but they are too large and operationally noisy for routine recovery or research analysis. The recap tooling creates a derived evidence layer: structured metadata, compact evidence packets, plain-text summaries, parent/child session graphs, and freshness tracking.

Research Questions

  1. Can agentic development produce sustained, reviewable progress on capOS across many sessions and subagents?
  2. Which controls reduce coordination failures such as stale ownership, duplicate work, unsafe branch cleanup, live-process confusion, and review drift?
  3. How should parent sessions and subagent sessions be summarized so project history remains useful without recursively flooding the recap system?
  4. How reliable are LLM-generated factual recaps when grounded in compact evidence packets rather than full transcripts?
  5. What failure modes remain visible after adding stronger evidence fields, prompt examples, routing rules, and summary comparison snapshots?

Hypotheses

  • Dedicated worktrees, explicit ownership rules, and mandatory review gates reduce destructive interference between concurrent agents.
  • Root-session summaries plus compact child-session evidence are more useful than treating every subagent as an independent top-level recap by default.
  • Small summarizer models can handle simple review sessions when given exact paths, strict output scope, and good/bad examples, but routine/recovery and child-heavy parent sessions need stronger models.
  • Derived recaps can support research and operations if treated as coded observations, while raw transcripts remain the authority for audits.
  • Iterative prompt and evidence changes can measurably reduce recap defects such as bootstrap-boilerplate summaries, queue-processing self-references, and “limited evidence” outputs.

Experimental Setting

The setting spans more than one development machine. Session identity therefore needs an explicit source-machine dimension: a session captured through one machine but originating on another must remain attributed to the originating machine in raw manifests and derived data.

Observed source classes:

  • Claude transcripts under ~/.claude/projects/.../*.jsonl.
  • Claude live metadata under ~/.claude/sessions/*.json.
  • Codex thread metadata in ~/.codex/state_5.sqlite.
  • Codex parent/child relationships in thread_spawn_edges.
  • Codex rollout transcripts under ~/.codex/sessions/YYYY/MM/DD/.
  • Git branch, worktree, commit, review, and check evidence from this repo.

Raw collection keeps source_host separate from capture_host. A central machine may perform the capture, but the manifest records where each source file originated.

An initial private pilot inventory found a large child-session skew: most Codex sessions were spawned subagents rather than root sessions. This motivates the default policy of indexing every session while queuing only primary/root sessions for standalone summaries.

Tooling

The repo-tracked tools live under tools/agent-session-recaps/:

  • maintain_recap_store.py inventories local Claude/Codex sessions, writes script-owned metadata and evidence JSON, maintains summary queues, maps live PIDs conservatively, and ingests LLM-owned summary.txt freshness metadata.
  • archive_raw_sessions.py snapshots raw session sources with host provenance, checksums, compression, optional project filtering, and optional upload to private object storage.

The default derived recap store remains outside the repo:

  • ~/ai-session-recaps/index.json
  • ~/ai-session-recaps/by-session/{tool}/{session_id}/meta.json
  • ~/ai-session-recaps/by-session/{tool}/{session_id}/evidence.json
  • ~/ai-session-recaps/by-session/{tool}/{session_id}/summary.txt
  • ~/ai-session-recaps/by-session/{tool}/{session_id}/summary.meta.json
  • ~/ai-session-recaps/queue/*.json

Important design choices:

  • Summary prose lives only in summary.txt.
  • JSON files remain script-owned metadata/evidence/freshness files.
  • The index tracks source updated_at timestamps for staleness.
  • Parent/root sessions are queued by default.
  • Spawned child sessions remain indexed and linked, but are not queued by default.
  • Parent evidence includes compact child-session evidence so root summaries can include meaningful subagent outcomes.
  • Codex task_complete.last_agent_message is extracted to improve final review and implementation verdicts.
  • Live Claude/Codex PIDs are mapped conservatively using /proc, Claude procStart, Codex wrapper/native process relationships, and explicit Codex resume evidence when available.

Data Products

The experiment distinguishes four layers:

  1. Raw logs: private source of truth.
  2. Evidence packets: compact redacted excerpts, metadata, child-session packets, and command/check summaries.
  3. LLM summaries: qualitative coded observations, not ground truth.
  4. Analysis snapshots: immutable comparison runs that evaluate prompt and evidence changes.

Daily development-performance reports are analysis snapshots. They combine git, worktree, check, review, and session evidence for a bounded reporting window. They are not raw logs and should not contain private prompts, unredacted transcripts, local credentials, or unrelated operator context.

Raw transcripts should not be committed to the public source history. Evidence packets and summaries may be committed only after redaction policy and privacy review. Tooling, schemas, prompts, synthetic examples, and methodology docs can be tracked first.

Raw Evidence Archival

The recap store is derived data; it is not enough for auditability. Raw session sources should be archived separately, with checksums and a manifest that lets a later analysis reproduce which transcript version produced each evidence packet and summary.

Preferred raw archive design:

  • Use private object storage, such as a locked-down GCS bucket, as the default archive for raw session logs.
  • Store compressed snapshots by capture time and source host, for example:
gs://<private-bucket>/capos-agentic-dev/raw-sessions/YYYY/MM/DD/<snapshot-id>/
  manifest.json
  sha256sums.txt
  hosts/primary-dev/.codex/sessions/....jsonl.zst
  hosts/primary-dev/.claude/projects/....jsonl.zst
  hosts/portable-dev/.codex/sessions/....jsonl.zst
  hosts/portable-dev/.claude/projects/....jsonl.zst
  • Enable uniform bucket-level access, least-privilege IAM, lifecycle rules, and object versioning or retention if the bucket policy allows it.
  • Consider customer-managed encryption if the archive will contain sensitive prompts, private operational instructions, or source excerpts.
  • Store a manifest with source host, capture host, source path, archive object path, byte size, SHA-256, source mtime, capture timestamp, tool, session id when known, compression, and redaction status.
  • Keep the manifest path or archive snapshot id in derived recap metadata so summaries can be audited against the exact archived source.
  • Do not merge logs from one machine into another machine’s live ~/.codex or ~/.claude trees. Gather them into host-partitioned archives first, then import from that archive if the recap store is extended to multi-host analysis.

Project filtering matters on machines that contain unrelated Claude/Codex projects. archive_raw_sessions.py --project-root <path> selects only Codex rollouts whose threads.cwd is inside a selected project/worktree root, writes a filtered Codex state JSON extract, and selects matching Claude project JSONL and session metadata. Full Codex SQLite state, global history, Codex logs DB, Claude tasks, and Claude file history are opt-in.

Git branch or Git LFS storage is useful only under tighter constraints:

  • A private Git LFS dataset branch can be convenient for small, curated, redacted, or synthetic fixtures.
  • Raw local session logs should not go into normal git history because they may contain private prompts, operational instructions, credentials accidentally pasted into chat, or unrelated user content.
  • Even private Git LFS is awkward for raw logs if later deletion or redaction is needed, because clones and LFS object stores can retain historical content.
  • If Git LFS is used, prefer a separate private data repository or an orphan data branch, never the normal capOS source branch.

Recommended split:

  • Git-tracked source repo: tooling, schemas, prompts, proposal, methodology, redaction scripts, synthetic examples.
  • Private object storage: exact source session JSON/JSONL and SQLite snapshots.
  • Optional private Git LFS dataset: curated redacted snapshots used by paper reviewers.
  • Public artifact, if any: synthetic fixtures plus aggregate metrics and selected redacted examples.

Pilot Results

An initial private pilot processed a small queue of current summaries and then reran a target set after prompt/evidence changes.

Baseline result:

  • Current summaries: 53.
  • Bad queue/meta/evidence self-reference markers: 0.
  • “Limited evidence” summaries: 7.
  • Child-heavy current summaries: 3.

First intervention:

  • Added prompt good/bad examples.
  • Added compact child-session evidence for parent Codex sessions.
  • Dereferenced child recap-worker output files so parent summaries see summary text, not only completion paths.

Second intervention:

  • Added Codex task_complete.last_agent_message extraction.
  • Reran the remaining limited-evidence summaries.

Combined candidate result:

  • Candidate summaries: 53.
  • Bad self-reference markers: 0.
  • Limited-evidence summaries: 0.
  • Average baseline summary length: 1221.6 characters.
  • Average candidate summary length: 1060.6 characters.

These results support the claim that prompt examples help, but evidence shape matters more. The weak summaries were not only a prompt problem; they lacked the right final-result evidence.

Methodology

Collection

Run the recap maintainer periodically or after major work bursts. Each run should:

  • refresh metadata for all sessions;
  • update evidence for recent primary/root sessions;
  • preserve child-session graph information;
  • update live-process mappings;
  • queue only stale or missing summaries;
  • record immutable analysis snapshots before major prompt/evidence changes.

Summarization

Use model routing:

  • gpt-5.3-codex-spark for simple, concrete, non-routine sessions.
  • A stronger model for routine/recovery sessions, live sessions, and child-heavy parent sessions.

Keep small-model tasks concrete:

  • one queue item or a very small batch;
  • exact paths;
  • no JSON output;
  • no broad filesystem exploration;
  • good/bad examples in the prompt;
  • strict instruction to summarize target-session outcomes, not the queue-processing task.

Metrics

Hard metrics:

  • session count by tool, primary/child/root role, model, and live state;
  • queue size and stale/current summary count;
  • child-session count per root;
  • number of review findings, no-finding reviews, failed checks, and passed checks;
  • branch/worktree lifecycle events: created, committed, reviewed, merged, pushed, parked, abandoned;
  • recovery-session frequency and duration;
  • recap quality markers: self-reference markers, limited-evidence phrases, missing final verdicts, excessive bootstrap boilerplate.

Qualitative coding:

  • coordination failures;
  • evidence gaps;
  • useful controls;
  • subagent summarization failures;
  • review-loop behavior;
  • human intervention points.

Daily Development Metrics

The daily report answers a narrower operational question than the recap store: what project progress happened during the reporting window, how strongly was it validated, and which agent/human channels contributed to it. It should keep project-performance metrics separate from attribution. Raw commits or lines of code are activity signals, not performance by themselves.

Use a fixed window and record it in the report. UTC calendar days are the default for cross-machine comparison; a local workday boundary may be used only when the report records the chosen day_start hour. The collector should derive base and tip commits from the window and report both raw and normalized git stats.

Normalize diff metrics by separating generated and vendored churn:

  • raw commit count and non-merge commit count;
  • first-parent merged task branches;
  • raw file and line stats;
  • authored file and line stats excluding vendor/** and tools/generated/**;
  • optional secondary exclusions for lockfiles and generated demo content;
  • top-level directory and subsystem breakdown;
  • schema changes and generated-code regeneration as distinct rows.

Project-progress metrics:

  • reviewed task slices merged;
  • selected-milestone gates closed;
  • task records closed under docs/tasks/done/;
  • review-finding task records opened, closed, or carried forward;
  • blockers retired and blockers still open;
  • new capability, schema, runtime, demo, manifest, or QEMU proof surfaces;
  • checks and QEMU targets recorded as passed, failed, skipped, or flaky;
  • review iterations and review finding severity;
  • rework after review or after merge.

Validation metrics should be evidence-first. A report may say a check was recorded only when it can point to a session evidence packet, saved log, commit message, or local check database entry. It should not convert a conversational claim into “passed” without a corroborating artifact. Flakes should be recorded separately from deterministic failures.

Attribution metrics are secondary accounting. Attribute by task slice and role, not by raw line count. The report should allow at least these actor classes: claude, codex, human/manual, mixed, and unknown. A commit trailer, session evidence, or active-work registry row can support attribution, but timestamp overlap alone is low-confidence and should remain unknown unless corroborated.

Split roles explicitly:

  • implementation;
  • review;
  • planning/design;
  • verification/check running;
  • recovery/integration;
  • recap/metrics processing.

The Claude/Codex split should be reported as a matrix of actor class by role, with counts of task slices, sessions, review findings, checks, and merged commits where known. It should not rank agents by total commits or authored lines because generated code, vendored dependencies, docs refreshes, and review work distort that comparison.

Recommended daily report sections:

  1. Executive summary: visible progress, evidence gates closed, blockers retired, and blockers still open.
  2. Git metrics: raw commits, non-merge commits, merged task branches, normalized diff stats, generated/vendor churn.
  3. Area breakdown: kernel, schema, runtime, demos, tools, docs, and plans.
  4. Evidence and validation: checks, QEMU proof targets, flakes, skipped gates, and missing gates.
  5. Review and rework: review iterations, findings opened/closed, severity, and post-review or post-merge rework.
  6. Claude/Codex/human split: role-based attribution with confidence labels.
  7. Planning state: selected milestone, active high-priority tasks, closed plan items, stale blockers, and next credible gates.

The active-work registry proposed by capOS Repository Harness Engineering is the preferred source for live task ownership, claimed resources, and role labels. Git remains the authority for merged history; raw session archives remain the authority for auditing derived summaries.

Validation

Treat summaries as coded observations. Validate claims against raw logs, git history, and checks before using them as paper evidence. The capOS review and verification regime described in Security and Verification is the authority on what counts as a closed review gate, what counts as a deterministic check versus a flake, and how trust boundaries are documented. The recap store and daily report cite those gates rather than redefining them: a summary may record that a check passed only when the evidence packet, saved log, or commit trailer matches one of the named gates in that proposal.

Use audits:

  • sample raw transcript lines for selected summaries;
  • verify cited commits and branches;
  • verify check outcomes in logs;
  • compare parent summaries against child-session final results;
  • rerun summaries after prompt/evidence changes and compare snapshots;
  • compare daily report attribution against commit trailers, session evidence, and active-work registry rows;
  • sample normalized diff calculations to ensure generated and vendored files are not counted as authored development volume.

Threats To Validity

  • Single-project bias: capOS is one project with one workflow.
  • Model/version drift: model behavior and Codex/Claude log schemas may change.
  • Observer effect: improving prompts and processes changes the system being studied.
  • LLM-coded summaries can omit or distort details.
  • Raw logs may contain private operational data, limiting public reproducibility.
  • Agent behavior is affected by local instructions, model routing, and tool availability.

Paper Outline

  1. Introduction: why long-running agentic development is different from single-prompt code generation.
  2. Background: capOS, worktrees, review gates, Codex/Claude sessions.
  3. System design: recap instrumentation, evidence packets, child-session graph, model routing, live-process mapping.
  4. Methodology: longitudinal observation, metrics, prompt/evidence interventions, audit strategy.
  5. Pilot findings: session scale, child-session dominance, failure modes, recap improvement loop.
  6. Case studies:
    • recovery session after interruption;
    • child-heavy device-driver-foundation work session;
    • repeated review loop;
    • recap prompt/evidence refinement.
  7. Discussion: what worked, what remained brittle, implications for agentic software engineering.
  8. Limitations and future work.

Immediate Next Steps

  1. Add schema documentation and a privacy/redaction README.
  2. Add repeatable analysis scripts for baseline/rerun comparison.
  3. Add a daily metrics collector that joins git, recap evidence, active-work rows, check artifacts, and review findings into the report sections above.
  4. Add a small synthetic fixture set that exercises:
    • root session with children;
    • recap-worker child returning only a path;
    • review session with task_complete;
    • recovery session with bootstrap boilerplate.
  5. Decide whether generated summaries should be tracked privately, exported as redacted snapshots, or kept only as local research data.

Proposal: Symmetric Multi-Processing (SMP)

How capOS goes from single-CPU execution to utilizing all available processors.

The SMP substrate is one half of capOS’s multicore story; scheduler policy above it is the other half, and they advance through coupled gates. Read this proposal together with:

  • Scheduler Evolution – Phase D (per-CPU WFQ, bounded stealing) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress, the first automatic nohz activation increment closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed; generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work; Phase F.5 (full-SMP 16/32-core scalability planning) is the named gate for the milestone described below in Full-SMP Scalability Milestone and remains planning, not closed.
  • In-Process Threading Contract – thread-owned execution state, generation-checked ThreadRef queues and wake records, per-thread ring mappings, and the recorded same-process 1-to-2 / diagnostic 1-to-4 evidence rows that this proposal’s scalability work must keep honoring.
  • Design Risks Register, Q9 – CPU accounting and scheduling contexts – partial-status answer that covers per-CPU WFQ, Phase E SchedulingContext, and the cross-service donation / nohz activation / isolation lease / cross-principal fairness work still open.
  • Ring v2 For Full SMP – per-thread ring endpoints and cap_enter-on-thread-CQ are the dispatch contract this proposal’s scheduler-ownership milestones rely on.
  • SMP Phase C backlog – decomposed task list for the in-progress Phase C work tracked below.

The migrated task kernel-upper-half-pml4-propagation-hardening carries the Phase C residual for kernel upper-half page-table mutation after AP startup. The retained finding is closed for the current kernel MMIO/firmware helper path: paging::init() pre-seeds the helper’s upper-half PML4 slot, AddressSpace::new_user clones upper-half entries from the synchronized kernel root under the kernel page-table lock, and map_kernel_physical_range rejects any attempt to create a previously absent kernel-half PML4 slot after a user address space has been created. User-side AddressSpace::{map,unmap,protect} remains shootdown-aware against resident CPU masks; kernel upper-half edits inside pre-existing slots use the kernel-wide shootdown path. Future helper windows or allocator-growth paths that would require a new upper-half PML4 slot must pre-seed that slot before user address-space creation or add synchronized active propagation into live address spaces.

This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).

Current status: Phase A’s BSP per-CPU foundation and Phase B AP startup are complete. Phase C has completed syscall GS migration, LAPIC/IPI, TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing, and bounded idle-to-runnable wake targeting for queued and direct-IPC wakeups. The current scheduler is no longer the temporary single-global-runnable-queue shape from the 2026-05-02 collapse. Remaining SMP risks are the shared scheduler lock, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, and higher-thread-count evidence. The next SMP product-level milestone should be full-SMP scalability evidence on a real 16/32-core environment, with QEMU kept for boot and regression coverage rather than as the primary performance source.

Implementation checkpoint: the BSP now has a concrete PerCpu object with stable syscall-stack offsets, and syscall entry uses KernelGsBase/swapgs to reach the per-CPU kernel RSP and saved user RSP slots. The scheduler mirrors its current ThreadRef into the BSP record.

Second checkpoint: runtime stack switches now flow through percpu::set_kernel_entry_stack, which updates the BSP PerCpu.kernel_rsp slot and the BSP TSS.RSP0 together. Scheduler and interrupt paths no longer coordinate those two updates by calling separate GDT and syscall helpers.

Third checkpoint: kernel/src/arch/x86_64/smp.rs now issues the Limine MpRequest, enumerates non-BSP CPUs, allocates AP-local PerCpu records and kernel/IST stack storage, and records dense capOS CPU ids separately from Limine processor and LAPIC ids.

Fourth checkpoint: APs now start through MpInfo::bootstrap() and reach a parked kernel idle loop. The BSP passes an AP record pointer through Limine extra_argument, waits for a bounded online count, and remains the only CPU that schedules userspace. Each AP loads AP-owned GDT/TSS state, the shared IDT, KernelGsBase, and syscall MSRs, reports online, disables interrupts, and parks in hlt. Review tightened this checkpoint so APs first switch from Limine handoff state to the capOS kernel PML4 and AP-owned kernel stack before any online signal.

Fifth checkpoint: syscall entry/exit now runs with kernel GS active between entry and return. Normal returns swap back before sysretq, and blocking or exiting syscall paths that leave through scheduler iretq restore use a dedicated trampoline to swap GS back before restoring the next user context.

Sixth checkpoint: the BSP now enables xAPIC MMIO, maps the LAPIC page through the kernel MMIO allocator, calibrates the LAPIC timer initial count against PIT channel 2, runs scheduler ticks through LAPIC timer vector 48 with LAPIC EOI, installs the LAPIC spurious vector, and masks the legacy PIC once LAPIC ticks are active. Parked APs initialize local APIC state before reporting online. IDT vector 49 and a bounded vector-49-only fixed IPI send primitive back TLB shootdown and bounded idle-to-runnable reschedule requests.

Seventh checkpoint: user page-table map, unmap, and protect now flush the local CPU and then route through a serialized vector-49 TLB shootdown helper using each AddressSpace’s resident CPU mask. The helper records pending full-TLB flush generations and sends vector-49 IPIs to online resident CPUs other than the caller, then returns a completion token that callers wait after dropping ring dispatch locks. Scheduler CR3 handoff points mark the selected address space resident on the current CPU.

Eighth checkpoint: scheduler current-thread state is split into per-CPU slots, AP PerCpu records are registered for current-thread and kernel-entry stack updates, AP TSS.RSP0 is updated during context switches, and AP cpu=1 can enter the scheduler from the AP idle loop when its LAPIC timer is available. The first AP proof intentionally keeps one scheduler owner: when AP cpu=1 is online with a programmed timer, the BSP remains in kernel idle so the process-wide capability ring is not executed concurrently. The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC). “Kernel idle” throughout this proposal refers to that per-CPU CPL0 idle thread, not a user-mode idle process.

Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.

Phase B completion: AP startup is implemented and reviewed. The private process-buffer validate_user_buffer TOCTOU blocker is closed for single locked copy/read paths, and Phase A now has the BSP running through concrete per-CPU syscall-stack/current-thread state. TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock contention, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, higher-thread-count evidence, and shared SharedParkSpace park key derivation remain later Stage 7 work. Shared keys still need MemoryObject mapping provenance or object pins before they can keep backing stable beyond one address-space-locked access.


Full-SMP Scalability Milestone

The current SMP evidence reaches four physical-core workers and one eight-logical-CPU SMT run under QEMU/KVM. That was enough to expose scheduler structure problems, but it is not the shape that should define whether capOS really uses modern multicore machines. The next SMP milestone should answer a more concrete question: can ordinary capOS workloads keep useful throughput and bounded scheduler overhead as the machine scales to 16 and 32 physical cores?

Preferred evidence environment:

  • direct capOS boot on a dedicated bare-metal or cloud bare-metal/perf-runner machine with at least 16 physical cores, and a 32-core row when hardware is available;
  • recorded CPU topology, SMT state, APIC mode, timer source, frequency policy, memory size, firmware/device model, source commit, toolchain, and kernel configuration;
  • Linux native baselines on the same machine for comparable CPU workloads;
  • QEMU/KVM rows only for boot/regression continuity or for explicitly labeled virtualized comparisons.

Workload coverage should move beyond one fixed checksum row:

  • static map/reduce checksum over equal byte ranges;
  • uneven dynamic task pool with deterministic task ids and result hash;
  • barrier-heavy phase loop that exposes wakeup and cross-CPU coordination cost;
  • same-process thread workload and independent-process workload;
  • IPC/service-bound worker workload that includes capability calls outside the timed compute loop.

Each workload should report 1, 2, 4, 8, 16, and 32-worker rows when the hardware supports those counts, with SMT rows separated from physical-core rows. Each row should include both work-window time and total time, run count, warmup policy, median, variance, and verifier output. The report should show speedup and efficiency curves instead of reducing the result to one boolean threshold.

Implementation work expected before this milestone:

  • replace the temporary scheduler CPU mask and static four-owner assumptions with discovered CPU topology and dynamic per-CPU scheduler structures;
  • decide xAPIC versus x2APIC backend selection for larger APIC-id spaces;
  • split or otherwise shrink the shared scheduler-lock critical sections that still serialize queue selection, wakeups, blocking, and cleanup;
  • make placement topology-aware enough to distinguish physical cores, SMT siblings, and later NUMA/cache groups;
  • keep TLB shootdown, timer, reschedule-IPI, cleanup, and accounting costs observable per CPU and per workload phase;
  • keep per-thread ring ownership and SQ-consumer ownership generation-checked as CPU count rises.

This milestone belongs with scheduler evolution and benchmark planning rather than a new standalone proposal: the SMP proposal defines the CPU substrate, Scheduler Evolution Phase F.5 defines dispatch and policy work for full-SMP 16/32-core scalability, the benchmark proposal defines artifact shape, and the HPC parallel-pattern proposal defines the workload matrix. Q9 in the design risks register is the matching open-question entry: base CPU accounting and scheduling-context authority through Phase E are implemented, while cross-service donation, full nohz activation, CPU isolation leases, and cross-principal fairness are the named follow-ons that this milestone’s evidence will be evaluated against.

Current State

APs can boot into kernel idle loops, and CPUs 0-3 can temporarily own scheduler/user work when their LAPIC timers are available. Specific assumptions that Phase C must still remove:

ComponentFileAssumption
Syscall stack switchingkernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/percpu.rsSyscall entry/exit uses KernelGsBase/swapgs and GS-relative PerCpu stack fields on the running CPU
AP GDT, TSS, kernel stackskernel/src/arch/x86_64/gdt.rs, kernel/src/arch/x86_64/smp.rsAP-local descriptor tables and stacks exist, and AP TSS.RSP0 updates during AP scheduler context switches
IDTkernel/src/arch/x86_64/idt.rsSingle static IDT (shareable – IDT can be the same across CPUs)
SYSCALL MSRskernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/smp.rsSTAR/LSTAR/SFMASK/EFER are initialized on BSP and parked APs; BSP and AP startup both publish KernelGsBase
Current thread and run queueskernel/src/sched.rs, kernel/src/arch/x86_64/percpu.rsSCHEDULER owns per-CPU current slots, per-CPU WFQ runnable queues ordered by virtual_finish_ns, bounded stealing from sibling queues, and wake placement through WakePolicy::QueueCpu; queued and direct-IPC wakeups iterate eligible idle scheduler CPUs and wake the first that accepts a fresh reschedule IPI, and CPUs 0-3 can temporarily own scheduler/user execution when their LAPIC timers are available, while shared-lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred
Timer/IPI deliverykernel/src/arch/x86_64/context.rs, kernel/src/arch/x86_64/lapic.rs, kernel/src/arch/x86_64/pic.rs, kernel/src/arch/x86_64/pit.rs, kernel/src/arch/x86_64/tlb.rsCPUs 0-3 use PIT-calibrated LAPIC timer vector 48 with LAPIC EOI when online; vector 49 services TLB shootdown and bounded reschedule requests
Frame allocatorkernel/src/mem/frame.rsSingle global ALLOCATOR behind one spinlock
Heap allocatorkernel/src/mem/heap.rslinked_list_allocator behind one spinlock

The first checkpoint removed the separate syscall RSP globals and made the BSP PerCpu layout the owner of syscall stack state. The GS checkpoint now uses KernelGsBase/swapgs for those offsets on syscall paths. The LAPIC checkpoint removed the PIT/PIC interrupt dependency from the normal BSP scheduler tick, kept PIT channel 2 as the LAPIC calibration source, installed the spurious vector, and wired the IPI vector. The TLB checkpoint added resident CPU masks, vector-49 shootdown, pending generation counters, completion waits, and syscall-entry plus flush-before-user-return hooks for delayed maskable interrupt delivery. The AP scheduler-owner checkpoint added per-CPU current slots and AP cpu=1 scheduler entry. The remaining Phase C assumptions are in concurrent run-queue ownership and reschedule routing, not in syscall stack lookup, the primary timer source, user page-table mutation invalidation, or AP TSS updates.


Phase A: Per-CPU Foundation

Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.

Per-CPU Data Region

Each CPU needs a private data area accessible via the GS segment base. On x86_64, swapgs switches between user-mode GS (usually zero) and kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase MSR on each CPU during init.

The BSP checkpoint originally reached this layout as BSP_PER_CPU+offset from assembly. Phase C now uses the same offsets through GS after swapgs on syscall entry.

#![allow(unused)]
fn main() {
/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
    /// Self-pointer for accessing the struct from GS:0.
    self_ptr: *const PerCpu,
    /// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
    kernel_rsp: u64,
    /// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
    user_rsp: u64,
    /// Currently running thread on this CPU, if one is active.
    current_thread: Option<ThreadRef>,
    /// CPU index (0 = BSP).
    cpu_id: u32,
    /// LAPIC ID (from Limine MP info or CPUID).
    lapic_id: u32,
}
}

The previous checkpointed syscall entry stub used the same offsets via the BSP symbol:

movq %rsp, BSP_PER_CPU+16(%rip) ; PerCpu.user_rsp
movq BSP_PER_CPU+8(%rip), %rsp  ; PerCpu.kernel_rsp

The current syscall entry stub uses GS-relative addressing:

swapgs
movq %rsp, %gs:16          ; PerCpu.user_rsp
movq %gs:8, %rsp           ; PerCpu.kernel_rsp

And symmetrically on return:

movq %gs:16, %rsp          ; restore user RSP
swapgs
sysretq

Non-returning syscall paths need separate handling: exit, a blocking cap_enter, and a terminal ThreadControl.exitThread can leave the syscall entry path by building a CpuContext and restoring another thread with iretq. Those paths must restore user GS ownership before iretq, even though they never execute the normal sysretq epilogue.

Lock And Ownership Rules

PerCpu fields split by owner:

  • kernel_rsp and TSS.RSP0 are updated together through percpu::set_kernel_entry_stack.
  • user_rsp is written only by syscall entry assembly and read only while constructing a blocked-syscall CpuContext.
  • current_thread mirrors Scheduler.current; the scheduler lock remains the authority for choosing and validating the current thread.
  • cpu_id and lapic_id are immutable after CPU initialization.

Phase A keeps the global scheduler lock and process table. The PerCpu current field is not a second scheduler authority; it is the per-CPU execution cache that Phase B will use when multiple CPUs stop sharing one current slot.

Per-CPU GDT, TSS, and Stacks

Each CPU needs its own:

  • GDT – the TSS descriptor encodes a physical pointer to the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
  • TSSprivilege_stack_table[0] (kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU.
  • Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
  • Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
#![allow(unused)]
fn main() {
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
    // Allocate kernel stack (4 pages) and double-fault stack (5 pages)
    let kernel_stack = alloc_stack(4);
    let df_stack = alloc_stack(5);

    // Create TSS with per-CPU stacks
    let mut tss = TaskStateSegment::new();
    tss.privilege_stack_table[0] = kernel_stack.top();
    tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();

    // Create GDT with this CPU's TSS
    let (gdt, selectors) = create_gdt(&tss);

    // Allocate and populate PerCpu struct
    let per_cpu = Box::leak(Box::new(PerCpu {
        self_ptr: core::ptr::null(),  // filled below
        kernel_rsp: kernel_stack.top().as_u64(),
        user_rsp: 0,
        current_thread: None,
        cpu_id,
        lapic_id,
    }));
    per_cpu.self_ptr = per_cpu as *const PerCpu;
    per_cpu
}
}

LAPIC Initialization

Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. AP startup must initialize enough local-APIC state for secondary CPUs to park in a kernel idle loop and for later IPIs. Migrating BSP preemption from PIT to LAPIC timer is still required before multi-CPU scheduling, since the PIT is a single shared device that cannot provide per-CPU timer interrupts. LAPIC work is needed for:

  • Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
  • IPI – inter-processor interrupts for TLB shootdown and AP startup
  • Spurious interrupt vector – must be configured per-CPU

2026-04-25 research decision: the immediate Phase C LAPIC/IPI foundation uses xAPIC MMIO, LAPIC timer vector 48, IPI vector 49, LAPIC EOI, AP LAPIC initialization, and PIT/PIC fallback. The grounding note x2APIC and APIC virtualization records the checked Intel and QEMU/KVM sources and keeps x2APIC as a later backend rather than a reason to rework the current LAPIC gate.

Crate Dependencies

CratePurposeno_std
manual xAPIC MMIO backendcurrent LAPIC timer, EOI, IPI, spurious vector foundationyes
future manual x2APIC MSR backend using x86_64 MSR accessnewer/high-core systems and firmware states where xAPIC is unavailable or undesirableyes

The current LAPIC path uses xAPIC MMIO through the kernel MMIO mapper. The later x2APIC backend should still be small and explicit rather than adding an APIC abstraction crate: read the APIC ID, enable x2APIC through IA32_APIC_BASE, program the spurious-vector register, local-vector timer, timer divide/initial-count registers, EOI, and ICR sends through MSRs. I/O APIC remains separate MMIO hardware discovered through ACPI MADT and belongs to the later interrupt-infrastructure/cloud path.

Migration Path

Phase A was a refactor of existing single-CPU code, not an addition:

  1. Add PerCpu struct, allocate one instance for BSP. Done for BSP static storage.
  2. Set BSP’s KernelGSBase MSR, add swapgs to syscall entry/exit. Done for syscall entry/exit, including syscall-to-iretq exits.
  3. Replace SYSCALL_KERNEL_RSP/SYSCALL_USER_RSP globals with per-CPU accesses. Done; syscall assembly uses GS-relative PerCpu offsets.
  4. Replace scheduler’s global SCHEDULER.current with PerCpu.current_thread. Partially done: the BSP per-CPU record mirrors Scheduler.current; the scheduler lock remains authoritative for current-thread and queue ownership until shared scheduler metadata is split further.
  5. Move GDT/TSS stack updates behind the per-CPU path. Done for the BSP runtime stack-update hook; AP-local GDT/TSS allocation belongs to Phase B.
  6. Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5). Done for the BSP timer path, with PIT used for calibration and PIT/PIC retained as a fallback.

After Phase A, the kernel still runs user work on one CPU but the BSP per-CPU plumbing is in place. Existing tests (make run-smoke and make run-spawn) continue to pass.


Phase B: AP Startup

Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.

2026-04-25 grounding checkpoint: the next implementation slice should use the current local limine crate’s MP API, not the older SmpRequest naming used in some protocol examples. In capOS’s pinned crate, limine::request::MpRequest returns architecture-specific limine::mp::MpRespData; x86_64 CPU records are limine::mp::MpInfo values with processor_id, lapic_id, MpInfo::bootstrap(entry, extra_arg), and MpInfo::extra_argument(). The Phase B implementation is split into two checkpoints: first enumerate CPUs, assign dense capOS CPU ids separately from Limine’s ACPI processor_id, and allocate AP state/stack slots; then bind each non-BSP CPU to a slot via extra_arg, start it with bootstrap, and park it in a kernel idle loop after local CPU initialization. Both checkpoints are implemented; APs still must not run userspace or mutate the global scheduler.

Limine MP Request

Limine provides an MP response with per-CPU records. Each x86_64 record contains an ACPI processor id, LAPIC ID, and an atomic boot handoff. In the local limine crate, callers should use MpInfo::bootstrap() rather than writing the raw goto_addr field directly.

#![allow(unused)]
fn main() {
use limine::request::MpRequest;

static MP_REQUEST: MpRequest = MpRequest::new(0);

fn start_aps() {
    let mp = MP_REQUEST.response().expect("no MP response");
    let mut next_cpu_id = 1;
    for cpu in mp.cpus() {
        if cpu.lapic_id == mp.bsp_lapic_id {
            continue; // skip BSP
        }
        let cpu_id = next_cpu_id;
        next_cpu_id += 1;
        record_boot_processor_id(cpu_id, cpu.processor_id);
        let ap = init_ap_record(cpu_id, cpu.processor_id, cpu.lapic_id);
        cpu.bootstrap(ap_entry, ap as *const ApCpu as u64);
    }
}
}

AP Entry

Each AP must:

  1. Switch to the capOS kernel PML4 and AP-owned kernel stack
  2. Enable per-CPU CR4 state used by the kernel page tables and user-access guards
  3. Load its per-CPU GDT and TSS
  4. Load the shared IDT
  5. Set KernelGSBase MSR to its PerCpu pointer
  6. Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
  7. Signal “ready” to BSP (atomic flag or counter)
  8. Enter a parked kernel idle loop

Local APIC timer setup and IPI handling remain separate Stage 7 gates; parked APs keep interrupts disabled until that work is ready.

#![allow(unused)]
fn main() {
/// AP entry point. Called by Limine with the MP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::mp::MpInfo) -> ! {
    let ap_ptr = info.extra_argument() as *const ApCpu;
    let ap = unsafe {
        ap_ptr
            .as_ref()
            .expect("Limine AP extra_argument must be an ApCpu pointer")
    };
    let per_cpu = ap.per_cpu();

    // Switch from Limine state to capOS-owned paging and AP stack.
    ap.switch_to_kernel_paging_and_stack();

    // Match per-CPU CR4 state after the kernel PML4 is live.
    paging::enable_global_pages_on_current_cpu();
    smap::init();

    // Load this CPU's GDT + TSS
    ap.descriptors.load();

    // Shared IDT (same across all CPUs)
    idt::init();

    // Set GS base for swapgs
    unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }

    // Configure syscall MSRs (same values as BSP)
    syscall::init_msrs();

    // Signal ready
    ap.online.store(true, Ordering::Release);
    AP_READY_COUNT.fetch_add(1, Ordering::AcqRel);

    // Park until a later scheduler milestone gives APs runnable work.
    ap_idle_loop();
}
}

The extra_argument pointer must name an initialized, non-null ApCpu record whose storage outlives the AP. The BSP publishes that record before calling MpInfo::bootstrap(), and the AP treats the contained PerCpu pointer as CPU-local state after entry.

Scheduler Boundary

Phase B does not extend the Stage 5 scheduler. The BSP remains the only CPU that runs userspace or mutates the global scheduler. APs only run enough kernel initialization to prove that per-CPU architectural state is valid, signal ready, and park in a bounded hlt loop.

Per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing that chooses the most-overdue runnable sibling candidate, bounded idle-to-runnable wake targeting that walks eligible idle scheduler CPUs, and address-space CPU residency tracking are the current Phase C structure. The temporary 2026-05-02 single-global-runnable-queue collapse is historical; Scheduler Evolution Phase D (closed 2026-05-10) reintroduced per-CPU queues with weighted fair ordering, and Phase E closed SchedulingContext bind/revoke, budget, donation/return, and depletion notification on top of that. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress, the first automatic nohz activation increment closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work. CPU affinity policy, shared scheduler metadata splitting, scheduler-driven AP idle policy, broader workload classes, higher-thread-count evidence, and the named Phase F.5 16/32-core scalability proof remain Phase C/F follow-ups. The first Phase C scheduler proof may continue to use the current process ring while the runtime serializes ring consumption. Full SMP where sibling threads from one process wait independently on different CPUs should use the Ring v2 direction in Ring v2 For Full SMP: cap_enter waits on the current thread’s CQ, not on a shared process CQ.

Boot Sequence

BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
  AP1: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  AP2: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
  ...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler

Phase C: SMP Correctness

With APs parked in kernel idle loops, Phase C makes user scheduling safe on more than one CPU. The order is:

  1. Move syscall entry/exit and per-CPU access to KernelGsBase/swapgs so APs do not use BSP-symbol-relative syscall stack fields. This includes non-sysretq paths that block or exit through scheduler iretq restore. Done for syscall stack fields and syscall-originated restore paths.
  2. Add LAPIC timer and IPI support so each CPU can take local scheduler ticks and receive cross-CPU requests. Done for PIT-calibrated BSP LAPIC ticks, parked-AP LAPIC initialization, spurious-vector handling, vector 49, a bounded vector-49-only fixed IPI send primitive, live TLB shootdown users, and bounded idle-to-runnable reschedule requests.
  3. Add TLB shootdown before any user address space can run on more than one CPU over its lifetime. Done for user page-table map/unmap/protect through resident CPU masks, vector-49 shootdown, pending full-TLB flush generations, completion waits, and syscall-entry/flush-before-user-return hooks. Remote AP targets become active when AP scheduler ownership records AP residency.
  4. Split scheduler current/run-queue ownership into per-CPU state, with a reviewed AP idle-to-runnable handoff. Done for per-CPU current-thread slots, the first AP cpu=1 scheduler owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting; shared scheduler lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred.
  5. Prove the existing manifest/ring/thread/park smokes under -smp 2.

With multiple CPUs running scheduler-owned work, shared mutable state needs careful handling.

TLB Shootdown

When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.

Scenarios requiring shootdown:

  • Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
  • Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
  • Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.

Current code uses local mapper flushes in AddressSpace::map, AddressSpace::unmap, and AddressSpace::protect, then calls the serialized shootdown helper with the address space’s resident CPU mask. Those methods are reached from VirtualMemoryCap’s parse_map, parse_unmap, and parse_protect anonymous mapping paths and MemoryObjectCap::{map,unmap,protect} borrowed mapping paths. Scheduler CR3 handoff marks the selected address space resident on the current CPU, including AP cpu=1 during the first AP scheduler-owner proof.

Implementation state consists of vector 49, a resident CPU target mask, and per-CPU pending full-TLB flush generations. The first implementation records pending flush generations for online resident CPUs other than the caller, after the local page-table edit and local flush complete, then sends vector-49 IPIs to prompt immediate drain and returns a completion token. VM capability handlers enqueue completion work after dropping the address-space guard, and cap_enter or timer polling drains the queue after ring dispatch releases cap-table and scratch locks. Handlers reserve fixed-size queue slots before page-table mutation, so overload is reported before rollback, unmap, or protect can mutate state. Drains flush the current CPU before waiting, so a CPU that is itself in the target mask cannot wait on its own pending generation. A target CPU that is already in a syscall and contending on those same locks can eventually reach the IPI or return-path drain. If a target CPU has maskable interrupts delayed while it runs a kernel path, it still drains its pending generation at syscall entry or before returning to userspace from syscall, timer, or scheduler restore paths.

#![allow(unused)]
fn main() {
fn shootdown_page(resident_cpu_mask: u64) {
    let targets = resident_cpu_mask & online_cpu_mask() & !current_cpu_bit();
    let generation = next_shootdown_generation();
    for cpu_id in targets {
        PENDING_FLUSH_GENERATION[cpu_id].store(generation, Ordering::Release);
        lapic::send_fixed_ipi(lapic_id_for_cpu(cpu_id));
    }
    ShootdownCompletion { targets, generation }
}

fn flush_pending_for_current_cpu() {
    while pending_generation(current_cpu_id()) != flushed_generation(current_cpu_id()) {
        let generation = pending_generation(current_cpu_id());
        x86_64::instructions::tlb::flush_all();
        FLUSHED_GENERATION[current_cpu_id()].store(generation, Ordering::Release);
    }
}
}

The first implementation targets the address space’s resident CPU mask rather than every online CPU so parked APs with interrupts disabled are not disturbed. It relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Broader range and page-level coalescing can be added after AP scheduling exists.

LAPIC/IPI Boundary

The normal timer path is now local-APIC-backed: vector 48 handles scheduler ticks with LAPIC EOI after PIT-channel-2 calibration, vector 49 handles TLB shootdown and bounded idle-to-runnable reschedule requests, vector 255 handles LAPIC spurious interrupts without EOI, and vector 32 remains only for the PIT/PIC fallback. AP scheduler owners program their LAPIC timers from the BSP calibration before entering the scheduler-owner loop; if AP timer setup is unavailable, the BSP keeps scheduler ownership. The remaining LAPIC/IPI work is broader scheduler-driven AP idle policy, future preemptive reschedule policy, and a later x2APIC MSR backend after the architectural xAPIC MMIO path is correct, not the bounded idle-to-runnable wake request path.

The TLB shootdown IPI handler must not allocate and must not take locks that can be held while sending a shootdown. Completion waits must happen after dropping the mutated address space’s lock and ring dispatch’s cap-table/scratch locks. The deferred completion queue must remain bounded, non-allocating at enqueue, and reserved before page-table mutation. Syscall-entry and user-return paths must drain pending flush generations so delayed maskable IPI delivery cannot leave a target CPU unable to observe completion or resume a thread with stale TLB state.

KVM paravirtual features such as kvm-pv-eoi, kvm-pv-ipi, and kvm-pv-tlb-flush are future performance work. They must not be required for the first LAPIC timer, IPI, or TLB-shootdown correctness proofs.

Lock Audit

Existing spinlocks need review for SMP safety:

LockCurrent UseSMP Concern
SERIALCOM1 outputSafe but high contention if many CPUs print. Acceptable for debug output.
ALLOCATORFrame bitmapHot path. Holding lock during full bitmap scan is O(n). Consider per-CPU free lists.
KERNEL_CAPSKernel cap tableLow contention (init only). Safe.
SCHEDULER.currentSingle global running-thread slotSplit into PerCpu.current_thread in Phase A.

Before APs can run userspace, the scheduler also needs an explicit CPU residency record for each live thread or address space. That record drives TLB shootdown targeting and prevents migration from racing page-table changes. Process exit and thread exit must clear residency before freeing stacks, address spaces, or ring state that another CPU might still observe.

Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an interrupt whose handler tries to acquire the same lock, deadlock. This is already noted in REVIEW.md. Fix: disable interrupts while holding locks that interrupt handlers may need (frame allocator, serial). The spin crate supports MutexIrq for this pattern, or use manual cli/sti wrappers.

Allocator Scaling

The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.

Options (in order of complexity):

  1. Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
  2. Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).

Option 1 is recommended for initial SMP. ~50-100 lines.

The heap allocator (linked_list_allocator) is also behind a single lock. For a research OS this is acceptable initially – heap allocations in the kernel should be infrequent compared to frame allocations.


Cap’n Proto Schema Additions

SMP introduces a kernel-internal CpuManager capability for inspecting and controlling CPU state. This is not exposed to userspace initially but follows the “everything is a capability” principle.

interface CpuManager {
    # Number of online CPUs.
    cpuCount @0 () -> (count :UInt32);

    # Per-CPU info.
    cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}

This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.


Estimated Scope

PhaseNew/Changed CodeDepends On
Phase A: BSP per-CPU foundationDone (BSP PerCpu, syscall-stack storage, scheduler mirror, stack-update hook)Stage 5
Phase B: AP startupDone (MpRequest, AP records/stacks, AP CR3/RSP handoff, parked idle)Phase A
Phase C: Multi-CPU schedulingIn progress (GS/swapgs migration, LAPIC timer/IPI with EOI, shootdown-aware VM mutation wrappers, pending TLB generation completion, per-CPU current slots, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open)Phase B
Ring v2 for full SMPTBD (per-thread rings, completion routing, SQPOLL ownership)Phase C plus threading/park
TotalTBD after Phase C hardware/scheduler audit

Milestones

  • M1: Per-CPU data on BSP – BSP PerCpu syscall-stack/current-thread state, BSP per-CPU kernel-entry stack hook, and single-CPU QEMU proofs. Done.
  • M2: APs running – secondary CPUs reach idle_loop(). BSP prints “N CPUs online”. make run still runs init on BSP. Done.
  • M3: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others. Done for address-space resident masks and AP cpu=1 residency marking.
  • M4: Multi-CPU scheduling – processes can run on any CPU. The existing boot-manifest service set still works, but the scheduler distributes work across CPUs once runnable processes are available (runtime spawning still depends on ProcessSpawner). Temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open.
  • M5: Ring v2 completion ownership – every live thread can own a ring endpoint; endpoint, timer, park, process-wait, and thread-join completions route by ThreadRef. This is the target for full SMP where sibling threads in one process wait independently on different CPUs.

Open Questions

  1. x2APIC backend. Phase C currently has an xAPIC MMIO LAPIC foundation. A later x2APIC MSR backend is still needed for newer/high-core systems and firmware states where xAPIC is unavailable or locked out; it should not block TLB shootdown on the current implementation path.

  2. Idle strategy. hlt is the simplest idle. mwait is more power-efficient and can be used to wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets.

  3. CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.

  4. NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.

  5. Scheduler policy. The current multi-CPU scheduler uses per-CPU WFQ runnable queues ordered by virtual_finish_ns under the shared scheduler lock, with bounded stealing from sibling queues when a CPU has no local runnable entry. Scheduler Evolution Phase D (per-CPU WFQ and bounded stealing, closed 2026-05-10) and Phase E (SchedulingContext bind/revoke, budget, donation/return, depletion notification) are closed against this substrate; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress; the first automatic nohz activation increment and SQPOLL-driven auto-nohz activation are both closed (see docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). The older round-robin/global-overflow starting point is historical, not the current baseline. Future refinements are shared-lock reduction, temporary pinning replacement, stronger CPU-affinity/admission policy, broader workload-class evidence, higher-thread-count evidence, and the Phase F.5 full-SMP 16/32-core scalability proof.


References

Specifications

Limine

Virtualization

Prior Art

  • Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
  • xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
  • Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
  • BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)

Proposal: Ring v2 For Full SMP

How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.

The current ring design is intentionally process-wide: one ring page per process, one SQ, one CQ, and one blocked cap_enter waiter admitted per process. That was the right first threading milestone because it preserved the existing transport while moving scheduler identity from process ids to generation-checked ThreadRef values.

That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.

Design Grounding

The local research files checked before this design were:

  • docs/research/completion-ring-threading.md;
  • docs/research/out-of-kernel-scheduling.md;
  • docs/research/llvm-target.md;
  • docs/research/sel4.md;
  • docs/research/zircon.md.

The relevant result is that efficient shared rings want clear producer/consumer ownership. Linux io_uring uses user_data to identify requests, but its aggregate wait model does not by itself solve multiple user consumers waiting on one raw CQ. Futexes provide the right user-runtime parking primitive for compatibility demux. Windows IOCP is a shared completion packet queue model, which is useful as a runtime abstraction but should not be confused with letting several kernel-blocked threads wait on the same circular CQ storage.

Target Model

Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.

Each endpoint has:

  • one userspace SQ/CQ pair;
  • one kernel RingScratch or equivalent dispatch scratch owned by that thread or by the ring endpoint;
  • one blocked cap_enter waiter for that thread’s CQ;
  • one ring address passed to the thread at startup.

The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.

cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the meaning becomes:

Process pending SQEs for the current thread’s ring, then block the current thread until at least min_complete CQEs are available on that same thread’s CQ, or until timeout.

Userspace still matches individual requests by user_data within the current thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage, not durable request identities.

Thread Creation And Bootstrap

The initial thread may keep the legacy fixed RING_VADDR mapping during the transition. Additional threads need unique ring mappings because all threads share one address space.

The initial accepted contract is kernel-chosen ring mapping. ThreadSpawner does not accept a caller-supplied ring address for the first Ring v2 slice. The kernel allocates a ring record, maps that ring at a collision-free user virtual address in the caller’s address space, charges it to the process ledger, stores the address on the child ThreadRef, and passes the address in the child start registers. If no ring mapping or record can be allocated, thread creation fails before the child thread becomes runnable and rolls back all thread and ring reservations.

Runtime-supplied ring address ranges remain a later extension. They need reviewed VirtualMemory reservation semantics so the runtime can reserve a ring arena without racing normal user mappings. Until that extension lands, Ring v2 implementation branches must not add a ThreadSpawner.create parameter for a caller-selected ring address.

The child thread entry contract should continue to pass bootstrap register values equivalent to:

  • RDI = arg;
  • RSI = tid;
  • RDX = pid;
  • RCX = thread_ring_addr;
  • R8 = CAPSET_VADDR, or zero if absent.

For the initial process thread, _start keeps receiving the ring address from the loader ABI. Once every userspace binary uses the runtime-provided ring address instead of assuming RING_VADDR, the fixed mapping can become a bootstrap-only compatibility detail.

When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:

#![allow(unused)]
fn main() {
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
}

RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs. Kernel code and capos-rt must import the shared definition instead of maintaining parallel boot-ABI structs.

The first implementation may continue using the existing fixed SQE/CQE layout and RING_VADDR for the initial thread. It still needs a shared ring-endpoint descriptor in the kernel so initial-thread and child-thread rings use the same lifetime, waiter, and completion-routing rules. The fixed initial mapping is a compatibility special case, not a separate process-wide ring once Ring v2 is enabled for a process.

The Tickless/Realtime proposal owns the first CapSqeV2 use case (deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport rule: every thread ring handoff must carry or imply the same ABI version and entry sizes that cap_enter validates. A runtime must not infer CapSqeV2 from the address alone.

Completion Routing

Any kernel record that can later post a CQE must store a target ThreadRef and post to that thread’s ring after generation validation:

  • ordinary CALL completions target the submitting thread;
  • endpoint RECV completions target the receiver thread;
  • endpoint RETURN completions target the original caller thread;
  • Timer.sleep completions target the sleeping thread;
  • ProcessHandle.wait completions target the waiting thread;
  • ThreadHandle.join completions target the joining thread;
  • ParkSpace wait wake/timeout completions target the waiting thread;
  • deferred endpoint cancellation completions target the thread that posted the cancelable operation.

Process exit cancels every ring owned by the process. Thread exit cancels that thread’s own ring operations and wakes/drops waiters that name its ThreadRef. If a thread exits with outstanding operations that can still complete, the kernel must either cancel them before releasing the ring or hold the ring record until all generation-checked completion paths drain.

Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.

The implementation contract for completion routing is:

  • scheduler state resolves ThreadRef -> RingEndpoint immediately before posting a CQE;
  • a missing process, stale process generation, missing thread, stale thread generation, or closed ring endpoint turns the completion into a stale completion and must not write userspace memory;
  • a ring endpoint stays pinned while a completion writer owns its reference;
  • result-cap installation still targets the shared process cap table, but the CQE that names the installed result-cap slot is written only to the target thread’s CQ;
  • cap_enter drains and waits on the current thread’s ring only; it never drains a sibling thread’s SQ and never waits on a process-wide CQ;
  • same-process thread scaling remains unclaimable until endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths all follow this ThreadRef -> RingEndpoint rule.

SQPOLL And Kernel Consumers

Each thread ring must have exactly one kernel SQ consumer at a time:

  • syscall mode: the owner thread’s cap_enter drains its own SQ;
  • SQPOLL mode: a kernel worker drains that ring’s SQ, and cap_enter waits for CQ availability and returns counts. Userspace remains the CQ consumer.

The Phase F prerequisite now makes this an explicit kernel-side lease for the current per-thread ring endpoints. Syscall-mode dispatch has a generation-checked owner covering both caller-driven cap_enter and bounded timer-side current-thread ring service; a stale owner cannot advance SQ head, and a duplicate future SQPOLL owner is rejected while the syscall owner is live. This does not enable SQPOLL mode, nohz, or CPU isolation.

Mode changes require quiescing the ring so cap_enter and SQPOLL do not both consume the same SQ. SQPOLL workers should be bound through scheduler policy or future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.

Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.

Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when a housekeeping CPU remains online, the SQPOLL worker is the only runnable entity on that CPU, no timer-side SQ polling or transitional network scheduler polling is pinned there, and CPU accounting is boundary/counter driven rather than tick-driven. Phase F now reports explicit housekeeping/deferred-work placement or rejection for those prerequisites while keeping syscall-mode SQ ownership, periodic ticks, and SQPOLL disabled. The broader staging is in Tickless and Realtime Scheduling.

Scheduler And SMP Requirements

Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:

  • per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
  • per-CPU run queues plus a migration/work-stealing protocol;
  • a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
  • TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
  • cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
  • address-space locking rules for concurrent VirtualMemory operations, process exit, and user-buffer copy paths;
  • process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.

The first Phase C multi-CPU scheduler smoke may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke that runs sibling threads from one process concurrently on different CPUs should wait for per-thread ring completion routing and TLB shootdown review.

Compatibility Bridge

Before Ring v2, capos-rt can support multithreaded programs on the current process ring with a runtime reactor:

  • one runtime-owned waiter drains the process CQ;
  • ordinary client threads block on runtime wait records using ParkSpace;
  • the reactor matches CQEs by user_data and unparks the waiting thread.

This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.

Rejected Direction: Slot-Specific cap_enter

Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer storage and can be reused after cq_head advances. A correct specific-wait design would need stable request ids or completion tokens, at which point per-thread ring endpoints solve the same ownership problem with less special-case kernel state.

Roadmap

  1. Runtime reactor bridge on the current process ring.
  2. Add the shared RingEndpoint kernel record and make the initial fixed bootstrap ring use it without changing userspace behavior.
  3. Move ring allocation/accounting from process-only state to thread-owned ring records.
  4. ThreadSpawner.create allocates/maps a kernel-chosen per-thread ring and passes its user address to the child.
  5. Scheduler waiters and endpoint/timer/park/process/thread completion paths post by target ThreadRef to that thread’s ring.
  6. cap_enter operates on the current thread’s ring; remove the one-process-ring waiter rule.
  7. Add SQPOLL mode only after per-CPU scheduler state exists.
  8. Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
  9. Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.

Proposal: Scheduler Evolution

capOS should evolve its scheduler in layers. The goal is not one clever algorithm; it is a capability-shaped CPU subsystem that scales ordinary work, admits realtime islands, allows service/runtime-specific policy, and preserves a small auditable kernel dispatch path.

This proposal complements, rather than replaces, Tickless and Realtime Scheduling. That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the broader scheduler architecture and roadmap.

Design Grounding

Local grounding:

Goals

  • Keep protected dispatch, budget enforcement, interrupt handling, and idle in the kernel.
  • Replace the single global runnable queue with per-CPU runnable ownership and bounded cross-CPU wake/migration.
  • Add CPU accounting before adopting policy that depends on runtime charge.
  • Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like virtual-deadline scheduling as the target after accounting exists.
  • Represent admitted CPU time as SchedulingContext capability authority.
  • Represent isolated CPU ownership as CpuIsolationLease authority.
  • Support user-space scheduler policy services for admission and tuning without putting user-space calls on every dispatch path.
  • Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging, TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.

Full-SMP Scalability Focus

The scheduler work after the current Phase F chain should be judged by whether capOS can keep useful throughput and bounded scheduling overhead on 16/32-core machines, not by another small QEMU-only speedup row. The SMP proposal owns CPU bring-up and APIC/TLB substrate; this proposal owns the scheduler changes needed to make that substrate useful at higher core counts.

The scheduler side of the milestone should include:

  • dynamic scheduler CPU sets derived from discovered topology instead of the temporary four-owner mask;
  • per-CPU run queues and current-thread state that do not require one shared lock for ordinary local pick/requeue paths;
  • narrower shared metadata locks for process/thread lookup, blocking waiters, exit cleanup, direct IPC handoff, and timer/deadline waiters;
  • bounded cross-CPU wakeup and migration that records target, source, steal, reschedule-IPI, and failed-placement counters;
  • topology-aware placement that separates physical cores, SMT siblings, and later NUMA/cache groups;
  • total-time accounting for spawn/join/exit and service-bound workloads, not only syscall-free work windows;
  • hardware-run artifacts that include native Linux baselines on the same machine and QEMU rows only as regression or virtualization context.

The benchmark shape should include static map/reduce, uneven dynamic tasks, barrier-heavy phase loops, independent processes, same-process threads, and a capability-call/service-bound workload. That matrix is intentionally broader than the old thread-scale checksum row because high core counts often expose lock convoying, wakeup storms, timer/IPI cost, TLB-shootdown scaling, and runtime lifecycle overhead before pure compute saturates.

Non-Goals

  • Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
  • Do not expose arbitrary user-supplied scheduler programs in the kernel in the near term.
  • Do not make a user-space process the mandatory next-thread dispatcher.
  • Do not claim hard realtime until admission, budget enforcement, IRQ/device behavior, kernel-path latency, and WCET evidence exist.
  • Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler proof.

Architecture

The target scheduler has four layers:

  • Kernel mechanism: per-CPU run queues, current-thread state, idle, context switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget enforcement, and timeout/depletion faults.
  • Kernel policy primitives: best-effort weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, direct IPC donation, and realtime-island hooks.
  • Privileged scheduler policy service: admission, budget/profile selection, CPU partitioning, isolation grants, service/runtime hints, policy reload, and operator diagnostics.
  • Application/runtime schedulers: work stealing, actors, async reactors, language M:N schedulers, request queues, and service-local priority and batching.

The hot path remains local and bounded: timer interrupt or wakeup, charge runtime, update runnable state, pick from a per-CPU queue or a bounded steal path, switch context. User-space policy participates at slower boundaries: profile changes, thread/process creation, budget depletion, realtime admission, lease grant/revoke, or explicit operator policy updates.

Stateful task/job graph coordinators sit above these layers. They may own graph node queues, leases, retry state, cancellation, and assignment metadata, but they do not own CPU dispatch. A graph node’s priority, deadline, budget, or queue field is workload policy until a capability-authorized scheduler policy service maps it to a weight, scheduling context, CPU lease, or request deadline.

Stage 0: Evidence Before Policy

Before changing the default policy, the active thread-scale attribution work must keep policy conclusions separated from benchmark artifacts. Current mainline evidence now includes:

  • scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, and CR3/TLB counters behind CAPOS_THREAD_SCALE_GUEST_MEASURE=1;
  • raw guest-PC samples for user-mode timer preemption points;
  • logging-suppression A/B evidence through CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1;
  • exact native Linux pthread baseline evidence, including compact-versus-padded result-slot diagnostics;
  • larger-workload/Amdahl evidence through CAPOS_THREAD_SCALE_TOTAL_BLOCKS and LINUX_THREAD_SCALE_TOTAL_BLOCKS.

This evidence does not prove the primary remaining cause of non-scaling. Per-CPU runnable ownership, accepted work/total speedup thresholds, and optional symbolic guest attribution remain follow-on work before a scheduler policy claim.

This protects the design from treating QEMU/KVM, serial MMIO, or benchmark cache contention as a scheduler algorithm problem.

Stage 1: Per-CPU Runnable Ownership

Split the scheduler’s runnable state first. The accepted initial shape has per-CPU run queues with a runnable ThreadRef deque or priority buckets, current-thread state, a local reschedule flag, and local counters. Shared scheduler state keeps process/thread metadata, sleeping/deadline waiters, blocked waiters, migration records, and the global policy epoch.

Rules:

  • A runnable ThreadRef is owned by exactly one CPU queue at a time.
  • Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends a bounded reschedule IPI when needed.
  • Migration removes from one owner before publishing to another.
  • Idle CPUs steal only through bounded policy, not by scanning every process.
  • Process exit and thread exit keep cleanup bounded and must not allocate in interrupt, cancellation, or emergency paths.

This stage may still use round-robin within each CPU queue. The objective is SMP structure and evidence, not perfect fairness.

First implementation evidence exists as commit 1a8bf909: capOS introduced four bounded per-scheduler-CPU FIFO runnable queues under the existing global scheduler lock. That slice proved the basic ownership structure and bounded steal path. Follow-up review fixes reserved per-CPU queue capacity before a thread became runnable, using a live reservation count released on process/thread exit or pre-publication rollback, so timer and unblock requeues did not allocate after work moved between CPUs. Update 2026-05-02: the per-CPU queues were collapsed back into a single global runnable queue under the same scheduler lock with the per-CPU run-queue-collapse cleanup slice (see docs/backlog/scheduler-evolution.md and docs/architecture/scheduling.md). Update 2026-05-07 23:45 UTC: Phase D Task 3 reintroduced the per-CPU runnable queues, this time ordered ascending by virtual_finish_ns (Weighted Fair Queueing) and balanced by a bounded steal path that picks the most-overdue sibling Runnable candidate (each sibling queue’s first entry the destination CPU considers Runnable; ties broken by lower CPU id). The queue ownership and migration contract is documented in the scheduling architecture page. This does not close the stage: the scheduler still needs stronger cross-CPU wake counters, further separation from shared process/thread metadata, replacement of temporary pinning policy, and accepted benchmark evidence before policy conclusions should change.

Stage 2: CPU Accounting

Add a monotonic runtime charge model. ThreadCpuAccount records runtime, last-start time, virtual runtime, context switches, preemptions, and voluntary blocks. SchedEntity records weight, latency class, eligible time, and virtual deadline.

Accounting must be stable enough to support fair scheduling, quotas, and future scheduling contexts. It must account context switches, blocking syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.

Where exact cycle attribution is not yet credible, the implementation should label the metric as diagnostic rather than enforcing policy from it.

Stage 3: Best-Effort Fair Policy

Stage 3’s first implementation slice has landed. Phase D passed its Task 6 evidence gate at commit 77caafc0 (2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate) and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d) with weighted fair queueing (WFQ) as the accepted best-effort policy. The controlled Task 6 benchmark pair recorded capOS 1-to-4 work/total speedups 3.088x / 2.700x at 4 workers, materially closing the prior single-global-queue 1.566x / 1.538x diagnostic gap while the matching Linux pthread baseline on the same host and physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The completed execution plan is archived at docs/backlog/scheduler-evolution.md.

After Phase D, capOS should continue ordinary best-effort scheduling from WFQ toward virtual-time fairness with stronger eligibility semantics only when that follow-on is explicitly selected.

The long-term target policy is EEVDF-like:

  • runnable entities accrue lag against their fair share;
  • eligible entities are ordered by virtual deadline;
  • weights affect virtual runtime/deadline progression;
  • latency-sensitive best-effort entities can request smaller slices within policy limits;
  • migration preserves accounting so moving CPUs does not reset fairness.

The first implementation slice was intentionally narrower than EEVDF: weighted fair queueing on top of the existing per-thread runtime/vruntime accounting. That decision and its accepted evidence are recorded in the next subsection.

Phase D first-policy decision (2026-05-05 19:00 UTC)

Decision: weighted fair queueing (WFQ) for the first Phase D slice; EEVDF remains the deferred follow-on. Recorded against main commit 60e421ab and the 2026-05-02 21:38 UTC thread-scale evidence pair against main commit 374f8556 (capOS work 1.566x versus Linux 3.963x at 1-to-4 on the same physical-core pin set).

Rationale (concise):

  • The 1-to-4 gap is dominated by single-global-queue scheduler-lock contention plus exit/join/block/schedule overhead, not by ordering. Any fair-share policy that successfully consumes a per-CPU split should close most of the gap. The simpler policy reaches that signal sooner with less risk.
  • The existing ThreadCpuAccounting record separates the load-bearing ledger from benchmark diagnostics: runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional, while context_switches, preemptions, voluntary_blocks, migrations, placement history, and blocked/exited stability probes stay behind cfg(feature = "measure"). WFQ needs only a per-thread weight and a virtual finish time derived from the unconditional vruntime; that mapping is direct. EEVDF additionally needs a per-thread request size, lag, eligibility deadline, and an ordered eligible-set structure (BTreeMap by virtual deadline). The runtime/vruntime accounting fields exist, but the eligibility/lag fields do not.
  • The target environment is no_std plus spin::Mutex plus a single global scheduler lock. WFQ keeps the eligibility structure as a bucketed per-CPU FIFO ordered approximately by virtual finish time; that is a familiar VecDeque-shaped data structure that mirrors the current run_queue: VecDeque<ThreadRef> ownership. EEVDF requires an ordered set inside the scheduler-lock-protected dispatch state, which is a larger structural change than the slice the gap evidence motivates.
  • Latency-class differentiation (interactive / batch / IPC server) is expressible in WFQ; Phase D pins the mapping below in the capability-surface section so the implementation slice and the short-sleeper smoke have one rule. The Phase H policy service can layer richer policy on top without requiring a tree representation underneath.
  • Linux moved from CFS to EEVDF in mainline 6.6 (released 2023-10); WFQ has decades of stable OS lineage. Either choice is defensible. The weighted-fair slice does not lock capOS into WFQ permanently — the same accounting fields, capability surface, and migration contract carry directly into EEVDF when the eligibility structure is added.

Rejected alternative: EEVDF-first. It is the stronger long-term policy and Linux’s current default. We are not picking it for the first slice because (1) the eligibility-set data structure is a larger diff that mixes structural change with the per-CPU enqueue reintroduction the 1-to-4 gap evidence already motivates; (2) the lag accounting and request-size ABI are not load-bearing for closing the single-global-queue contention bottleneck the recorded benchmark exposes; (3) moving from WFQ to EEVDF is a localized policy-module change once the capability surface, migration contract, and per-CPU queue split are accepted. The deferred EEVDF follow-on is tracked as a later policy-evaluation slice; it is not a Phase D blocker and does not displace Phase E SchedulingContext, which is the next scheduler authority phase after the accepted WFQ gate.

First-slice scope (smallest implementable surface that closes the 1-to-4 gap):

  • per-thread weight: u16 and latency_class: LatencyClass fields, default values matching the current single-class FIFO behavior; the cap-boundary path rejects weight = 0 and any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) with CapException::InvalidArgument rather than silently clamping, so no later divide-by-zero or overflow path can be reached through setWeight and so callers see policy denial instead of a hidden mutation. The invalidArgument variant landed in ExceptionType alongside SchedulingPolicyCap and LatencyClass with Phase D Task 1 (commit cb8c58b1, 2026-05-07); see docs/proposals/error-handling-proposal.md for the updated client-response taxonomy. The full validation rule lives in the cap-surface authority section below; this bullet records only that the validation runs at the cap boundary, not the dispatch path;
  • per-thread weighted vruntime charging at runtime-charge points: the existing ThreadCpuAccounting.virtual_runtime_ns advances by elapsed_ns * REFERENCE_WEIGHT / weight (instead of the current 1:1 elapsed) on every charge_runtime call. runtime_ns continues to advance 1:1 with elapsed time so monotonic CPU accounting, measure-mode reporting, and snapshot APIs are unchanged. The weighted-vruntime change is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share. This matches the CFS-lineage approach and keeps the WFQ derivation virtual_finish = vruntime + slice * REFERENCE_WEIGHT / weight purely as an ordering aid for the local bucket;
  • per-thread virtual_finish_ns: u64 recomputed at each enqueue from virtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight. It is not stored across blocking and is never carried as committed state; it is the per-enqueue ordering tag only;
  • per-CPU bounded run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] (reintroduced) each ordered ascending by virtual_finish_ns; local selection scans the queue by index for the first destination-Runnable entry (RetryLater entries left in place; the first Runnable hit is also the lowest virtual_finish_ns candidate the destination can accept because the queue is ordered), then falls back to a bounded steal scan of sibling per-CPU queues;
  • scheduler-lock-contained migration that keeps virtual_runtime_ns with the thread (per-thread state, not per-CPU) and re-inserts on the destination CPU at the post-migration virtual finish time;
  • a capability-authorized policy path (see §“Phase D capability surface” below) that gates weight/latency-class mutation and reads;
  • one-bisect-cycle single-global-queue fallback under CAPOS_SCHED_DISABLE_WFQ=1, now retired by Phase E preflight before SchedulingContext schema work.

The first slice is accepted: the 2026-05-10 19:46 UTC make run-thread-scale evidence pair recorded in docs/changelog.md and docs/benchmarks.md passed the harness-enforced 1-to-2 work/total gates, and Phase D manually accepted the recorded 1-to-4 work/total diagnostics for closeout. The historical success threshold lives in docs/backlog/scheduler-evolution.md.

Phase D capability surface (kernel-side authority, no ambient process fields)

Per docs/capability-model.md “the interface IS the permission”, weight and latency-class authority is granted by giving a process a SchedulingPolicyCap with the appropriately scoped target. The kernel rejects any state mutation that does not arrive through such a cap.

Schema (landed with Phase D Task 1, commit cb8c58b1, 2026-05-07; the original sketch took a target :ThreadHandle per method, but the methods carry no target argument because Phase D associates the target through cap state, not a per-method handle parameter. Phase D Task 2 (closeout 2026-05-07 22:51 UTC) selected the context-derived caller-thread fallback binding from the three sketched options. Every method routes to the calling thread, looked up through CapCallContext::caller_thread. The kernel cap object remains zero-sized (SchedulingPolicyCap); routing moved from call to call_with_context so the dispatch path sees the caller’s ThreadRef. There is no per-cap-object ThreadHandle, no badge-encoded thread id, and no cross-thread or cross-process mutation in this slice; per-cap-object target references and badge-encoded thread ids are reserved for the Phase H privileged scheduler policy service that will need cross-thread authority. Today the manifest grant path therefore authorizes the holder’s own threads in the strict sense – a holder cannot reach another thread’s weight or latency_class through this cap):

enum LatencyClass {
    interactive @0;
    normal      @1;
    batch       @2;
    ipcServer   @3;
}

interface SchedulingPolicyCap {
    setWeight @0 (weight :UInt16) -> ();
    setLatencyClass @1 (class :LatencyClass) -> ();
    snapshot @2 ()
        -> (weight :UInt16, class :LatencyClass,
            runtimeNs :UInt64, virtualRuntimeNs :UInt64);
}

The snapshot return is intentionally narrow: the four fields it exposes (weight, class, runtimeNs, virtualRuntimeNs) are the ones the WFQ slice promotes out of cfg(feature = "measure") unconditionally. The benchmark-only counters (context_switches, preemptions, voluntary_blocks, migrations) stay behind the measure feature because they are not load-bearing for ordering and remain useful only for benchmark instrumentation; a future operator-observability slice can add them to a separate snapshot cap once a non-emergency-path storage and reporting surface exists.

Authority rules:

  • setWeight and setLatencyClass are kernel-checked: an SQE invocation must carry a live SchedulingPolicyCap. The methods carry no per-call ThreadHandle; the target binding (selected in Phase D Task 2) is the context-derived caller-thread fallback: the kernel routes through CapCallContext::caller_thread, so a holder can only mutate its own running thread by construction. If a future cross- process grant lets a holder invoke the cap without authority over its bound target, the call fails closed through the standard cap-revocation transport-error path (the disconnected-class CapException produced by the ring dispatcher when the cap is revoked or stale); the ExceptionType taxonomy has no Denied variant by design.
  • setWeight validates the input at the cap boundary, not at the dispatch path. The validation rule is: weight = 0 (which would make the WFQ derivation slice_ns * REFERENCE_WEIGHT / weight divide by zero) is rejected with CapException::InvalidArgument; any nonzero value outside [MIN_WEIGHT, MAX_WEIGHT] (Phase D constants) is also rejected with CapException::InvalidArgument. The kernel does not silently clamp out-of-range values, because a silent clamp masks caller bugs and hides cap-boundary policy from the audit surface. The invalidArgument variant landed in ExceptionType with Phase D Task 1 (commit cb8c58b1, 2026-05-07); the updated client-response taxonomy is in docs/proposals/error-handling-proposal.md.
  • The bootstrap SchedulingPolicyCap is granted by manifest only. Its initial domain is Self (the holder’s own threads). Wider authority (cross-process weight/class mutation) belongs to the Phase H privileged scheduler policy service; Phase D does not promise that grant in the default boot manifest. Phase D manifests grant only the focused-proof scope needed for the test-matrix smokes.
  • Default policy: a thread without any explicit cap-driven mutation carries weight = DEFAULT_WEIGHT and latency_class = LatencyClass::Normal. Behavior with all defaults must preserve the pre-Phase-D default workload behavior at the limit (no fairness regressions for unmodified workloads).
  • Stale-cap revoke: SchedulingPolicyCap mutations carry the generation/epoch model used elsewhere. A weight change submitted after the cap is revoked fails closed; partially applied changes on a thread that exits between SQE arrival and dispatch fail with the standard Stale outcome and do not leak weight state.
  • The cap surface is a single typed interface; restriction is by granting a narrower wrapper (e.g., SchedulingPolicyCap whose authority domain is exactly one ThreadHandle). The kernel does not carry a parallel rights bitmask.

Latency-class semantics for Phase D (pinned mapping):

  • LatencyClass::Normal is the baseline; weight alone determines the WFQ share. The selected slice_ns is the Phase D default quantum.
  • LatencyClass::Interactive reduces the per-enqueue slice contribution by a Phase D constant (INTERACTIVE_SLICE_DIVISOR; Phase D Task 2 ships 2): the WFQ derivation becomes vruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight. This places the entity earlier in the per-CPU queue on each enqueue, so a short-sleeper that wakes on a Timer completion runs ahead of a same-weight CPU hog within the same scheduling window. The cumulative share is unchanged because vruntime accounting still advances at elapsed_ns * REFERENCE_WEIGHT / weight; the class only affects the per-enqueue tag, not the runtime-charge step.
  • LatencyClass::Batch increases the per-enqueue slice contribution by a Phase D constant (BATCH_SLICE_MULTIPLIER; Phase D Task 2 ships 4): the derivation becomes vruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight. This places the entity later in the per-CPU queue on each enqueue, so a CPU hog at LatencyClass::Batch yields wake-to- run latency to LatencyClass::Normal and LatencyClass::Interactive siblings without losing its weighted share over a long window.
  • LatencyClass::IpcServer is treated identically to LatencyClass::Normal for the WFQ ordering tag in this slice. The class exists in the ABI so a Phase H policy service can later re-bind direct-IPC preference, server affinity, or scheduling-context donation rules without an ABI break; Phase D does not change the existing direct-IPC preference slot semantics for this class.
  • The class is stored on Thread and read at every enqueue. A class change through setLatencyClass is observed on the next enqueue (next dequeue + re-enqueue, or next wake from blocked). No retroactive recomputation of an in-queue tag.

Phase D does not build the userspace policy service (Phase H). It adds the kernel-side primitive that Phase H will consume. SchedulingContext (Phase E) is a separate authority for budget/period/CPU mask; weight/latency-class is the WFQ ordering knob, not CPU-time authority. The two cap surfaces stay disjoint.

Phase D migration fairness sketch

A thread migrating from CPU A to CPU B mid-quantum must preserve its share. Rules:

  • virtual_runtime_ns is per-thread, not per-CPU. It travels with the thread on every migration. The accounting record already encodes that (ThreadCpuAccounting.virtual_runtime_ns lives on Thread, not on a CPU slot). Phase D promotes that field out of cfg(feature = "measure") and changes the charge_runtime step so the field advances by elapsed_ns * REFERENCE_WEIGHT / weight rather than 1:1 with elapsed time; the migration contract is otherwise unchanged.
  • Per-CPU local clocks are not used as a vruntime reference. The scheduler reads the global monotonic clocksource through crate::arch::context::monotonic_ns(), the same source the unconditional runtime/vruntime ledger uses. There is no per-CPU clock offset because there is no per-CPU vruntime reference.
  • virtual_finish_ns is recomputed at enqueue on the destination CPU from the destination weight, not carried as committed state. The migration step is remove-from-source, recompute, insert-at-destination; the scheduler lock is held for the whole window.
  • Cross-CPU steal: a CPU whose local queue has no runnable entry walks sibling per-CPU queues. For each sibling queue the scan walks indices ascending and stops at that queue’s first entry the destination CPU considers Runnable; because each queue is ordered ascending by virtual_finish_ns, the first Runnable hit per queue is the lowest virtual_finish_ns candidate the destination can accept on that source. The steal target is then the source queue whose first-Runnable candidate has the lowest virtual_finish_ns globally — the same fair-share rule the local pick uses (most overdue first) — with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there); the destination recomputes virtual_finish_ns and inserts at the destination ordered position. The steal is allocation-free because both queues are pre-reserved against the live runnable count.
  • The ThreadCpuAccounting.migrations counter is incremented on each cross-CPU enqueue, both for placement-time spread and for steal. The behavior mirrors the prior pre-collapse counter; the Phase D slice keeps it under cfg(feature = "measure") until a permanent operator snapshot path lands.

The one-bisect-cycle single-global-queue fallback has been retired before Phase E. The accepted Phase D behavior is now always the per-CPU WFQ queue shape described above.

Phase D test matrix

Workload shapes the implementation slice verified before close:

  • CPU hogs (existing make run-thread-scale). Equal-weight same-process threads must split CPU share within bench tolerance. Different-weight threads must split CPU share approximately in proportion to weights (e.g., weights 2:1 → roughly 2:1 runtime ratio). Phase D manually accepted the recorded 1-to-4 diagnostic at 3.088x work speedup versus the recorded 1.566x baseline.
  • Short sleepers. Threads that block on Timer.sleep for short intervals must preempt CPU hogs within one quantum’s worth of bound after wake. Latency-class Interactive should have lower observed wake-to-run latency than latency-class Batch. Phase D closed this with focused make run-thread-fairness and make run-thread-fairness-interactive QEMU smokes.
  • Direct IPC server/client pairs (existing make run-spawn). An IPC server thread woken by an endpoint CALL must keep paired-call timing comparable to the current direct-IPC handoff. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ; a server should not starve when the global vruntime advances on other CPUs.
  • Multi-process load (existing make run-smp-process-scale). Independent worker processes with default weights must preserve the recorded 2026-04-30 1.6x 1-to-2 gate. WFQ across processes (no shared address space) must not regress that proof.
  • Same-process sibling load. This is the same workload shape as make run-thread-scale; it doubles as the per-CPU-queue reintroduction proof.

The exact historical per-workload acceptance numbers live in docs/backlog/scheduler-evolution.md.

Phase D overload behavior

Soft overload (runnable entities × weight exceeds the selected CPU set’s capacity):

  • Each entity gets less than its weighted share. No entity is starved; vruntime ordering guarantees that the most-behind thread runs next.
  • The scheduler does not refuse to enqueue. Phase D’s WFQ does not implement strict admission; that belongs to Phase E (SchedulingContext budget/period) and Phase G (RealtimeIsland admission).

Hard overload (e.g., a RealtimeIsland admission attempt that collides with an active CpuIsolationLease):

  • Use the existing isolation/admission path; Phase D defers to Phase F’s CpuIsolationLease and Phase G’s RealtimeIsland for that behavior. WFQ continues to schedule best-effort work on the housekeeping CPU set.
  • If an isolation lease holds CPU N and N has runnable best-effort work that cannot migrate (e.g., bound by manifest pinning), the lease attempt fails closed; existing CPU-mask validation remains the gate. Phase D does not introduce new pinning policy.

Strict admission, deadline overrun, and budget depletion are explicitly out of scope for Phase D and stay in Phase E/G.

Stage 4: Scheduling Contexts

CPU-time authority becomes a capability. SchedulingContext records budget, period, relative deadline, priority or criticality, CPU mask, remaining budget, replenishment state, timeout endpoint, and overrun policy.

The landed Phase E slices remain narrower than the full target above. The ABI now has SchedulingContextSpec authority inputs for budgetNs, periodNs, relativeDeadlineNs, byte-oriented cpuMask, and overrunPolicy, plus a read-only SchedulingContextInfo snapshot with context identity, lifecycle state, binding state, remaining budget, and an explicit dispatch-effect label. SchedulingContext.info() remains method id 0. SchedulingContext.create() creates a same-interface result cap for a validated spec, bindCallerThread() records one caller-thread binding for the current generation, and revoke() advances the generation and clears the matching thread metadata binding. Bootstrap-granted contexts and contexts returned by create() draw from the same non-wrapping context-id allocator, so the (contextId, generation) binding key does not alias distinct cap objects.

Bound active contexts now install a fixed per-thread dispatcher budget ledger: runtime charge decrements remainingBudgetNs, runnable selection replenishes elapsed periods, and exhausted contexts remain queued but ineligible until the next replenishment period. The effect label is budgetEnforced for active contexts and stays infoOnlyNoDispatchChange for stale/revoked fail-closed paths. Deadline-driven accounting now arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next periodic scheduler tick, and nohz re-arm folds the leased thread’s budget deadline into its existing nearest-deadline timer. Kernel-mode budget one-shot fires restore a live periodic timer before returning to kernel code, so the ordinary and tick-masked paths no longer rely on a full tick quantum to observe budget depletion. Synchronous endpoint donation/return now covers passive receiver threads: endpoint in-flight state carries an internal donation token, receiver runtime charges to the caller-donated context, RETURN, application-exception RETURN, or invalid-result RETURN restores the reduced budget to the caller before caller wake, a donor with an in-flight token is blocked from returning to userspace until RETURN/cancel using an atomic marker-to-block transition that treats already-returned fast paths as normal completion, and nested donation of an already donated context is rejected until stacked return tokens have a dedicated design. Timeout/depletion notifications now use fixed per-context cells allocated at context creation/bootstrap. The cells coalesce budget-depleted and deadline-or-timeout events with typed sequence/count metadata, holder identity, remaining budget, next timestamp, donated-holder marking, explicit-revoke lifecycle state, and ok/revoked/staleGeneration observer results through SchedulingContext.drainNotifications(). Notification publishing does not allocate in scheduler hard paths, publish result caps, append unbounded queues, donate budget, reorder runnable entities, bypass throttling, or imply nohz behavior. A pre-armed observer waiter/wakeup path, realtime admission, SQPOLL, nohz, and CPU placement enforcement remain future work. Stale caps report staleGeneration and cannot mutate the new generation’s scheduler metadata or budget ledger; revoked contexts report revoked. Ordinary non-donated session logout now uses the same stale-generation rule: after UserSession.logout() flips the liveness cell, the scheduler removes matching non-donated bound thread contexts and marks the old cap generation stale. The focused session-context proof covers stale info, bindCallerThread, create, revoke, and notification-drain behavior without result-cap publication or metadata mutation. Donated receiver logout keeps the conservative skip policy: if logout observes a receiver thread holding an endpoint-donated context, the hook counts the skipped donated binding and leaves the donor blocked until endpoint RETURN/cancel commits cleanup. The focused session-context proof covers the RETURN case by showing the receiver logs out while holding the donation, the donor stays blocked, the hook reports donation_inflight_skipped=1, and the caller observes a bound context with reduced remaining budget after RETURN rather than fresh budget. Clean local owner-shell exit now calls the held UserSession.logout() before process exit, and the shell smoke observes the same scheduler hook with no bound local shell SchedulingContext.

cpuMask is a canonical little-endian bitset. CPU n maps to bit n % 8 of byte n / 8, with bit 0 as the least-significant bit of each byte. Empty data means no CPUs are selected, not “all CPUs”; future admission/bind validation rejects empty masks for runnable contexts. Producers omit trailing zero bytes: the all-zero set is encoded as empty, and any non-empty canonical mask has a nonzero final byte. This slice only snapshots that shape and does not enforce placement from it.

Remaining kernel responsibilities:

  • prevent a thread without eligible CPU authority from running;
  • charge runtime to exactly one authority target;
  • add any pre-armed timeout/depletion observer wake path without allocating in emergency paths.

Policy-service responsibilities:

  • admit or reject scheduling contexts;
  • choose budget/period/priority;
  • bind contexts to threads/services;
  • revoke or adjust contexts safely;
  • record operator-visible decisions.

SQE.deadline_ns remains request metadata. It may influence drop, freshness, propagation, and telemetry, but it does not grant CPU budget.

Stage 5: CPU Isolation Leases and SQPOLL

CpuIsolationLease grants placement and exclusivity, not CPU time. It records the owner process/session/service, CPU set, mode, housekeeping exclusions, accounting target, maximum revocation latency, and revoke endpoint. The current Phase F implementation keeps ticks periodic but makes housekeeping/deferred-work placement explicit: at least one online scheduler housekeeping CPU must remain outside active lease candidates, and preflight telemetry routes or rejects deferred cleanup, timer/deadline, network polling, IRQ affinity, scheduler accounting, and cleanup latency before later SQPOLL or nohz behavior can use the lease.

The Phase F substrate landed so far is:

  • the one-SQ-consumer ring-ownership prerequisite that lets nohz/SQPOLL reason about a single submission consumer per ring;
  • nohz activation telemetry that labels admit/reject decisions, rollback reasons, and current periodic-tick fallback state without changing dispatch behavior;
  • housekeeping/deferred-work placement preflight, which fail-closes when unrelated timers, deferred cleanup, network polling, debug/watchdog work, or IRQ delivery would otherwise be pinned to a candidate isolated CPU;
  • a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16) that records tick_suppression=disabled / full_nohz=disabled strings while the activation proof is still open, with generation-checked stale-owner rollback;
  • a clockevent/deadline substrate independent of the periodic tick, so the scheduler can express “wake at deadline T” without depending on periodic ticks to enforce budget;
  • a bounded non-periodic SQPOLL producer-wake progress path that lets a parked SQPOLL worker make forward progress on producer activity without reverting to a periodic tick.

Automatic nohz activation – actually suppressing the periodic scheduler tick on an admitted CPU and restoring it on rollback/revoke/stale generation – was closed for the first increment via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md: the CpuIsolationLease preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window, satisfying proof obligations for single runnable entity on the target CPU, ready housekeeping CPU outside the lease, non-local deferred-cleanup/timer/network/IRQ dependencies, valid accounting target, bounded revocation latency, and generation-checked ring ownership, with fail-closed rollback. SQPOLL-driven auto-nohz activation is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, with the SQPOLL ring-state re-check as the decisive rollback gate. The tick_suppression, auto_nohz, and sqpoll telemetry counters reflect real suppression. Generic full-nohz for ordinary budgeted compute threads is now admitted by explicit SchedulingContext-targeted CpuIsolationLease preflight; production realtime island admission remains deferred independently of these closed tasks.

Activation requires scheduler proof:

  • at least one housekeeping CPU remains online;
  • unrelated timers, deferred cleanup, network polling, and debug/watchdog work are not pinned to the isolated CPU;
  • the active ring has exactly one SQ consumer;
  • the accounting target is valid and chargeable;
  • revocation latency fits the lease policy.

The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC). There are two CPL0 idle paths: the cooperative boot/AP path that hlts at CPL0 on the per-CPU kernel stack, and the steady-state idle-thread path reached from the four dispatch sites (schedule, capos_block_current_syscall, exit_current, exit_current_thread). Both are described in detail in Scheduling.

SQPOLL uses the ring-mode contract in Tickless and Realtime Scheduling. The scheduler proposal adds the CPU-ownership and policy-service side of that contract.

Stage 6: Realtime Islands

A RealtimeIsland is an admitted graph, not a single priority. It records scheduling contexts, memory reservations, device and IRQ reservations, rings/endpoints/notifications, any CPU isolation leases, admission evidence, and overrun/shutdown policy.

Use cases include local audio, realtime voice, robotics control, and selected provider/runtime loops. Admission must fail closed if the graph cannot fit the declared period/quantum and reservations.

Stage 7: User-Space Scheduler Policy

After kernel primitives are in place, a privileged scheduler policy service can own:

  • default resource profiles;
  • session/account/service CPU policy;
  • scheduling-context admission;
  • CPU lease grant/revoke;
  • runtime hints such as latency-sensitive, batch, driver, poller, or agent;
  • AutoNoHz placement for ordinary threads that appear capable of utilizing a full CPU core (see Policy-Service Userstories in tickless-realtime-scheduling-proposal);
  • operator-facing diagnostics and policy reload.

AutoNoHz placement is the policy-service surface that turns the “thread appears capable of utilizing a full CPU core” observation into a bounded CpuIsolationLease against a pre-authorized account or session CPU pool. The lease adds isolation; it does not mint CPU-time authority. The thread still consumes time through its existing SchedulingContext (or coarse ResourceLedger); the lease just removes tick and scheduler noise while that budget is being consumed. Bounds the policy service must enforce on every auto-issued lease – lifetime, revocation latency, accounting target, auto-claim pool capacity, and fairness preemption – are detailed in the tickless proposal.

The kernel still owns emergency fallback. If the policy service is dead, blocked, stale, or malicious, the kernel must continue to enforce safety, revoke leases as policy permits, and schedule a minimal recovery path.

Validation Gates

  • Per-CPU queue work must preserve run-smoke, run-spawn, run-thread-scale, park/ring/process-exit smokes, and SMP smokes.
  • A thread-scale milestone closeout must include repeated controlled capos-bench evidence and raw logs.
  • CPU accounting must include sanity tests that measured runtime increases monotonically while a thread runs and stops while it is blocked.
  • Fair policy changes must include adversarial tests: CPU hogs, short sleepers, direct IPC handoff, multi-process load, and same-process sibling load.
  • Scheduling-context work must include admission rejection, budget depletion, replenishment, endpoint donation/return, timeout notification, stale cap revocation tests, and any future pre-armed notification waiter coverage.
  • CPU leases must include revocation, process exit, session close, and housekeeping fallback tests.
  • Realtime island proofs must show preallocation, no allocation/blocking on admitted paths, deadline miss telemetry, and fail-closed overrun behavior.

Open Decisions

  • Whether the first best-effort fair policy should be weighted fair queueing or direct EEVDF. Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred follow-on. See “Phase D first-policy decision” above.
  • Whether scheduling-context priority is a scalar, a criticality band, or both.
  • Whether SchedulingContext should be bindable to a process default, individual thread, endpoint call path, or all three in the first ABI.
  • Which scheduler telemetry is permanent ABI and which is benchmark-only.
  • How much policy-service state belongs in the boot manifest versus mutable operator configuration.
  • Whether the WFQ slice’s bucketed VecDeque per-CPU queue is the long-term representation or a stepping stone to an EEVDF BTreeMap-based eligibility set. EEVDF is an evaluated follow-on policy, not a committed migration; re-evaluate only when the explicit Phase D follow-on EEVDF migration backlog item is selected. Phase F’s one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress have landed on top of the closed Phase E SchedulingContext gate; the first automatic nohz activation increment is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and SQPOLL-driven auto-nohz activation is closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed. The policy-service AutoNoHz capstone and generic SQPOLL nohz for arbitrary rings remain open. Phase F.5 (full-SMP 16/32-core scalability) is still planning.

Proposal: Tickless and Realtime Scheduling

This proposal captures the scheduling design from the 2026-04-29 discussion and the subsequent implementation status: tickless idle is useful, full-nohz belongs behind explicit CPU isolation authority, and realtime requires scheduling contexts rather than only per-request deadlines.

Design Grounding

The directly relevant grounding is:

External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.

Goals

  • Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
  • Split monotonic timekeeping from timer interrupt delivery.
  • Convert scheduler timeout waiters to absolute monotonic deadlines.
  • Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
  • Define SQE.deadline_ns as request freshness metadata.
  • Define SchedulingContext as CPU-time authority.
  • Define RealtimeIsland as the admission object for media, robotics, provider, and other bounded realtime graphs.

Non-Goals

  • No ambient Linux-style NO_HZ_FULL for arbitrary unbudgeted user threads. Ordinary-thread full-nohz requires an explicit budgeted SchedulingContext target and a CpuIsolationLease.
  • No SQPOLL on the current process-wide ring.
  • No second SQ consumer through timer-side polling for SQPOLL rings.
  • No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
  • No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
  • No full realtime policy blob inside every SQE.

CPU Authority Taxonomy

These terms must not drift into overlapping authority systems:

ResourceProfile:
  policy template selected by identity, session, account, or service profile;
  it is not spendable authority by itself.

ResourceLedger:
  coarse accounting and quota owner for a resource class. It records and
  enforces limits, including non-realtime CPU share/runtime budgets where the
  scheduler has not minted finer scheduling contexts.

SchedulingContext:
  spendable CPU-time authority with budget, period, relative deadline,
  priority/criticality, CPU mask, and overrun policy.

CpuIsolationLease:
  placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
  set. It does not grant CPU-time credit and must charge consumed time through
  a SchedulingContext or coarse scheduler ResourceLedger.

NoHzEligibility:
  a reviewed claim or hint that a thread, ring, poller, or island may use nohz
  isolation if the scheduler can prove the current CPU state allows it.

NoHzActivation:
  the scheduler-proven current CPU state that actually suppresses ticks.

RealtimeIsland:
  admitted bundle of SchedulingContexts, memory reservations, device
  reservations, rings, endpoint/service constraints, and optional
  CpuIsolationLeases.

Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.

Layer 1: Tickless Idle

Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.

Clocksource

Add a monotonic clock layer:

#![allow(unused)]
fn main() {
pub fn monotonic_ns() -> u64;
}

The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.

Required invariant:

monotonic_ns() never moves backwards on one CPU.

Clockevent

Add a small scheduler timer backend boundary:

#![allow(unused)]
fn main() {
trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}
}

The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.

Deadline Waiters

Convert timeout state from tick counts to absolute deadlines:

#![allow(unused)]
fn main() {
struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}
}

Affected paths:

  • Timer.sleep;
  • cap_enter(timeout_ns);
  • ParkSpace timeout;
  • future process/thread wait timeouts;
  • network poll deadline through NetworkPollClock.

Waiter storage remains bounded. No interrupt path may allocate.

Network Poll Clock

The kernel-resident networking path is scheduler-polled. Rather than keep every network-coupled lease in ForcedPeriodic, the in-kernel virtio-net poll is now routed off a lease-isolated CPU (landed 2026-06-04, scheduler-nohz-network-poll-housekeeping-routing): virtio::poll_scheduler consults sched::current_cpu_lease_nohz_active() and skips driving the poll from a CPU inside a lease-backed tick-suppression window, so that CPU no longer needs the periodic tick to make network progress. The always-ticking housekeeping CPU the lease admission already requires keeps servicing virtqueue completions and pending network-waiter scans. The CpuIsolationLease activation preflight reflects this with a network_polling=routed-periodic-network-polling- to-housekeeping-cpu admit label when a housekeeping CPU is available, failing closed (rejected-network-polling-no-housekeeping-cpu-to-relocate, and the lease is refused at create when no housekeeping CPU exists) otherwise. The longer-term explicit poll-deadline interface below remains the target for fully removing the dependency on a housekeeping CPU continuing to tick:

#![allow(unused)]
fn main() {
trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
}

next_poll_deadline_ns lets the scheduler include TCP/runtime timers in earliest_global_deadline(). poll_until_budget prevents network progress from becoming an unbounded idle-exit or interrupt path. A CPU with active networking may enter tickless idle only when the network runtime is inactive or has exposed a bounded deadline through this interface.

Kernel Idle

Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:

IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle        -> wake/check scheduler without fake user context

Idle entry shape:

if no runnable work:
    deadline = earliest_global_deadline()
    clockevent.program_oneshot(deadline - now)
    enter_kernel_idle()

The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.

Tickless State

Per CPU:

Periodic:
  normal scheduler tick active

TicklessIdle:
  no runnable thread
  one-shot local timer programmed for earliest deadline
  CPU in kernel idle

ForcedPeriodic:
  fallback when a subsystem still needs regular polling

Enter TicklessIdle only when:

run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven

Keep periodic preemption whenever there is runnable contention. Even one runnable user thread remains periodic until Ring v2, CPU accounting, and timer-side polling dependencies are resolved.

Layer 2: SQPOLL NoHz

SQPOLL full-nohz is a later CPU ownership mode:

full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.

Required prerequisites:

  • Ring v2 or equivalent per-thread rings;
  • one SQ consumer per ring, including implemented syscall-mode leases and bounded SQPOLL mode transitions;
  • per-CPU scheduler ownership;
  • reschedule IPI and idle-to-runnable handoff;
  • at least one housekeeping CPU;
  • explicit placement of network polling away from isolated CPUs.

Current Phase F status: CpuIsolationLease and nohz telemetry exist, the housekeeping/deferred-work placement child records selected online housekeeping CPU masks plus deferred cleanup, timer/deadline, network polling, IRQ-affinity, accounting-target, and cleanup-latency placement or rejection labels, bounded SQPOLL ring mode can progress from periodic service or one current-thread syscall/producer-wake batch, and the clockevent/deadline substrate has split monotonic clocksource reads from LAPIC clockevent programming. The clockevent one-shot’s firing precision is proven, not just its programming: a runtime-reprogrammed TICK_NS/2 one-shot armed over the live periodic timer is measured to fire at its requested sub-tick instant (~5 ms for a 5 ms request, far under the 10 ms tick, with the current-count correctly reset to the sub-tick value), and the kernel-mode-fire path restores a live periodic timer so a one-shot consumed without running schedule() cannot strand the CPU with no timer source (make run-scheduling-context).

The monotonic clocksource discipline is now sub-tick-accurate as well. The periodic discipline step previously floored every fire to epoch + TICK_NS (max(tsc_interpolated, epoch + TICK_NS)), which inflated a real sub-tick interval to a full tick and hid sub-tick deadlines from the accounting clock. discipline_clocksource_tick now trusts the TSC interpolation at sub-tick granularity and falls back to the TICK_NS floor only when the interpolated advance is implausibly small (below MIN_DISCIPLINED_ADVANCE_NS), preserving a minimum forward rate against a degenerate TSC (publish_monotonic_ns enforces only non-decreasing time, not a minimum rate). A boot proof advances a real TICK_NS/2 interval through one discipline step and asserts monotonic_ns() tracked the sub-tick delta rather than the full-tick floor (make run-scheduling-context).

The first activation increment is now real: the CpuIsolationLease activation preflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window. When the preflight finds every proof obligation satisfied – exactly one runnable caller on the target CPU, ready housekeeping CPU, no local deferred-cleanup/timer dependency, valid accounting target, live monotonic clocksource, non-stale one-SQ-consumer, and bounded revocation latency – and the target CPU is the CPU running the preflight, it masks the periodic LAPIC tick and arms a bounded one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). Network polling is now routed to a housekeeping CPU rather than kept read-only fail-closed (landed 2026-06-04): the in-kernel virtio-net poll skips driving from a lease-isolated CPU (virtio::poll_scheduler consulting sched::current_cpu_lease_nohz_active()), so the admission network_polling gate flips to a routed-periodic-network-polling-to-housekeeping-cpu admit when a housekeeping CPU is available and fails closed otherwise. IRQ affinity is now routable in a bounded form (landed 2026-06-04): when a lease opts in, the activation path reprograms the leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal for any ring-coupled lease whose IRQ dependency cannot be safely rerouted. The live reroute is presently scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto a CPU that is actively scheduling stalls forward progress on that destination CPU, so a general “reroute onto any housekeeping CPU regardless of occupancy” admission remains future work behind a real destination-quiescence gate or a delivery backend without that re-evaluation cost. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic LAPIC tick first, before ordinary work continues. Generic full-nohz for ordinary budgeted compute threads is now admitted through explicit SchedulingContext-targeted compute leases. A generic SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the ring is in SQPOLL running/sleeping mode with a live owner, one SQ consumer, and bounded producer-wake/deadline rollback. Broader userspace-poller/device-queue admission and production realtime island admission remain future work; the periodic tick stays the fail-closed fallback everywhere else. Timeout-based auto-revoke has since landed: a lease created with leaseLifetimeNs > 0 auto-revokes on first observation past its deadline (reason=lease-expired) and a tickless CPU under it rolls back at the next recheck (lease-lifetime-expired) (docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md). SQPOLL-driven activation is now proven by make run-scheduler-generic-sqpoll-nohz: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, producer wake drives bounded non-periodic service, and revoke/stale-owner rollback fails closed. The per-CPU idle thread has also landed – the scheduler idle path is now a CPL0 per-CPU kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).

The non-atomic createLease-vs-revokeGrant SMP window (kernel/src/cap/cpu_isolation_pool_grant.rs:472-483) – a createLease that passes the grant live-check on one CPU can register its lease just after a concurrent revokeGrant on another CPU snapshotted the registry, so that lease is not cascade-terminated and lingers until its own leaseLifetimeNs or an explicit revoke – is now a modeled, bounded residual rather than a prose-only caveat. The Alloy lease/grant authority model represents it explicitly as the WindowLingering set and checks that no live lease reaches a revoked grant outside it. That the lingering lease was nonetheless legitimately authorized (no lease is ever minted through an already-revoked grant) is a temporal mint-time-vs-revoke property the static relational model does not itself check; it rests on the code’s create-time minted_grant_live gate (cpu_isolation_pool_grant.rs:484), which fails closed before admission. Taken together this is a bounded capacity-hold window, not an authority escalation. The companion TLA+ model checks the two-lock teardown the cascade and prune share (generation advances exactly once, no capacity double-free, no stranded generation). Both run under make model-scheduler-lease-alloy / make model-scheduler-lease-tla; see models/scheduler/README.md.

The nohz/tickless activation-rollback path – the lock-free NOHZ_ACTIVE_CPUS bit read from ISR context against the locked dispatch.nohz_activation[slot] record, with IPI-delivered cross-CPU activation/rollback – is likewise now a checked model rather than a prose-only invariant. The TLA+ lifecycle model (models/scheduler/nohz_activation.tla) checks that no scheduler CPU is ever left timer-less (a fired one-shot always has the contention fallback re-arm enabled, and is always eventually re-armed), that the lock-free bit and the locked record always reconcile (the bit-set/record-cleared and record-present/bit-cleared divergences the rollback and contention paths produce are transient), and that a staled remote activation is dropped rather than applied to a newer lease (a staled generation is never committed, and a recorded generation staled by the cap-side maybe_expire path is always rolled back by the stale-lease-generation disqualifier). A focused Loom test pins the lock-free-bit ↔ locked-record reconciliation under the C11 memory model. Both run under make model-scheduler-nohz-tla / make model-scheduler-nohz-loom; see models/scheduler/README.md.

Ring mode:

#![allow(unused)]
fn main() {
enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}
}

In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter waits for completions and may wake a sleeping poller, but it does not drain SQ.

SQPOLL state:

Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping

The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ tail, acquire-loads flags, and invokes a wake path only if the poller has gone to sleep.

The race-free sequence is normative.

Poller before sleeping:

#![allow(unused)]
fn main() {
flags.fetch_or(NEED_WAKEUP, SeqCst);

let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}

park();
}

Producer:

#![allow(unused)]
fn main() {
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);

let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}
}

The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a producer can publish a new SQE after the poller checks the tail but before it parks, losing the wake.

The NEED_WAKEUP publication must also be ordered before the final tail recheck by a full store-to-load barrier. A SeqCst RMW is the simplest portable rule for the ABI text; an implementation may substitute an explicitly reviewed architecture-specific fence or park primitive that provides the same ordering. A plain release store or release-only RMW is not sufficient for this protocol.

The producer must likewise order the SQ tail publication before checking NEED_WAKEUP. The normative sequence uses a full fence between sq_tail.store(..., Release) and flags.load(Acquire); an implementation may substitute an explicitly reviewed equivalent that prevents the producer from missing NEED_WAKEUP while the poller misses the new tail before parking.

An SQPOLL CPU may suppress the periodic tick only if:

cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online

If any condition fails, restore periodic tick or migrate the unrelated work.

NoHz Activation Proof Obligations

To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:

exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy

The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.

Layer 3: AutoNoHz CPU Lease

The long-term design should split eligibility from activation.

Eligibility says a thread, process, ring, or realtime island may use nohz isolation:

#![allow(unused)]
fn main() {
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}
}

Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.

Future lease shape:

CpuIsolationLease:
  owner process/session
  allowed CPU set
  allowed mode: poller/compute/kernel-worker
  accounting target, not CPU-time credit
  revocation policy

Housekeeping must be explicit:

Housekeeping CPU set:
  global timers
  deferred frees
  cleanup
  statistics
  non-critical kernel workers
  debug/watchdog
  load balancing and migration control

Layer 4: Deadline Metadata

Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE envelope and not in variable side metadata. The current fixed SQE layout should not be silently reinterpreted; add these fields through a versioned CapSqeV2/ring ABI gate when the transport is ready.

#![allow(unused)]
fn main() {
#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning

    deadline_ns: u64,  // absolute monotonic deadline, 0 = none
    qos_flags: u32,   // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32, // 0 = current/default scheduling context
}
}

deadline_ns is an absolute monotonic timestamp. It is request freshness metadata, not a promise of nanosecond wakeup precision. The kernel may round timer programming to clockevent granularity, coalesce timers where policy allows, or report a miss when dispatch observes the timestamp has already expired. The field remains u64 nanoseconds because absolute u64 ns values are simple, tracing-friendly, and shared with existing timeout surfaces; a u64 microsecond field saves no ABI space.

Only consider a compact profile if SQE space becomes critical:

#![allow(unused)]
fn main() {
deadline_delta_us: u32
}

That profile would be a soft-deadline compact transport shape only. It is not the primary realtime or SchedulingContext ABI and must not replace deadline_ns for admitted realtime work.

ABI negotiation uses both bootstrap metadata and a runtime query surface:

#![allow(unused)]
fn main() {
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
}
  • Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
  • RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs; the kernel and capos-rt import the same definition rather than carrying local copies.
  • A future RuntimeInfo/SystemInfo query returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings.
  • cap_enter rejects unsupported SQE versions or entry sizes with stable transport errors such as CAP_ERR_UNSUPPORTED_RING_ABI and CAP_ERR_UNSUPPORTED_SQE_VERSION.
  • Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.

Suggested flags:

DROP_IF_LATE:
  if now > deadline_ns before dispatch, post DEADLINE_EXPIRED

ALLOW_LATE:
  dispatch anyway, but CQE/telemetry marks late

PROPAGATE_DEADLINE:
  endpoint CALL/RETURN carries deadline metadata to server-side request

DEADLINE_ORDERED:
  SQPOLL may reorder within a bounded window only when all reorder-safety
  checks below pass

NO_BLOCKING_PATH:
  reject if target method/op is not declared realtime-safe

Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.

DEADLINE_ORDERED is valid only when all of the following are true:

the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness

Ordered side effects such as write A; write B; flush or lock; mutate; unlock must not be deadline-reordered unless the target method contract explicitly defines that sequence as reorder-safe.

Layer 5: SchedulingContext

CPU time should become a capability-controlled object:

#![allow(unused)]
fn main() {
struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}
}

Kernel responsibilities:

  • decrement remaining budget by actual runtime;
  • replenish budget by period;
  • throttle or fault a thread on depletion;
  • enforce CPU mask and scheduling eligibility;
  • dispatch among eligible contexts by the selected realtime policy;
  • prevent untrusted SQE bytes from minting budget.

Policy-service responsibilities:

  • admission control;
  • budget/period/priority selection;
  • CPU-isolation lease policy;
  • overload response;
  • telemetry and retuning.

Layer 6: Donation

Synchronous capability calls need scheduling-context donation:

client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy

Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.

Donation semantics must be fixed before implementation:

max donation call depth:
  bounded per SchedulingContext or RealtimeIsland; overflow fails closed.

nested donation:
  nested synchronous calls carry the current donated context until the depth
  bound, unless a callee uses its own admitted context by explicit policy.

cycle handling:
  a donated context may not re-enter a thread already on its donation stack;
  cycles fail with a typed realtime/donation error.

partial failure:
  budget already consumed stays charged to the context that ran the work.
  rollback of authority or memory is separate from CPU charge rollback.

timeout propagation:
  the earliest of request deadline, scheduling-context deadline, and explicit
  call timeout bounds downstream execution.

server-side blocking:
  a passive server running on donated context may block only on approved
  realtime-safe waits or synchronous calls that continue donation.

return on exception:
  application exceptions, transport errors, and cancellation return the
  context to its previous owner before CQE/error delivery.

async endpoint queues:
  donation does not cross ordinary async endpoint enqueue by default. Async
  donation requires an explicit future token/lease design.

Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.

Layer 7: RealtimeIsland

RealtimeIsland admits a whole loop or graph:

#![allow(unused)]
fn main() {
struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}
}

Admission requires:

  • total budget fits period/deadline constraints;
  • all hot-path buffers are preallocated;
  • hot-path memory is committed and resident before start;
  • guaranteed hot-path memory uses the OOM proposal’s MemoryResidency policy as pinned or secret; normal memory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result as pinned, not as normal;
  • all caps and policy decisions are resolved before start;
  • no expected page faults on the hot path;
  • no unbounded lock acquisition;
  • no blocking endpoint calls inside callback loops;
  • no allocation, logging, service discovery, or provider credential work on the realtime path;
  • IRQ and deferred work are bounded or moved outside the island.

Failure semantics must be typed:

CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT

CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.

Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads

The Layer 1-7 primitives above are mechanism: NoHzEligibility is a reviewed claim, CpuIsolationLease is the placement authority, SchedulingContext and the coarse ResourceLedger own CPU-time budget, and NoHzActivation is the scheduler proof that current CPU state allows tick suppression. They do not answer who decides to issue an eligibility hint for an ordinary user thread that was not pre-declared as a realtime island or kernel SQPOLL worker, or what observation justifies the issuance. That decision is policy, and it belongs in the user-space scheduler policy service described in Stage 7 of scheduler-evolution-proposal. This section records the userstories that motivate the responsibility and the bounds the policy service must enforce so auto-promotion never becomes an implicit “unlimited CPU-hold” grant.

Core property: promotion is placement, not budget

Auto-promotion adds isolation; it never mints CPU-time authority. A policy-issued CpuIsolationLease only removes tick and scheduler noise while its bound thread consumes time that was already authorized through its SchedulingContext or coarse ResourceLedger. SchedulingContext budget exhaustion is now folded into the same nearest-deadline timer as nohz revocation/timer work, so a tick-masked CPU is re-observed at the budget deadline rather than at a later periodic tick. When budget exhausts, or when any existing Layer 3 activation obligation stops holding, the existing fail-closed rollback path restores the periodic tick. Priority-aware revocation of the lease itself when an equal-or-higher-priority runnable arrives is new Phase H surface (see “Bounds the policy service must enforce” below); today’s Phase F rollback only restores ticks on the leased CPU and does not terminate the lease.

This separation answers the obvious objection. A busy-spinning thread cannot escalate itself into permanent CPU exclusivity, because the spin drains its allotted budget at the same rate periodic scheduling would have drained it. If the operator has granted enough budget to saturate a core, auto-promotion removes tick interference while that budget is consumed; if not, the same authority that would have throttled the thread under periodic scheduling still throttles it under nohz.

Trigger: “thread appears capable of utilizing a full CPU core”

The trigger is not a fixed percentage threshold inside the kernel. The kernel exports per-thread observation; the policy service synthesizes a saturation-capability signal from those observations and decides what “capable of utilizing a full CPU core” means for a given account, session, or service profile. Plausible inputs the policy service may combine:

  • runtime accumulated over a rolling window approaches the wall-clock window the thread had on its assigned CPU;
  • voluntary-block count over the same window stays low (the thread is not IPC- or IO-bound at a rate that would lose the benefit);
  • runnable-but-not-running time stays low when the thread is the only runnable entity on its CPU, or correlates with placement contention rather than IO when it is not.

Concrete window length, smoothing, and the synthesis rule are policy-service choices, replaceable without ABI churn. As of 2026-05-30 the kernel exports the observation inputs the heuristic consumes as ordinary (non-measure) per-thread state: runtime_ns/virtual_runtime_ns, voluntary_blocks, preemptions, and a cumulative runnable_accumulated_ns (runnable-but-not-running time) are all returned by SchedulingPolicyCap.snapshot @2. voluntary_blocks and preemptions were promoted out of cfg(feature = "measure") and runnable_accumulated_ns was added at the run-queue enqueue/select boundary; only migrations remains measure-gated. This closes the Phase H “monitoring/status surface that exports per-thread saturation observation” prerequisite. The surface exports raw cumulative counters only: no fixed threshold and no windowing live in the kernel – the policy service synthesizes the saturation signal.

Userstories

  1. Long-running compute tenant with declared budget. A model-training, video-encoding, or HPC build job is admitted with a SchedulingContext (or coarse ResourceLedger allocation) sized for sustained near-core utilization on a declared CPU pool. The policy service observes the thread saturating the pool’s CPU share, issues a bounded CpuIsolationLease against the pool, the scheduler proves the activation obligations from Layer 2/3, and ticks are suppressed for as long as the thread keeps consuming the granted budget. The lease ends when the budget exhausts, the job completes, the operator revokes the pool, or the saturation signal subsides.

  2. Userspace poller that earned isolation. A service polls a ring or device queue (a candidate AutoUserspacePoller in the NoHzKind taxonomy). The policy service sees consistent saturation with low voluntary blocking, recognizes the AutoUserspacePoller eligibility kind, and issues a lease. The bounds are the same as for the kernel SQPOLL path; only the consumer differs.

  3. Account-scoped auto-claim pool. An operator pre-declares “account X may auto-claim up to N isolated CPUs from pool P, maximum auto-lease lifetime L, with revocation latency R, charging to ledger E.” The policy service monitors threads owned by X, issues leases against P when saturation capability is observed, and refuses promotion when X already holds N leases or when no CPU in P currently satisfies the activation proof. Without the operator declaration the policy service does not auto-promote.

  4. Background agent that bursts to full-core compute. A general-purpose agent process does not normally saturate a core. When it briefly does (a planning phase, a build step, a local inference call), the policy service may issue a short-lifetime lease if the agent’s account has authorized auto-promotion. When the burst ends the signal subsides; the lease is not renewed.

Bounds the policy service must enforce

For every auto-issued lease the policy service records:

lifetime_ns:               bounded; shorter than admin-issued leases by
                           default; renewal requires re-observing the
                           saturation signal.
max_revocation_latency_ns: bounded by NoHzEligibility.max_revocation_latency_ns;
                           cannot exceed the operator/account policy.
accounting_target:         a live SchedulingContext or coarse ResourceLedger;
                           the lease does not mint CPU-time authority.
auto_claim_pool:           the pre-authorized CPU set; no implicit fallback to
                           system-wide isolation.
fairness_preemption:       another runnable entity at equal-or-higher policy
                           priority terminates the lease if no other CPU
                           authorized by both the pool and lease mask is
                           eligible.

Two of these bounds map to existing kernel-enforced surfaces: max_revocation_latency_ns is already a field on NoHzEligibility and the closed Phase F activation preflight; accounting_target is already a field on NoHzEligibility and the live SchedulingContext/ResourceLedger authority.

The other three bounds need new kernel-enforced surfaces before the heuristic can ship and are named as Phase H prerequisites:

  • lifetime_ns: LANDED 2026-05-30. CpuIsolationLeaseSpec now carries leaseLifetimeNs @6 (0 = no expiry, the default). A lease records an absolute monotonic expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired), bounded by maxRevocationLatencyNs. This is the bounded-lifetime guarantee the auto-issued placement lease needs, so a compromised, blocked, or malfunctioning policy service cannot leave an auto-issued lease holding the CPU indefinitely. The bounded renewal primitive LANDED on top of this: CpuIsolationLease.renew @4 pushes expires_at_ns forward to now + leaseLifetimeNs (clamped to the same one-hour ceiling read_spec enforces), keeping the same (leaseId, generation), accounting binding, and nohz activation state – distinct from re-minting a fresh lease. It is callable only before expiry (a revoked, auto-revoked, or past-deadline lease stays staleGeneration and is not resurrected; an unbounded leaseLifetimeNs = 0 lease reports notRenewable), and the renewed deadline is propagated to a tickless CPU’s nohz activation record so the lease-lifetime-expired disqualifier no longer rolls it back at the old deadline; CpuIsolationLeaseInfo.expiresAtNs echoes the deadline read-only. Only the Phase H renewal heuristic – re-observing the saturation signal to decide whether to call renew on a near-expiry lease – remains future policy-service work on top of this primitive.
  • auto_claim_pool and per-account capacity (N in userstory 3): the operator-declared CPU-pool descriptor LANDED 2026-05-30, making a non-default poolId meaningful for the first time. CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), and the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: the default pool 0 plus exactly one declared non-default pool 1 over a single CPU). The create-time admission gate now looks the pool up: an undeclared poolId is rejected invalidSpec; a declared pool whose CPU mask the lease’s allowedCpuMask exceeds is rejected invalidSpec; a declared pool with a subset mask is admitted and its id/mask are echoed read-only through CpuIsolationLeaseInfo (admittedPoolId/admittedPoolCpuMask) (proof make run-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec for the undeclared id, declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0). The declared-pool table is now operator-sourced (LANDED 2026-05-30): the kernel installs it from the boot manifest SystemConfig.cpuIsolationPools @14 (a List(CpuIsolationPoolDescriptor)), with the in-kernel constant as the fail-closed default when the manifest omits the list, and validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected). The boot line cpu-isolation: declared-pools source=manifest count=3 default_pool_id=0 nondefault_pool_id=1 nondefault_pool_cpu_mask=0x2 proves the source (proof make run-scheduler-cpu-isolation-lease; the kernel-default fallback is proven by cargo test-config decode/empty assertions). The descriptor now also carries a per-pool live-lease capacity bound (poolMaxLeases @2, LANDED 2026-05-31): a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted (0 = unbounded, preserving the default pool 0 and every existing producer). The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third non-overlapping create (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), then reclaims after a revoke (pool_capacity_reclaimed=ok), proving the bound is live-count not cumulative. The account identity and per-account N then landed on top of this counter (LANDED 2026-05-31): CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (matching both admitted_pool_id and account_id) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted on the plain lease path. The authentication half LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp; source cpu_isolation_pool_grant; kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant that binds one authenticated account to one declared pool. Its createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the same lease-create admission path (cpu_isolation::create_lease_for_caller) – so the per-account bound is unforgeable by cap-possession: a holder cannot assert another account to evade poolMaxLeasesPerAccount. The initial single-grant proof used account 7 bound to pool 2; the current make run-scheduler-cpu-isolation-pool-grant proof boots manifest-declared grants. The grant binding is now operator-declared (LANDED 2026-06-01): the manifest SystemConfig.cpuIsolationPoolGrants table seeds the bound (account, pool) pairs (mirroring the cpuIsolationPools table), and the cpu_isolation_pool_grant / cpu_isolation_pool_grant_secondary sources stage seeded binding index 0 / 1, so an operator can pre-authorize multiple distinct accounts/pools, each staged as its own bootstrap grant cap. An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, preserving a usable single default grant when a manifest-sourced pool table omits pool 1. make run-scheduler-cpu-isolation-pool-grant now boots a two-entry table (account 5/pool 1, account 8/pool 2) and proves each grant stamps its OWN bound account with the per-account bound still enforced. make run-scheduler-cpu-isolation-pool-grant-default boots the empty-list fallback with pool 1 omitted and proves the synthesized (account 7, pool 0) grant is usable. Runtime grant minting landed 2026-06-02 22:24 UTC (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist pair is refused unauthorized, so the minter is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same make run-scheduler-cpu-isolation-pool-grant smoke mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist mint is refused. Grant-revocation lifecycle landed 2026-06-03 17:11 UTC (CpuIsolationGrantMinter.revokeGrant), closing (c): a runtime-minted grant carries a revocable (grantId, generation); revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration and mints nothing, and revocation cascades to every live lease minted through that grant – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease, so per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed; seeded bootstrap grants are not minter-owned and stay un-revocable. The same make run-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. No pool authority is minted from holding a lease cap; the kernel stays the fail-closed admission gate.
  • fairness_preemption: LANDED 2026-06-02 21:17 UTC. The Phase F rollback path now compares policy priority at the existing nohz recheck site: when a second runnable entity appears on the leased CPU at equal-or-higher WFQ policy priority (latency_class, weight) than the captured leased thread, and no sibling CPU authorized by both the admitted pool and the lease allowedCpuMask is eligible to host the lease, the kernel terminates the CpuIsolationLease itself (fairness-preempted ... result=lease-terminated) rather than only restoring the periodic tick, bounded by maxRevocationLatencyNs. The termination runs the same generation-advancing cleanup leaseLifetimeNs expiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequent info/revoke reports staleGeneration and placement/account capacity is freed without waiting for the holder’s next cap call; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The kernel supplies the comparison and fail-closed termination; the policy service remains the issuer and bookkeeper of the saturation signal. Re-placement of the leased thread onto an eligible sibling CPU (instead of terminating) remains generic-full-nohz work; the “no sibling eligible” condition is recorded.

The policy service is the issuer and the bookkeeper of the synthesized saturation signal; the kernel remains the authority gate, the activation prover, and the fail-closed rollback path – including for the three not-yet-existing surfaces above.

Explicit non-goals

  • The kernel does not contain a saturation-detection rule of its own. It exports observation; it does not synthesize the signal.
  • Auto-promotion does not grant unlimited CPU-hold. The lease is bounded by lifetime, budget, revocation, and pool capacity; absent a pre-authorized pool, no auto-promotion occurs.
  • Auto-promotion does not grant realtime authority. RealtimeIsland admission remains a separate, stricter path with preallocation, deadline, and no-blocking proofs.
  • Auto-promotion does not bypass donation, fairness, or session-lifecycle invariants. Process exit, session logout, and explicit revoke still tear the lease down through the existing Layer 3 rollback.

Telemetry Requirements

Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:

scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count

These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.

The ticks_suppressed{cpu,mode} / scheduler_tick_count{cpu} evidence is realized as an asserted proof line on the lease path: make run-scheduler-cpu-isolation-lease now counts genuine periodic LAPIC fires per CPU (a fire is counted only when neither the lease-backed nor the idle tick-suppression bit is set, so the one-shot replacement is never miscounted) and, on lease nohz rollback, emits cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>. The harness asserts that over a bounded masked window the leased CPU recorded actual near zero while expected was substantial – the periodic tick demonstrably stopped, not merely that the mask write was issued – and that a bounded post-rollback cpu-isolation: nohz restored-rate window shows the periodic rate returning. This is bounded proof-line evidence, not yet a durable SchedulingPolicyCap/monitoring telemetry field; the persistent ticks_suppressed surface and the generic-full-nohz path’s inheritance of the same measured assertion remain future telemetry work.

Implementation Sequence

  1. Add timer/scheduler instrumentation around the existing periodic tick.
  2. Add monotonic_ns() backed by a clocksource that is not derived from the scheduler tick, and switch Timer.now plus scheduler accounting to that clocksource while keeping periodic scheduling. Completed for normal QEMU/x86_64 by the Phase F clockevent/deadline substrate.
  3. Convert timeout waiters to deadline_ns. Completed for Timer.sleep, finite cap_enter, and park timeouts by the Phase F clockevent/deadline substrate.
  4. Add LAPIC one-shot programming, periodic restore state, and a focused one-shot smoke. Completed as a disabled-nohz substrate proof by the Phase F clockevent/deadline substrate.
  5. Replace user-mode idle with kernel/per-CPU idle while keeping periodic ticks. Completed: the scheduler idle path is now a CPL0 per-CPU kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).
  6. Enable tickless idle only when there is no runnable work. Completed by docs/tasks/done/2026/scheduler-tickless-idle-step6.md: true-idle CPUs with no runnable non-idle work, no active nohz lease, no local deferred cleanup, no cap-enter polling dependency, and a one-shot LAPIC clockevent mask the periodic tick and arm a bounded one-shot at the next Timer/ParkSpace deadline or the 100 ms idle housekeeping floor. The scheduler restores the periodic tick before ordinary non-idle dispatch, on reschedule IPIs, and on backend/refusal rollback. Cap-enter polling waiters and ready-but-budget-throttled SchedulingContext retry windows remain periodic until the legacy terminal/network/IRQ polling and scheduling-context retry surfaces move behind explicit deadlines or housekeeping placement.
  7. Route the in-kernel virtio-net poll off a lease-isolated CPU to the housekeeping CPU (landed 2026-06-04); an explicit NetworkPollClock poll deadline remains the longer-term target.
  8. Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
  9. Land Ring v2 per-thread ring ownership and completion routing.
  10. Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
  11. Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
  12. Add CPU isolation leases and housekeeping CPU placement.
  13. Prove SQPOLL progress through a wake/deadline path that does not depend on periodic scheduler ticks. Completed for bounded current-thread syscall/producer-wake progress by the Phase F SQPOLL nohz-progress child.
  14. Enable SQPOLL nohz on isolated CPUs for explicitly leased caller-thread rings. Landed 2026-06-07 09:45 UTC; broader userspace-poller/device-queue policy issuance remains separate.
  15. Add request deadline_ns metadata and typed late/drop CQE outcomes.
  16. Add SchedulingContext and admission-controlled realtime islands.
  17. Add generic full-nohz admission for ordinary budgeted compute threads through explicit SchedulingContext-targeted CpuIsolationLease preflight. Landed 2026-06-06 09:44 UTC; policy-service issuance remains separate.
  18. Add the user-space policy-service AutoNoHz placement heuristic. The kernel exports per-thread saturation observation through the monitoring/status surface; the policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision and issues bounded CpuIsolationLease grants against pre-authorized account or session CPU pools. The auto-revoke timeout primitive (leaseLifetimeNs) landed 2026-05-30 15:22 UTC at 84c1c5ba, priority-aware fairness lease termination landed 2026-06-02 21:28 UTC at cae825a4 with immediate release remediation at ca28ef63, runtime grant minting (CpuIsolationGrantMinter) landed 2026-06-02 22:25 UTC at 5c5c63cc, and the grant-revocation lifecycle (CpuIsolationGrantMinter.revokeGrant with cascade-to-leases) landed 2026-06-03 17:11 UTC, completing the pool-grant authority surface. The local userspace policy-service proof landed 2026-06-07: it reads the per-thread saturation counters, denies a voluntarily blocking worker, issues a finite grant-stamped full-nohz lease only after a saturated local window, renews only after re-observation, and lets stopped renewal expire fail-closed. A reusable production policy daemon with profile-driven smoothing, cross-process target discovery, and richer operator policy remains future work.

Verification

Tickless idle gates:

make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn

Additional tickless proof:

1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention

SQPOLL gates:

thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
  poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
  producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake

Realtime gates:

deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected

Decision

Adopt this staged direction:

Tickless idle:
  yes, after the kernel/per-CPU idle context and activation proof. The
  clocksource/clockevent split is implemented.

Generic full-nohz:
  implemented for explicit budgeted compute leases targeting a live
  SchedulingContext. Automatic issuance and unbudgeted ordinary threads remain
  out of scope.

SQPOLL nohz:
  yes, for explicitly leased caller-thread rings whose SQPOLL poller is live,
  single-consumer, and bounded by producer wake plus rollback deadlines.

AutoNoHz placement for ordinary threads:
  yes, but only as a user-space policy-service decision that issues a
  bounded CpuIsolationLease against a pre-authorized CPU pool. The lease
  adds isolation; it never mints CPU-time authority. The "thread appears
  capable of utilizing a full CPU core" signal is synthesized in the
  policy service from observations the future monitoring/status surface
  must export, not as a fixed kernel threshold.

Realtime:
  `SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
  authority that provides CPU time.

Proposal: mdBook Documentation Site

Turn the existing Markdown documentation into a navigable mdBook site that explains capOS as a working system, while keeping proposals and research as deep reference material.

The current docs are useful for agents and maintainers who already know what they are looking for. They are weaker as a reader path: a new contributor has to jump between README.md, docs/roadmap.md, docs/tasks/README.md, proposal files, research reports, and source code before they can form an accurate model of the system. The mdBook site should fix that by adding a concise, current system manual above the existing archive.

Goals

  • Make the first reading path obvious: what capOS is, how to build it, what works today, and where the important subsystems live.
  • Separate implemented behavior from future design, rejected ideas, and research background.
  • Preserve existing long-form proposal and research documents instead of rewriting them prematurely.
  • Give architecture pages a repeatable structure so future edits do not turn into ad hoc status notes.
  • Make validation visible: each architecture page should name the host tests, QEMU smokes, fuzz targets, Kani proofs, Loom models, or manual checks that support its claims.
  • Keep the docs useful from a local clone, without requiring hosted services, databases, or custom frontend code.

Non-Goals

  • Replacing docs/tasks/README.md. Task records remain operational planning documents; REVIEW_FINDINGS.md is only a tombstone for older links, and docs/roadmap.md is now part of the book while still owning long-range planning.
  • Turning proposals into user manuals by bulk editing every existing document. Long proposal files stay as references until a subsystem needs a targeted refresh.
  • Building a marketing site, blog, changelog, or public product page.
  • Adding MDX, React, Vue, custom components, or a JavaScript application layer.
  • Automatically generating API reference documentation from Rust or Cap’n Proto. That can be evaluated later as a separate documentation track.

Audience

The site should serve three readers:

  • New contributor: wants to build the ISO, boot QEMU, understand the current architecture, and find the right files to edit.
  • Reviewer: wants to verify whether a change preserves the intended ownership, authority, lifecycle, and validation rules.
  • Future agent: wants current project context without having to infer the system from stale proposals or source code alone.

The primary audience is maintainers and agents, not end users. This matters: accuracy, status labels, and code maps are more important than a polished external landing page.

Current State

The repository already has a substantial Markdown corpus:

  • README.md explains the project and core commands.
  • docs/roadmap.md describes long-range stages and visible milestones.
  • docs/tasks/state.toml tracks the selected milestone.
  • docs/tasks/state.toml tracks the selected milestone; task records under docs/tasks/ track active implementation order.
  • docs/tasks/** tracks open remediation, review-finding work, and verification history.
  • docs/capability-model.md is a real architecture reference.
  • docs/proposals/ contains accepted, future, exploratory, and rejected design material.
  • docs/research/ contains prior-art analysis (the capability-systems-survey.md synthesis plus per-system deep-dive reports).
  • docs/*-design.md and inventory files capture targeted design/security decisions.

The weakness is not lack of content. The weakness is keeping the current manual visibly separate from archival planning, proposal, and research material.

Site Shape

The mdBook site should be structured as a book, not as a mirror of the file tree. The current hierarchy is:

  • Start Here: reader orientation and commands.
  • Runnable Demos: current user-visible proofs.
  • System Architecture: current implementation, with code maps and invariants.
  • Security and Verification: threat boundaries, validation workflow, and security inventories.
  • Planning: roadmap, changelog, and backlog links.
  • Design Archive: proposal index plus nested active, future, and rejected long-form design documents.
  • Research Archive: research index plus nested prior-art reports.

All proposal and research files should remain reachable through the sidebar so mdBook builds them, but they should be nested under their indexes rather than listed as peer pages beside the current system manual. Sidebar folding should be enabled so the default reader path stays compact.

Page Standard

Every architecture page should use this shape:

---
status: "Partially implemented."
last_reviewed: "2026-04-27 10:00 UTC"
description: "Page description."
topics:
  - { key: "capabilities-ipc-and-authority", reason: "Explains authority or invocation behavior." }
---

# Page Title

What problem this subsystem solves and why a reader should care.

The preprocessor strips front matter from rendered page content and uses the metadata to regenerate docs/topics.md. A post-build agent asset pass patches final rendered HTML so status, description, and last_reviewed appear as page-head metadata without adding visible status blocks to each page. The same pass adds HTML head discovery links for llms.txt and each page’s Markdown mirror.

The docs build also emits agent-facing static assets in target/docs-site: llms.txt, Markdown source mirrors for pages listed in docs/SUMMARY.md, sitemap.xml, robots.txt, and a Cloudflare Pages _headers file with discovery links. robots.txt includes a comment pointing agents to llms.txt; crawler rules stay in standard User-agent, Allow, Disallow, Sitemap, and Content-Signal fields.

Current Behavior

What exists in the repo today.

Design

How it works, with concrete data flow.

Invariants

Security, lifetime, ownership, ordering, or failure rules.

Code Map

Important files and entry points.

Validation

Relevant host tests, QEMU smokes, fuzz/Kani/Loom checks.

Open Work

Concrete known gaps, linked to task ledger records when relevant.


Architecture pages should normally stay between 100 and 300 lines. Longer
background belongs in proposals or research reports.

## Status Vocabulary

Use explicit status labels only where a reader could reasonably confuse
implemented behavior, accepted design, future design, or rejected material.
Status belongs on the page itself only when the page role is not already
obvious from the page type or nearby index. Put this information in YAML front
matter (`status`, `last_reviewed`, `topics`) as the first block in the file.

Canonical page-level form:

```md
---
status: "Partially implemented."
last_reviewed: "2026-04-25 11:36 UTC"
description: "Canonical page-level metadata layout."
topics:
  - { key: "capabilities-ipc-and-authority", reason: "Describes authority and invocation behavior." }
---

last_reviewed is hand-maintained and uses the same minute-precision, timezone-aware format as status updates in docs/tasks/README.md, docs/roadmap.md, and task records. Get it from date '+%Y-%m-%d %H:%M %Z'; do not infer or round from memory. Use this field for substantial content edits that should reset a reader’s trust.

Use one of these labels:

  • Implemented: behavior exists in the mainline code and has validation.
  • Partially implemented: some behavior exists, but the page also describes missing work.
  • Accepted design: intended direction, not fully implemented.
  • Future design: plausible direction, not selected for near-term work.
  • Rejected: explicitly not the chosen direction.
  • Research note: background used to inform design, not a direct plan.

Add a page-level status label to:

  • proposal pages whose content could be mistaken for current behavior
  • architecture or design pages that mix implemented facts with future or partial behavior
  • design-gate documents whose role is to define an accepted implementation contract before the implementation is complete
  • research pages that would otherwise read like selected design rather than background

Do not add a page-level status label to:

  • orientation, index, command-reference, and workflow pages where the page type already makes the role obvious
  • reader-orientation overview pages whose role is to explain why the design looks the way it does (design bets, project framing) rather than catalogue what is implemented. These pages must point at status.md or the relevant architecture page for implementation state; a mixed “Partially implemented” label on them is misleading because each bullet it covers has its own, different status
  • status summary pages that already classify other documents
  • pages whose content is purely operational and only describes current, validated behavior

When only one section differs from the rest of the page, keep the page-level status for the dominant role of the document and add a local sentence in that section such as Current implementation status: or Current status:. Do not replace the page-level label with timestamped prose unless the timestamp itself is the point.

Avoid ambiguous language like “planned” without a stage, dependency, or status label. When a page mixes current and future behavior heavily, split those sections instead of relying on status text alone.

Content Rules

The docs-scoped authoring contract lives in docs/AGENTS.md; the rules below extend it with site-shape conventions specific to the mdBook manual. Apply the AGENTS.md rules first when editing any file under docs/, then layer the site-shape rules from this proposal.

  • Start with operational facts, not motivation.
  • Prefer concrete nouns: process, cap table, ring, endpoint, manifest, init, QEMU smoke.
  • Name source files when a claim depends on implementation.
  • State authority and ownership rules explicitly.
  • State failure behavior explicitly.
  • Link to proposals and research instead of duplicating long rationale.
  • Keep docs/roadmap.md and docs/tasks/README.md as planning sources, not as content to paste into the book.
  • Do not describe behavior as implemented unless validation exists or the code map makes the claim directly checkable.
  • Do not bury current limitations at the bottom of a long proposal.

Proposal Index

docs/proposals/index.md should classify proposal files instead of listing them alphabetically. A useful classification:

  • Active or near-term:
    • service architecture
    • service object capabilities
    • storage and naming
    • error handling
    • security and verification
    • SMP
    • Ring v2 for full SMP
  • Future architecture:
    • networking
    • userspace binaries
    • shell
    • SSH shell gateway
    • boot to shell
    • user identity and policy
    • cryptography and key management
    • certificates and TLS
    • OIDC and OAuth2
    • volume encryption
    • cloud metadata
    • cloud deployment
    • live upgrade
    • GPU capability
    • formal MAC/MIC
    • browser/WASM
  • Rejected or superseded:
    • rejected Cap’n Proto ring SQE envelope

Each proposal entry should have a one-sentence purpose and a status label.

Research Index

docs/research/index.md is the top-level research index, and the capability/microkernel survey lives at docs/research/capability-systems-survey.md with a “Design consequences for capOS” section near the top. Readers should not need to read every long report to learn which ideas were accepted.

Each long research report should eventually end with:

## Used By

- Architecture or proposal page that relies on this research.
- Concrete design decision influenced by this report.

Diagrams

Use Mermaid only where it clarifies flow or authority:

  • boot flow: firmware, Limine, kernel, manifest, init
  • capability ring: SQE submission, cap_enter, CQE completion
  • endpoint IPC: client CALL, server RECV, server RETURN
  • manifest startup: boot package, init, ProcessSpawner, child caps

Avoid diagrams that duplicate file layout or become stale when a function is renamed. Every diagram should have nearby text that states the same key invariant in prose.

Migration Plan

Phase 1: Skeleton and Reader Path

  • Add book.toml with docs as the source directory and output under target/docs-site.
  • Add docs/SUMMARY.md.
  • Add docs/index.md.
  • Add docs/overview.md.
  • Add docs/status.md.
  • Add docs/build-run-test.md.
  • Add docs/repo-map.md.

Acceptance criteria:

  • mdbook build succeeds.
  • The first section explains what capOS is, how to build it, how to boot it, and where to find the major code areas.
  • Existing proposal and research files are reachable through the sidebar.

Phase 2: Current Architecture Pages

  • Add the first architecture pages:
    • boot flow
    • process model
    • capability ring
    • IPC and endpoints
    • userspace runtime
    • manifest and service startup
    • memory management
    • scheduling
  • Keep docs/capability-model.md as a first-class architecture page.

Acceptance criteria:

  • Each architecture page has status, current behavior, invariants, code map, validation, and open work.
  • Each page distinguishes implemented behavior from future design.
  • At least boot flow, capability ring, IPC, and manifest startup include a concise Mermaid diagram.

Phase 3: Security and Verification Pages

  • Add docs/security/trust-boundaries.md.
  • Add docs/security/verification-workflow.md.
  • Link existing inventories and designs from the security section.
  • Make each security page name the relevant validation commands and review documents.

Acceptance criteria:

  • A reviewer can find the hostile-input boundaries, trusted inputs, and verification workflow without reading all proposals.
  • The security section links to REVIEW.md, docs/tasks/README.md, docs/trusted-build-inputs.md, and docs/panic-surface-inventory.md.

Phase 4: Proposal and Research Curation

  • Add docs/proposals/index.md.
  • Keep proposal and research documents reachable through SUMMARY.md, but nest them under archive groups so they do not dominate the default sidebar.
  • Add status labels to proposal files as they are touched.
  • Add “Used By” sections to research files incrementally.

Acceptance criteria:

  • Proposal status is visible before a reader opens a long document.
  • Rejected and future proposals are not confused with implemented behavior.
  • Research pages point back to the architecture or proposal pages they influence.
  • The default sidebar presents the current manual before backlog, proposal, and research archives.

Maintenance Rules

  • When implementation changes a subsystem, update the corresponding architecture page in the same change when the page would otherwise become misleading.
  • When a proposal is accepted, rejected, or partially implemented, update its status and the proposal index.
  • When docs/tasks/state.toml changes the selected milestone, update docs/status.md only if the public current-system summary changes. Do not mirror every operational task into the docs site.
  • When validation commands change, update docs/build-run-test.md and the affected architecture page.

Tooling Follow-Up

The content proposal continues to assume mdBook because it matches the repo’s Rust toolchain and plain Markdown corpus. The current tooling baseline is:

  • book.toml
  • make docs
  • make docs-serve
  • make cloudflare-pages-build
  • pinned mdbook and mdbook-mermaid downloads in Makefile, with version and SHA-256 inputs catalogued in docs/trusted-build-inputs.md under the mdBook documentation tools row. make docs and make cloudflare-pages-build verify those checksums and the executable versions before rendering the book, and mdbook-mermaid supplies the pinned mermaid.min.js browser bundle used by both mdBook HTML rendering and docs-PDF Mermaid rasterization
  • a small local stylesheet for readability and sidebar spacing

Do not add a frontend package manager, theme framework, or generated site assets unless the content structure proves insufficient. If mdBook becomes too limited after the sidebar, index, metadata, and styling cleanup, the preferred replacement candidate is Astro Starlight because it supports Markdown/MDX, content collections, structured sidebars, built-in docs components, and static Cloudflare Pages output. Docusaurus is better only if versioned public docs, blogging, and a larger external project site become requirements. VitePress is reasonable only if the project wants Vue-oriented customization.

Open Questions

  • Should docs/tasks/README.md remain outside the book and linked from status.md, or should redacted public summaries be generated later?
  • Should long proposal files keep their current filenames, or should accepted designs eventually move from docs/proposals/ into docs/architecture/?
  • Should docs/status.md be manually maintained, or generated from a smaller checked-in status data file later?
  • Should Cap’n Proto schema documentation be generated into the book once the interface surface stabilizes?
  • Should proposal and research indexes eventually be generated from structured frontmatter instead of hand-maintained Markdown tables?

The first implementation commit should be deliberately small:

  1. Add mdBook config.
  2. Add SUMMARY.md.
  3. Add the Start Here pages.
  4. Link existing proposal and research files without rewriting them.
  5. Verify mdbook build.

That gives the project a usable docs site quickly, without blocking on a full architecture rewrite.

Proposal: Userspace TCP/IP Networking

How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”

The host-local Telnet flow on 127.0.0.1:2323 described in Part 2 was a plaintext, loopback-only research demo, not a shippable Telnet service. It exercised the TerminalSession/SessionManager/AuthorityBroker/RestrictedShellLauncher boundary over a real TCP socket on the path toward the SSH Shell Gateway (see SSH Shell Gateway). That target is now retired because it depended on the removed qemu-only kernel TCP listener. Non-loopback exposure, production credential handling, and any treatment of Telnet as a long-lived service remain out of scope.

Historical trust-boundary debt: Phase A/B kept the smoltcp stack, per-port TCP listener and accepted-socket capability state, UDP socket cap state, line discipline byte handler, and Telnet IAC filter inside the kernel. Phase C has now retired that kernel owner: kernel no longer depends on smoltcp, the qemu-only TCP/UDP socket entry points fail closed, and the run-network-client, run-tcp-listen-authority, run-telnet, and run-posix-dns-smoke fixtures exit with retirement diagnostics. The forward path is the userspace network stack over DeviceMmio/DMAPool/Interrupt authority and typed NIC/socket capabilities. New protocol logic belongs in that Phase C userspace stack.

The Device Driver Foundation now has a bounded provider-consumer proof for one selected virtio-net TX route: a manifest-granted service can compose DMAPool, DeviceMmio, and Interrupt authority, validate the selected bounce-buffer descriptor path, publish a bounded provider-owned queue entry, ring the selected notify doorbell after policy gates, and consume the matching used-ring completion through a route-scoped tx_interrupt.wait event. That is proof coverage for a selected manager-owned route, not Phase C completion. It does not grant full NIC ownership, arbitrary MMIO doorbells, hardware ack/mask/unmask ownership, direct DMA, IOMMU programming, broader completion queue ownership, provider storage/NIC drivers, cloud NIC support, or production networking readiness.

This document has four parts:

  • a historical kernel-internal smoke test that proved virtio-net and smoltcp,
  • historical in-kernel capability interfaces for TCP sockets and the Telnet Shell Demo,
  • userspace decomposition after driver authority capabilities exist, and
  • cross-cutting TLS and open design questions.

Part 1: Kernel-Internal Networking (Phase A)

Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.

What’s Needed

  1. PCI enumeration — scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in Cloud Deployment Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
  2. virtio-net driver — init virtqueues, send/receive raw Ethernet frames. Use virtio-drivers crate or implement manually (~600-800 lines)
  3. Timer — PIT or LAPIC timer for smoltcp’s poll loop (retransmit timeouts, Instant::now() support). Not a full scheduler — just a monotonic clock (~50-100 lines)
  4. smoltcp integration — implement phy::Device trait over the in-kernel driver, create an Interface with static IP, ICMP ping, then TCP
  5. QEMU flags — add -netdev user,id=n0 -device virtio-net-pci,netdev=n0 to the Makefile

Current implementation status: PCI enumeration, make run-net, modern virtio PCI transport capability discovery, feature negotiation, RX/TX split-virtqueue initialization, descriptor-accounting guard evidence, ARP resolution, and ICMP echo validation are implemented as lower-layer QEMU fixture evidence. The QEMU default device currently appears as transitional 1af4:1000 but exposes standard modern vendor capabilities; capOS accepts it only after finding bounded MMIO common, notify, ISR, and device-specific config regions. The kernel negotiates VIRTIO_F_VERSION_1, VIRTIO_NET_F_MRG_RXBUF, and MAC when safe, allocates kernel-owned DMA pages for the RX/TX queue metadata plus packet buffers, sets DRIVER_OK, submits device-valid TX descriptors, posts RX descriptors, resolves the QEMU user-mode gateway 10.0.2.2 with ARP from static guest address 10.0.2.15, then validates an IPv4 ICMP echo reply from the gateway, including the reply checksums. The former kernel smoltcp adapter, TCP HTTP smoke, and scheduler-polled socket runtime are retired; the make qemu-net-harness path now asserts the lower-layer QEMU fixture evidence instead of a host-backed kernel TCP proof. Current TCP/UDP socket proof lives in the Phase C userspace network-stack gates, including make run-cloud-prod-userspace-network-stack-smoltcp.

Milestones

  • Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode net). Achieved by commit b56a5c1 at 2026-04-24 15:37 UTC.
  • HTTP: TCP connection to a host-side server, send GET, receive response. Achieved by commit a4f1722 at 2026-04-24 16:47 UTC.

Estimated Scope

~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.

Crate Dependencies

CratePurposeno_std
smoltcpTCP/IP stackyes (features: medium-ethernet, proto-ipv4, socket-tcp)
virtio-driversvirtio device abstractionyes (optional — can implement manually)

Timer Source Decision

Historical Phase B resolution: the scheduler timer advanced the monotonic TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs), and the retained kernel smoltcp runtime used that clock instead of a bounded synthetic 10 ms-per-poll clock. Phase C cleanup removed that retained runtime; scheduler ticks no longer poll kernel smoltcp.

Intermediate Tickless Bridge

The retained smoltcp runtime described below is retired. The bridge rules are archival context for why scheduler-polled kernel networking was not acceptable as a long-term tickless/nohz design. Future socket progress belongs in the userspace stack or an IRQ/deadline-driven device path, not in scheduler polling.

#![allow(unused)]
fn main() {
trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
}

Historical bridge rules:

  • a retained smoltcp runtime would have needed to expose NetworkPollClock before active networking could coexist with tickless idle;
  • the scheduler would have included next_poll_deadline_ns in earliest_global_deadline();
  • poll_until_budget would have been the only scheduler/idle-exit network progress path;
  • the budget would have bounded work done outside ordinary process execution;
  • absent this bridge, active networking would have forced periodic tick;
  • SQPOLL/nohz isolated CPUs would not have run retained network scheduler polling.

QEMU Network Config

ConfigUse case
-netdev user,id=n0 -device virtio-net-pci,netdev=n0Default: NAT, guest reaches host
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0Historical host-local TCP forwarding for the retired Telnet Shell Demo

Part 2: Capability Interfaces — In-Kernel (Phase B)

Phase B turns the Phase A smoke path into first-class TCP capabilities without moving any code out of the kernel. The NetworkManager, TcpListener, and TcpSocket objects become kernel-side CapObjects that user processes invoke through the existing capability ring. The in-kernel smoltcp stack stays where it is; what changes is that it is reached over capability dispatch instead of a hard-coded boot-time call. UDP and raw Nic exposure are not part of this milestone.

Phase B is the first point where a userspace process — the native shell, a boot-package demo, a language runtime — can open a TCP socket. It is also the first point where a visible networking milestone exists at the capability level.

Visible Phase B milestone — Telnet Shell Demo (historical; delivered and later retired with the kernel socket owner). Boot capOS in QEMU with -netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0. Init starts a dedicated telnet-gateway service with scoped port-23 listen authority and restricted shell-launch authority, then gives the child shell only the exact grants described below. On accept, the gateway refuses a bounded initial Telnet option negotiation burst and acts as the terminal host for that connection. It exposes a socket-backed TerminalSession to capos-shell, not a raw TcpSocket, ByteStream, or StdIO replacement for the shell’s existing terminal boundary. From the host:

$ telnet 127.0.0.1 2323
capos login: <anon>
capos$ help
capos$ exit
Connection closed by foreign host.

The same boot proves the shell does not know or care whether its interactive terminal is UART, framebuffer, or TCP-backed Telnet — the TerminalSession provider is interchangeable while the shell-facing authority stays the same. It also exercises the full TCP listener/accept path, not just the outbound connect path used by the Phase A HTTP smoke.

telnet (RFC 854) is deliberate demo wiring: plaintext, no crypto, no authentication of its own. The QEMU target binds the host forward to 127.0.0.1:2323 only and forwards to guest port 23, so the proof is a host-local development demo rather than a remote-access feature. It is not a production access path and will be replaced by the SSH gateway described in SSH Shell Gateway once host-key, user-key, account, audit, and persistence prerequisites are implementable. The value is that Telnet is the cheapest forcing function for a server-side TCP capability and for a socket-backed terminal host. The shell still requires credential verification through the existing login flow (Boot to Shell); the Telnet transport only replaces the physical UART, not the login policy.

Phase B prerequisites

PrerequisiteStateWhy
Capability syscallsStage 4 done (sync)All Nic/socket access goes through the ring
Scheduling + preemptionStage 5 core doneSocket ops block/wake via the scheduler
IPC + capability transferStage 6 3.6 doneListener hands socket caps to the accepting process
Timer capability7.0.0 doneHistorical smoltcp poll clock and socket timeouts; the kernel smoltcp runtime is now retired
Scheduler-driven smoltcp pollretiredThe retained smoltcp runtime was polled from scheduler ticks on real TICK_COUNT; Phase C cleanup removed it
TCP kernel CapObjectsretiredNetworkManager, TcpListener, and TcpSocket previously wrapped the retained smoltcp runtime; qemu-only kernel socket entry points now fail closed
Socket-backed TerminalSession handoffretiredTcpSocket.intoTerminalSession previously consumed a connected socket and returned a move-only TerminalSession cap; rebuild this proof on the userspace network stack before using it as validation
Shell launch bundle handoffretiredtelnet-gateway previously consumed an accepted TcpSocket into a move-only TerminalSession; the gateway demos are removed and remote-shell coverage lives in the in-guest login smokes (run-login, run-default-web-ui)

Phase B does not depend on DeviceMmio, Interrupt, or DMAPool — the NIC driver stays in the kernel. Security Verification Track S.11.2 is a Phase C prerequisite, not a Phase B one.

Phase B schema (kernel CapObjects)

These interfaces are now defined in the canonical shared schema (schema/capos.capnp). The current build pipeline watches and generates bindings for schema/capos.capnp; additional networking schema files remain unnecessary for Phase B.

interface NetworkManager {
    getConfig         @0 () -> (addr :Data, netmask :Data, gateway :Data);
    createTcpListener @1 (port :UInt16) -> (listenerIndex :UInt16);
    connectTcp        @2 (addr :Data, port :UInt16) -> (socketIndex :UInt16);
    # POSIX adapter Phase P1.2 Phase A: bind a UDP socket; the created
    # cap is delivered as a transferred result cap.
    createUdpSocket   @3 (localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16);
}

interface TcpListener {
    accept @0 () -> (socketIndex :UInt16, peerAddr :Data, peerPort :UInt16);
    close  @1 () -> ();
}

interface TcpSocket {
    send                @0 (data :Data) -> (bytesSent :UInt32);
    recv                @1 (maxLen :UInt32) -> (data :Data);
    close               @2 () -> ();
    intoTerminalSession @3 () -> (terminalIndex :UInt16);  # retired; fails closed
}

interface UdpSocket {
    sendTo   @0 (addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32);
    recvFrom @1 (maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data);
    close    @2 () -> ();
}

Nic stays a separate lower-layer cap (schema shown below) and remains kernel-internal in Phase B. UdpSocket landed for the POSIX adapter Phase P1.2 Phase A DNS path: the kernel implements it on top of the same retained smoltcp runtime, and userspace acquires it through NetworkManager.createUdpSocket. It is not part of the Telnet Shell Demo contract.

The ring transport cannot return direct Cap’n Proto capability fields, so capability-producing methods return result-cap indices in the serialized result and append CapTransferResult records after the message bytes. Runtime clients adopt those result caps by index.

accept and recv are blocking capability calls for the Phase B demo: they complete when a connection or received bytes are available, when the socket is closed, or when the caller’s cap_enter timeout/cancellation path fires. recv(maxLen) clamps to the kernel/ring result-buffer limits, and send may return a partial byte count. A readiness/poll interface can be added later without being required for the first remote shell proof.

Telnet gateway launch contract

This contract is historical: the telnet-gateway demo is removed with the kernel socket owner and the kernel SocketTerminalSession. It is retained as the authority-model reference for any future userspace terminal host. telnet-gateway was the terminal host for the remote connection. Its minimum authority was:

  • Manifest-forwarded TcpListenAuthority badge 23, held by init and forwarded to the gateway as the only listener-creation authority for the demo path.
  • Manifest-forwarded RestrictedShellLauncher, held by init and forwarded to the gateway as the only shell process launch authority.
  • Pass-through grants for the caps the current shell requires at startup: creds, sessions, audit, broker, and system_info.
  • An anonymous UserSession minted through SessionManager and checked through AuthorityBroker.shellBundle("anonymous") before launch. The shell still performs password login inside capos-shell and upgrades the session after credential verification.
  • A way to provide the child shell a cap named terminal whose interface id is TerminalSession, backed by the accepted TCP socket.

The gateway must not grant the child raw NetworkManager, TcpListener, TcpListenAuthority, TcpSocket, broad ProcessSpawner, or RestrictedShellLauncher authority. The retired implementation used the kernel socket wrapper (TcpSocket.intoTerminalSession, now failing closed) to produce an actual TerminalSession CapObject; the shell-facing contract stays TerminalSession for any future userspace terminal host.

Phase B exit criteria

  • schema/capos.capnp defined the TCP types above; kernel implemented them as CapObjects on top of the existing smoltcp interface. Initial implementation landed in commit 7446e04 at 2026-04-25 14:48 UTC; review follow-up added timer-safe deferred completion cleanup and make qemu-network-client-harness userspace coverage for outbound sockets and listener accept. This is historical Phase B evidence; qemu-only kernel socket entry points now fail closed.
  • smoltcp polling was driven from the scheduler, not a synthetic clock, so sockets could survive longer than a single early-boot burst. That runtime is retired.
  • A trusted telnet-gateway boot service used TcpListener/TcpSocket, refused the bounded initial Telnet negotiation needed by normal host clients, and launched capos-shell for the accepted connection with a socket-backed TerminalSession plus the shell’s existing login/session caps. The child shell did not receive raw network, TCP listener/socket, broad spawn, scoped-listener, or restricted-shell-launcher authority. This target is retired.
  • A dedicated CUE manifest (system-telnet.cue) and a make run-telnet target historically booted the above and ran a scripted host-side smoke that completed a login + one command + clean exit over telnet 127.0.0.1 2323. make run-telnet now exits with a retirement diagnostic.

Part 3: Userspace Decomposition (Phase C)

Phase C moves the NIC driver and the TCP/IP stack out of the kernel into separate userspace processes, so the kernel is left with only DeviceMmio / Interrupt / DMAPool dispatch and the cap-ring transport. Phase B must be complete first — Phase C is about relocating the code that Phase B already wrapped in capabilities, not about adding new interfaces at the socket layer.

Sequencing relative to the cloud usable-instance milestone. The Network-Reachable Datapath Scope Decision (2026-06-02) records that the real-GCE-boot milestone’s “reachable network stack” requirement means raw-frame TX/RX over the live NIC (the polled production provider), which the billable cloudboot gate already checks. The L4 socket reachability that Phase C delivers is therefore a separate future track sequenced after that milestone, not a milestone blocker.

IPv6 Support Status And Task Lane

Current capOS L4 socket behavior has one production forward path: the Phase C userspace service-object stack. The old qemu-only retained smoltcp runtime that configured 10.0.2.15/24, installed a default IPv4 route through 10.0.2.2, resolved the gateway with ARP, and proved outbound ICMPv4 plus TCP HTTP is retired. Non-qemu production manifests no longer grant the legacy kernel-owned socket caps; requests for kernel network_manager or tcp_listen_authority fail at bootstrap instead of falling through to virtio_stub.rs, and qemu-only kernel TCP/UDP socket entry points fail closed. The userspace IPv6 lane now has local link-local / Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6 address configuration, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs.

The socket-address ABI is now explicit about address family rather than overloading a raw four-byte assumption. schema/capos.capnp defines IpAddressFamily (unspecified / ipv4 / ipv6) and documents a length contract on every address Data field: empty is unspecified (only where the method allows it), 4 bytes is ipv4, and 16 bytes is ipv6. getConfig reports the configured addressFamily and an ipv6Supported flag, so an all-zero IPv4 config is never misread as an IPv6 state. kernel/src/cap/network.rs decodes addresses through a family-typed read_ip_address, accepts IPv4 on the legacy stack, and fails closed on IPv6 there with a distinct ipv6Unsupported-class error and on any other length with a malformedAddress class – so legacy IPv4-only callers reject IPv6 explicitly instead of treating every non-four-byte value as a generic error. capos-rt surfaces the family and IPv6-support flag on NetworkConfig. The wire format stays source-compatible for existing 4-byte IPv4 callers. The behavior behind the userspace-service ABI now has bounded local IPv6 routing, diagnostics, and TCP L4 proofs; private GCE reachability and public IPv6 ingress remain unproved.

The pinned userspace smoltcp dependency is version 0.13.0 in the networking demo crates, not in kernel/Cargo.toml. capOS enables only the features each userspace proof needs. The crate has IPv6, SLAAC, and ICMP socket features available, and it does not provide a socket-dhcpv6 feature matching its DHCPv4 socket. With the address-family ABI landed, remaining IPv6 work is explicit userspace stack behavior and GCE reachability rather than kernel feature enablement.

The protocol gap is larger than “turn on IPv6”: with the local link-local/Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs done, capOS still has no private GCE IPv6 reachability proof or GCE IPv6 firewall proof. The standards and cloud grounding are:

  • RFC 4861: Neighbor Discovery, Router Solicitation/Advertisement, address resolution, and router defaults.
  • RFC 4862: stateless address autoconfiguration, link-local address generation, and Duplicate Address Detection.
  • RFC 4443: ICMPv6 including Echo Request / Echo Reply behavior.
  • RFC 8415: DHCPv6 client and server exchanges on UDP 546/547.
  • Compute Engine IPv6 configuration: dual-stack or IPv6-only subnet requirement, one /96 per interface, first /128 configured by DHCPv6 from the metadata server, default route via route advertisement, and link-local addresses used for Neighbor Discovery.
  • Google Cloud VPC firewall rules: IPv6 rules are supported, each firewall rule uses either IPv4 or IPv6 ranges, and IPv6 ingress needs an explicit allow rule before public access is reachable.

The resulting task lane is linked from Hardware, Boot, and Storage. The cloud-prod-ipv6-architecture-status-grounding scope decision is done (2026-06-03), and the address-family ABI entry point cloud-prod-network-address-abi-ipv6 is done (2026-06-03) as historical qemu-only kernel socket evidence. That target is now retired after kernel socket-owner removal; current address-family/socket behavior is covered by the Phase C userspace IPv4 and IPv6 gates below. The local link-local/Neighbor Discovery proof cloud-prod-ipv6-link-local-nd-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-link-local-nd. The local Router Advertisement / SLAAC proof cloud-prod-ipv6-ra-slaac-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-ra-slaac. The local GCE-style DHCPv6 address configuration proof cloud-prod-ipv6-dhcpv6-gce-config-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-dhcpv6-gce-config. The local ICMPv6 Echo Reply proof cloud-prod-icmpv6-echo-reply-local-proof is done (2026-06-08), proved by make run-cloud-prod-icmpv6-echo-reply. The local IPv6 TCP L4 proof cloud-prod-ipv6-tcp-l4-local-proof is done (2026-06-08), proved by make run-cloud-prod-ipv6-tcp-l4. The lane then sequences private GCE IPv6 and public IPv6 ingress/TLS policy tasks on top of that userspace-stack substrate.

IPv6 does not block the first public GCE Web UI proof while that proof remains scoped to IPv4 DHCP, ARP, Phase C L4, private GCE reachability, and reviewed public HTTPS ingress. It becomes relevant for a later dual-stack or IPv6-only cloud proof and for public IPv6 ingress policy.

Network Usability, Resolver, And Post-smoltcp Lane

The network usability backlog is Network Usability and Post-smoltcp. It records the user-facing work that starts after raw frames and the first userspace L4 proof: operator status tooling, DHCPv4 lease lifecycle, a typed system DnsResolver cap, POSIX getaddrinfo bridging, ping/ping6 diagnostics, socket readiness/cancel/backpressure semantics, packet trace authority, and transport policy/status.

Current boundaries are explicit there: the first local DHCP/IPv4 configuration proof is now done by cloud-prod-network-stack-dhcp-ipv4-config-local-proof and is on the first GCE Web UI critical path, while DHCP renewal/rebind/expiry, DNS option publication, and operator-visible lease status remain follow-up work. The local bounded ICMPv4 Echo Reply proof is also done by cloud-prod-icmp-echo-reply-local-proof, proved by make run-cloud-prod-icmp-echo-reply; it answers a bounded local same-subnet ping and rejects malformed or oversized requests, but it exercises ICMP protocol logic over an in-process QueuePhyDevice, not the real bound NIC. The real-NIC inbound path is now also done by cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof, proved by make run-cloud-prod-icmp-echo-reply-real-nic-datapath: a kernel-owned responder on the legacy virtio 0.9 datapath acquires a DHCP lease over the real NIC, then receives an inbound Echo Request over the real RX vring and transmits an RFC 792 Echo Reply over the same NIC’s TX vring (a host peer over a QEMU socket netdev drives the inbound stimulus, since SLIRP drops inbound host->guest ICMP Echo). Both remain diagnostics rather than Web UI readiness; the real-NIC proof is the local pre-spend prerequisite for the billable private GCE ICMP proof and the same responder serves that live run. The POSIX DNS smoke is a hand-rolled A-query over UdpSocket, not a system resolver service or typed resolver capability. DNS, operator ping tools, IPv6, packet tracing, and advanced transport policy are usability/completeness lanes, not first public Web UI blockers unless a later deployment policy explicitly promotes one.

The backlog keeps smoltcp relocation (Phase C slices 7a-7c: run the selected smoltcp build in userspace, preserve the socket contract) distinct from transport policy/status (the capOS control plane around it). The selected userspace stack is smoltcp 0.13.0 and now has bounded local UDP socket-cap, TCP listener/socket-cap, sustained receive, and serve-from-userspace production socket-cap proofs. DHCPv4, DHCPv6, IPv6 L4, and ICMPv6 are explicit protocol proof lanes rather than ambient production readiness claims; retained qemu-only fixtures remain separate from the production cloudboot path. The done IPv6 protocol proofs (cloud-prod-ipv6-dhcpv6-gce-config, cloud-prod-ipv6-tcp-l4) build their smoltcp interface on an in-process HarnessPhyDevice and self-declare metadata_only=true; the IPv6 datapath over the real bound NIC is now done by cloud-prod-ipv6-real-nic-datapath-local-proof, proved by make run-cloud-prod-ipv6-real-nic-datapath: a userspace smoltcp service on a real-Nic-backed phy (the IPv4 DHCP datapath NicPhyDevice pattern) learns the default route from a Router Advertisement, configures the GCE-shaped /128 via DHCPv6 Solicit/Advertise/Request/Reply, and completes one ICMPv6 Echo probe – every frame over Nic.transmit/Nic.receivePoll against a host peer on a QEMU socket netdev (SLIRP has no stateful DHCPv6 server). That proof records the real-NIC provenance with no metadata_only/in-process disclaimer and is the local pre-spend prerequisite for the billable private GCE IPv6 reachability proof. No current capOS build enables socket-tcp-reno/socket-tcp-cubic, so capOS runs with CongestionControl::None by build configuration, not as a reviewed policy choice. The network-transport-policy-status-decomposition task records that audit and decomposes read-only transport status, keepalive/ timeout policy inputs, and a deferred congestion-control evaluation gated on workload evidence.

Architecture

+--------------------------------------------------+
|  Application Process                             |
|    holds: TcpSocket cap, UdpSocket cap, ...      |
|    calls: connect(), send(), recv() via capnp    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  Network Stack Process (userspace)               |
|    smoltcp TCP/IP stack                          |
|    holds: NIC cap (from driver), Timer cap       |
|    implements: TcpSocket, UdpSocket, Dns caps    |
+---------------------------+----------------------+
                            | IPC (capnp messages)
+---------------------------v----------------------+
|  NIC Driver Process (userspace)                  |
|    virtio-net driver                             |
|    holds: DeviceMmio cap, Interrupt cap, DMAPool |
|    implements: Nic cap                           |
+---------------------------+----------------------+
                            | capability syscalls
+---------------------------v----------------------+
|  Kernel                                          |
|    DeviceMmio cap: maps BAR into driver process  |
|    Interrupt cap: routes virtio IRQ to driver    |
|    DMAPool cap: DMA-eligible frames w/o raw PAs  |
|    Timer cap: provides monotonic clock           |
+--------------------------------------------------+

Three separate processes, each with minimal authority:

  1. NIC driver — only has access to the specific virtio-net device registers, its interrupt line, and DMA-eligible frames. Implements the Nic interface.
  2. Network stack — holds the Nic capability from the driver. Runs smoltcp. Implements higher-level socket interfaces.
  3. Application — holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.

Phase C prerequisites (beyond Phase B)

PrerequisiteOwning gateWhy
Interrupt capabilityDDF Task 5 + S.11.2 driver-transition gateNIC driver receives IRQs without ambient authority
DeviceMmio capabilityDDF Task 5 + S.11.2 driver-transition gateNIC driver accesses device registers under bounded ownership
DMAPool capabilityDDF Task 5 + S.11.1 invariants + S.11.2 gateDMA-eligible frames without raw physical grants
Provider NIC smokeDDF Task 6First end-to-end provider-driver path through reviewed userspace authority instead of the in-kernel ledger

See DMA Isolation for the concrete invariants the three capabilities must satisfy and the Security Verification Track S.11.2 gate that unblocks moving the NIC driver out of the kernel. DDF Task 5 expands those invariants into a reviewable cap-table and ProcessSpawner manifest surface; DDF Task 6 is the first provider NIC smoke that consumes them end-to-end.

Current Phase C evidence includes the userspace virtio-net driver slices through the clean independent Nic.transmit/Nic.receive split, the 7a local userspace smoltcp substrate over that Nic cap, the 7b userspace UDP socket-cap layer, the 7c-i inter-process UdpSocket proof, the 7c-ii(a) inter-process TcpListener/TcpSocket proof, the sustained-receive TCP substrate, the 7c-ii(b) local serve-from-userspace production socket-cap proof, and retirement of the non-qemu legacy kernel socket grant path. The 7c-ii(b) proof starts the userspace network-stack process as the non-qemu cloudboot init process, spawns an application client with only Console plus a userspace-served TcpListenAuthority, and completes one local hostfwd TCP request/response through served TcpListener/TcpSocket caps. It is still narrower than the exit criteria below: the proof process keeps the existing DeviceMmio/DMAPool/Interrupt bring-up caps in-process until the future driver-service split, the long-lived service shape is still future work, and the selected GCE Web UI milestone now consumes the done DHCP/IPv4 configuration proof while still needing the local remote-session Web UI L4 proof, private GCE reachability, and the tracked Web UI hardening gates. The legacy kernel cap/network.rs / virtio_stub.rs socket route is fixture/negative-path cleanup territory, not the architecture to extend.

Phase C exit criteria

  • NIC driver runs in its own userspace process, holding only DeviceMmio, Interrupt, and DMAPool caps.
  • Network stack runs in a second userspace process, holding only the Nic cap from the driver and a Timer cap.
  • A successor socket-backed terminal or Web UI proof is rebuilt on the userspace network stack; the Phase B Telnet fixture is retired after kernel socket-owner removal.
  • The kernel contains no smoltcp dependency and no virtio-net code on the hot path.

Lower-layer capability schema (drafts — used by Phase C)

Phase B does not expose these to userspace; Phase C does. Timer is already implemented (see schema/capos.capnp).

Phase C track opened (2026-06-02). The Phase C Userspace NIC Driver Relocation design adopts this inline-Data frame ABI as-is (a DmaBuffer-handle zero-copy variant was considered and rejected to keep the change small; the frame stays in a kernel-owned bounce buffer the polled provider already proved). The methods carry the capOS result/reason/sideEffect evidence triple, and receive also reports the observed EtherType. See that doc for the cap-surface gap (no pending security ruling – the writable common-config window extends the accepted notify-doorbell selected-write discipline) and the bounded slice chain.

Slice 1 landed (2026-06-02). The unimplemented Nic interface below is now in schema/capos.capnp so the later coupled-TX/RX slices (3-4) extend it rather than introduce it; no CapObject implements it yet. Slice 1 (cloud-prod-nic-driver-userspace-features-ok-local-proof) also relocated the virtio device handshake to FEATURES_OK into a userspace driver shim over a writable selected-write common-config DeviceMmio window (the four handshake registers admitted on DeviceMmio.write32, queue-address writes fail closed); proof make run-cloud-prod-nic-driver-userspace-features-ok.

The landed Nic schema (inline Data + the capOS evidence triple):

interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data, result :Text, reason :Text, sideEffect :Text);
    linkStatus @3 () -> (up :Bool, result :Text, reason :Text, sideEffect :Text);
}

The driver relocation reuses the production DeviceMmio cap (a read-only BAR window with selected writes) and Interrupt cap (schema/capos.capnp) rather than the simplified map/wait sketches earlier drafts of this section used.

Part 4: Cross-cutting

Userspace language runtimes that need sockets

Userspace language runtimes that map their stdlib socket APIs onto capOS capabilities consume the same TcpSocket/UdpSocket surface this proposal defines, so the Phase A-B kernel-resident state above is what their socket imports currently fail closed against:

  • The POSIX adapter (libcapos-posix/) already maps socket(AF_INET, SOCK_DGRAM, 0)/sendto/recvfrom/close onto the Phase B UdpSocket cap for the Phase P1.2 Phase B DNS resolver smoke; see Userspace Binaries and POSIX Adapter.
  • WASI Preview 1 sock_send / sock_recv route through the WASI host adapter on top of the same caps. Phase W.6 (sockets) remains blocked on socket authority surfacing through the wasm-host CapSet; the W.2 ERRNO_NOSYS refusal harness in Language Support Status and Plans (WASI / WebAssembly row) is the current evidence that no socket authority leaks before that gate.

Neither track changes the trust-boundary debt: socket-using userspace runtimes still depend on the kernel-resident smoltcp stack until Phase C relocates it.

TLS Layering

TLS does not live in this proposal: the TcpSocket here is the bottom of the transport stack; a TlsSocket wraps it and is configured from the certificate, trust-store, OCSP, and verifier caps defined in Certificates and TLS. Keys consumed by TLS come from Cryptography and Key Management.

Draft shape (tracked in the certificates proposal):

interface TlsSocket {
    # Client handshake: wrap an outbound TCP socket with a client config.
    connect @0 (tcp :TcpSocket, config :TlsClientConfig) -> ();
    # Server handshake: accept on a TCP socket with a server config.
    accept  @1 (tcp :TcpSocket, config :TlsServerConfig) -> ();
    send    @2 (data :Data) -> (bytesSent :UInt32);
    recv    @3 (maxLen :UInt32) -> (data :Data);
    close   @4 () -> ();
    peerCertificate @5 () -> (chain :CertificateChain);
    alpnSelected    @6 () -> (protocol :Text);
}

Open Questions

  1. DMA memory management. Dedicated DmaAllocator capability vs extending FrameAllocator with allocDma?
  2. Socket readiness model. Phase B uses blocking accept/recv calls for the demo. The long-term interface still needs a readiness/poll or cancellation shape for multiplexed services.
  3. Buffer ownership. Copy into IPC message vs shared memory vs capability lending?

References

Crates

Specs

Prior Art

QEMU

Scope Decision: Real-GCE “Reachable Network Stack” – Raw-Frame TX/RX vs L4 Sockets

Decision

Option A. For the second cloud milestone (“usable cloud instance”, docs/backlog/hardware-boot-storage.md), the network data-path reachability bar – “a reachable network data path” / “reachable network stack” – means raw-frame (ethernet) TX/RX reachability over the live GCE NIC: the production polled userspace virtio-net provider exchanging frames over the real function is the reachability proof. Slices 1-4 of the GCE polling-path track plus the slice-6 billable boot close that data-path reachability bar.

L4 sockets (TCP/UDP reachable from a userspace application) are a separate future track – networking-proposal Phase C – and are explicitly not a real-GCE-boot data-path blocker. This decision does not start that track; it records that the track exists, is sequenced after the milestone, and is gated by its own Phase C prerequisites rather than by the cloud usable-instance data-path bar.

Scope boundary: data-path reachability vs L4 terminal access

The milestone bullet (docs/backlog/hardware-boot-storage.md, “Second cloud milestone: usable cloud instance”) states two network requirements, not one: add network drivers and “prove SSH/WebShell or other network terminal access over the cloud NIC.” SSH and WebShell are inherently L4 (TCP) – a raw frame cannot carry an SSH session. Option A therefore disambiguates only the first requirement (the network data path / “reachable network stack”, which is also what the billable gate checks). It does not claim that raw frames satisfy the SSH/WebShell terminal-access requirement. L4 network terminal access (SSH/WebShell) is deferred to Phase C and is tracked there; the operator access path demonstrated today is the serial-console shell (cloudboot access-path serial-console-shell marker), not a network terminal. Option A is thus a deliberate re-scoping of the milestone’s network-reachability gate down to the raw-frame data path, with L4 terminal access sequenced after the milestone – not a claim that the milestone delivers SSH/WebShell.

Rationale

The decisive principle for the data-path bar: the milestone’s automatically gated network proof is whatever the billable harness actually checks. The billable gate is make cloudboot-test (tools/cloudboot/run-test.sh). Reading that harness directly settles the ambiguity in the “reachable network stack” phrasing in one observation – it never checks an L4 socket round-trip. (The milestone’s separate SSH/WebShell terminal-access requirement is not harness-gated today and is handled under “Scope boundary” above: deferred to Phase C.)

What the cloudboot harness actually gates on

run-test.sh has exactly two success gates over kernel network behavior, and both are below the L4 layer:

  • Boot landmark. run-test.sh:BOOT_LANDMARK is the literal string capos kernel starting; main’s step 5 polls the serial port until that landmark appears (run-test.sh:main, the grep -q "${BOOT_LANDMARK}" poll loop). No TCP, no UDP, no handshake.
  • Provider-NIC proof (optional, raw-frame). Under --require-provider-nic-proof (run-test.sh:REQUIRE_NIC_PROOF), the run fails unless the serial output contains the run-test.sh:NIC_PROOF_MARKER line (cloudboot-evidence: provider-nic-bound <token>). The gate is pure marker presence (serial_marker_tokens "${NIC_PROOF_MARKER}" non-empty); it parses no socket state and performs no connect/send/recv against the instance.

The provider-nic-bound marker is, by its own documented contract (tools/cloudboot/README.md, “Serial evidence-marker contract”), a raw-frame bind proof: the non-qemu kernel composes the DeviceMmio + DMAPool/DMABuffer + MSI-X Interrupt grant proofs over one virtio function, programs the MSI-X table entry, and tears down with stale-handle assertions. It explicitly does NOT write any virtio common-config register, does NOT activate the device, and emits a summary line recording device_autonomous_raise=not-attempted. There is no IP address, no socket, and no L4 protocol anywhere in the marker contract. The harness’s structured provider.json schema (tools/cloudboot/README.md, “provider.json schema”) likewise has no TCP/UDP/socket/L4 field – the network-facing fields are provider_nic_proof, enumerated_device_classes, enumerated_device_inventory, dma_pool_grant, interrupt_route_allocated, interrupt_route_delivered, and storage_bind_proof, all device/frame-level.

Choosing Option B would mean adopting a milestone acceptance bar (an L4 socket round-trip) that the billable gate does not enforce, and blocking the milestone on a large Phase C chain that the milestone’s own proof substrate never exercises. That is not an honest reading of the gate.

What the production polled path can and cannot reach today

Can reach (raw frame): kernel/src/cap/virtio_net_polled_provider.rs is the always-built (non-qemu) production provider. It exercises raw-frame DMABuffer movement over the live virtio function: the provider submits the brokered RX receive buffer and observes its completion by polling the used ring (InterruptCapVirtioNetPolledProvider::invoke_wait reads the latched PublishedRx used.idx/used[0] captured in attempt_rx_submit), with zero interrupts – no device_interrupt::wait_kernel_injected_dispatch, no inject_real_lapic_int_for_proof on the wait/ack path. The TX leg is a kernel-half SLIRP stimulus (a manager-owned broadcast-ARP frame authored on queue 1 to elicit the inbound reply, attempt_rx_submit “Stimulus” step), not a provider-submitted frame. One real device->host RX DMA of used_len=76 (an ethernet frame, ethertype 0x0806 ARP) has been observed this way. This is the ethernet-frame level: frames traverse the live function in both directions, with the provider owning the RX receive path.

Cannot reach (L4): there is no TCP/UDP socket layer in the production data path. The entire L4 surface is cfg(feature = "qemu")-gated and replaced in the cloud kernel by kernel/src/virtio_stub.rs, whose socket entry points all fail closed:

  • virtio_stub.rs:create_tcp_listener -> NetworkError::DeviceUnavailable
  • virtio_stub.rs:connect_tcp_ipv4 -> NetworkError::DeviceUnavailable
  • virtio_stub.rs:create_udp_socket -> NetworkError::DeviceUnavailable
  • virtio_stub.rs:send_tcp / recv_tcp -> NetworkError::InvalidSocket
  • virtio_stub.rs:accept_tcp -> NetworkError::InvalidListener
  • virtio_stub.rs:network_config -> all-zero addr/netmask/gateway
  • virtio_stub.rs:poll_scheduler -> no-op

The cap/network.rs TCP/UDP socket CapObject family (TcpListener/TcpSocket/UdpSocket, deferred accept/recv waiters, the socket-terminal handoff) is wired to crate::virtio::poll_scheduler – i.e. to the stub in production – so in the cloud kernel a userspace caller holding a socket cap gets DeviceUnavailable/InvalidSocket, not a connection. The in-kernel smoltcp stack, TCP listeners, accepted-socket state, the cooked-mode line discipline, and the Telnet IAC filter live only in the cfg(qemu) kernel/src/virtio.rs build.

Why Option B is genuinely a separate, larger track

Option B is networking-proposal Part 3: Userspace Decomposition (Phase C): relocating smoltcp and the cap/network.rs socket caps out of the cfg(qemu) kernel/src/virtio.rs into a userspace NIC-driver process (holding DeviceMmio/Interrupt/DMAPool) and a userspace network-stack process (holding the Nic cap + Timer), with applications holding socket caps. Its declared exit criterion is “the kernel contains no smoltcp dependency and no virtio-net code on the hot path.” Its prerequisite table (networking-proposal “Phase C prerequisites”) requires production grantable DMAPool/DeviceMmio/Interrupt lifecycles, real provider-driver interrupt wait/ack/mask/unmask consumption, durable audit consumption, an IOMMU domain or explicit production bounce-buffer policy, and full driver ownership handoff – and the proposal itself states current DDF evidence is “narrower than these Phase C prerequisites.” This is a multi-slice chain, not a finishing touch on the milestone.

Sequencing it after the milestone is also consistent with the GCE polling-path decision already recorded in the backlog (2026-06-01): the production data path is polled, device-autonomous MSI-X is a parallel efficiency follow-up, and the milestone is deliberately decoupled from interrupt delivery. Raw-frame reachability is the layer that decision already commits to; L4 sits above it.

Consequence

  • Slices 1-4 (the real polled provider, its default-manifest graduation, the real provider-nic-bound source, and the polled-provider stale-authority teardown) plus slice 6 (the billable make cloudboot-test --require-provider-nic-proof boot) close the usable-cloud-instance milestone’s network data-path reachability bar – the requirement the billable gate actually checks.
  • The milestone’s separate SSH/WebShell / network terminal access requirement is not closed by these slices; it is L4 and is deferred to Phase C as future work. The access path demonstrated on the current cloud kernel is the serial-console shell, not a network terminal.
  • L4 sockets remain future work under networking-proposal Phase C, gated by the Phase C prerequisites, not by the data-path bar. No child task chain is created by this decision; Phase C is tracked where it already lives (the networking proposal and the DDF Task 5/6 prerequisites in docs/backlog/hardware-boot-storage.md).

2026-06-08 Follow-Up: Phase C Web UI Chain

The later Phase C serve-from-userspace proof does not reopen the 2026-06-02 raw-frame-vs-L4 decision above. That decision remains the historical scope record for the closed usable-cloud-instance raw-frame data-path bar. The selected milestone has since moved to GCE Self-Hosted Web UI, whose proof chain owns L4 and Web UI reachability through separate task records.

The relevant Phase C design home is Phase C Userspace NIC Driver Relocation. Its local 7c proof is now landed in cloud-prod-userspace-network-stack-smoltcp-local-proof: the non-qemu cloudboot manifest starts the userspace smoltcp network-stack process, serves a scoped TcpListenAuthority, and completes one local host-forwarded TCP request/response through served TcpListener/TcpSocket caps. That is local cloudboot L4 evidence, not private GCE reachability and not public operator ingress.

The current Web UI ladder is task-owned:

This follow-up changes documentation scope only. It does not change any remaining task status, selected milestone, cloud resource posture, public ingress authority, TLS custody, or production release authority.

Inputs weighed

  • tools/cloudboot/run-test.sh (BOOT_LANDMARK, NIC_PROOF_MARKER, REQUIRE_NIC_PROOF, main, PROVIDER_JSON_REQUIRED_KEYS) and tools/cloudboot/README.md (“Serial evidence-marker contract”, “provider.json schema”, “Gate semantics”) – the billable gate, and the single most decisive input.
  • kernel/src/virtio_stub.rs – the production L4 surface (all socket entry points fail closed).
  • kernel/src/cap/network.rs – the L4 socket CapObject contract, wired to the stubbed poll_scheduler in production.
  • kernel/src/cap/virtio_net_polled_provider.rs – the always-built raw-frame polled provider (real device->host RX DMA used_len=76, zero interrupts).
  • docs/proposals/networking-proposal.md, Part 3 (Phase C architecture, prerequisites, exit criteria) – the scope of Option B.
  • docs/backlog/hardware-boot-storage.md, “Cloud Device Tracks – Real GCE Polling Path (decoupled from MSI-X)” – the track this decision is slice 5 of.

Phase C: Userspace virtio-net Driver Relocation

This is the L4 track opened by the Network-Reachable Datapath Scope Decision (Option A): raw-frame TX/RX reachability is the cloud milestone bar, and the L4 socket path – relocating smoltcp and the cap/network.rs socket caps out of the cfg(qemu) kernel/src/virtio.rs into userspace processes (networking-proposal Part 3, Phase C) – is a separate future track. This doc designs that track and sequences its slices.

The first Phase C Web UI path remains IPv4-scoped: userspace L4 plus DHCP/IPv4 configuration, ARP, and the private/public GCE Web UI proofs. IPv6 is tracked as a separate network-stack capability lane in Networking and the hardware/cloud backlog. Phase C must preserve enough address-family shape for that lane, but lack of IPv6 does not block the first IPv4 GCE Web UI proof.

Cap-Surface Delta: What The Userspace Driver Needs

The current DeviceMmio / DMAPool / Interrupt cap surface does not yet host the virtio-net driver in userspace as built, but the missing pieces are bounded extensions of accepted patterns and a reuse of the landed production DMA-isolation track – not new isolation built from scratch. Per-primitive evidence:

  • DeviceMmio gives a read-only BAR window with one selected write today. DeviceMmio.map returns a read-only BAR page; raw write32 is refused (register_write = "blocked"), with exactly one selected write permitted – the notify doorbell at @5 (notify_doorbell, kernel/src/cap/device_mmio.rs). A driver must additionally write the virtio common-config window (device status, feature-select/feature, queue-select, queue-size, queue-address/queue-enable). The relocation adds these as further selected writes under the same accepted range-check + read-back discipline the notify doorbell already enforces (see “The Common-Config Window” below) – not a new write primitive.
  • DMAPool does not yet export a device-usable address to this driver, but the export discipline is landed. DMAPool gives one bounce page; the host-physical / device address is not exported in the bounce posture (host_physical_user_visible = false, direct_dma = "blocked", iova_export = "disabled-future-only", kernel/src/cap/dma_buffer.rs). The vring is kernel-owned today (kernel/src/cap/virtio_net_polled_provider.rs), so userspace does not yet place its own descriptors. The mechanism to let it do so safely – a manager-owned bounce buffer or a domain-scoped IOMMU IOVA, never a raw host-physical address – is already landed (the production DMA-isolation track; see “The Userspace-Ownable vring Slice” below); the slice-2 work wires it to the driver’s vring.
  • Interrupt is wait-only over a kernel-latched used ring. Interrupt.wait reads a kernel-latched used-ring index, acknowledge is a no-op, and mask/unmask are refused. A real driver owns its IRQ lifecycle (mask, unmask, EOI ordering).

Classifying the virtio-net bring-up steps against what userspace can do today makes the gap concrete – almost every step is kernel-only:

Bring-up stepUserspace-doable today?
Device reset (write status = 0)No – needs writable status register
ACKNOWLEDGE / DRIVER status bitsNo – needs writable status register
Feature negotiate (select + read + write + FEATURES_OK)No – needs writable feature-select/feature + status
Queue program (queue-select, queue-size, queue-address, queue-enable)No – needs writable common-config + a device-usable vring address
vring allocation (avail/used/descriptor tables)No – vring is kernel-owned; no device-usable buffer address export
DRIVER_OK (write status)No – needs writable status register
MSI-X program / vector assignmentNo – kernel-owned
Submit + notify (ring doorbell)Partial – the one selected doorbell write @5 exists
Poll used ringPartial – via the kernel-latched index Interrupt.wait reads
Teardown (reset, scrub, release)No – kernel-owned reset path

The Nic ABI (Inline Data, Per the Proposal Draft)

This track keeps the networking-proposal Part 3 frame ABI: frames cross the cap boundary as inline Data (transmit @0 (frame :Data), receive @1 () -> (frame :Data), networking-proposal:443). The kernel copies the frame into and out of the manager-owned bounce buffer the polled provider already established, so no host-physical address or device-usable buffer handle is exported to userspace and host_physical_user_visible=0 is preserved. capOS method convention adds the result/reason/sideEffect evidence triple and the observed EtherType to the result:

interface Nic {
    transmit @0 (frame :Data)
        -> (result :Text, reason :Text, sideEffect :Text);
    receive  @1 ()
        -> (frame :Data, observedEthertype :UInt16,
            result :Text, reason :Text, sideEffect :Text);
    macAddress @2 () -> (addr :Data);
    linkStatus @3 () -> (up :Bool);
}

Why inline Data, not a zero-copy buffer handle. A DmaBuffer-handle (zero-copy) ABI was considered – it would avoid the per-frame copy – but rejected to keep the change small: it introduces a new buffer-ownership protocol across the cap boundary (who allocates, who frees, lifetime versus the call) on top of the security work this track already requires, for a copy cost that does not matter at research scale. Inline Data matches the accepted proposal draft, keeps the frame staging kernel-owned exactly as the polled provider proved, and defers any zero-copy optimization to a later, separately justified slice.

The Common-Config Window (Selected-Write, No New Ruling)

Relocation writes the virtio common-config window from userspace through a bounded, range-checked, selected-write path modeled on the existing single selected write (notify_doorbell @5, kernel/src/cap/device_mmio.rs; device_manager::provider_notify_doorbell_write_for_cap). This is the next register in the accepted selected-write pattern, not a new security relaxation requiring a ruling. DeviceMmio already refuses raw write32 (register_write = "blocked") and admits exactly the claimed selected write, range-checked against the decoded BAR and followed by a kernel-asserted read-back; the handshake registers are added to that same admission list under the same discipline.

The bounded design:

  • A selected-write common-config window: only the named virtio common-config registers needed for handshake (device status, feature-select, device-feature, driver-feature) are writable in slice 1, each range-checked against the claimed BAR.
  • Read-back-assertion discipline: every selected write is followed by a read-back the kernel asserts, so a userspace driver cannot leave the device in an unverified state.
  • Queue-address registers stay fail-closed in slice 1 and are admitted to the same selected-write list only in slice 2, where each programmed value must resolve to a device-usable address the writing driver was granted (the DMA-isolation discipline below decides), so userspace can never point the device at arbitrary physical memory.

No new ruling is pending: the project already decided this posture through the accepted selected-write discipline and the IOVA-export discipline below. Slice 1 is ready.

The Userspace-Ownable vring Slice (Reuses Landed DMA Isolation)

The expensive-sounding piece – slice 2, a userspace-ownable vring plus a device-usable buffer-address export – is wiring already-landed isolation to the driver’s vring, not building isolation from scratch. The two backends and the no-host-physical export discipline are all landed:

  • Bounce-buffer path (production default on no-IOMMU shapes). The runtime DMA-backend probe (kernel/src/dma_backend.rs, select_and_report / probe_verified_usable_iommu) selects the labeled bounce-buffer fallback fail-closed when no usable guest IOMMU is verified, and the manager-owned bounce-buffer DMAPool / DMABuffer lifecycle (kernel/src/device_dma.rs, scrub-before-free, owner/slot generations, quiesce-before-release) is the landed authority a driver uses for device-visible buffer memory (cloud-prod-dmapool-bounce-buffer-grant-proof).
  • IOMMU-IOVA path (graduates when the probe verifies usable hardware). The Intel VT-d remapping path (kernel/src/iommu.rs, cfg(qemu) today) programs per-device domains, maps manager-owned DMAPool pages, and exports only a domain-scoped IOVA – never a host-physical address (ddf-iommu-remapping-production-closeout, ddf-iommu-production-dmapool-ledger-integration, ddf-iommu-per-device-domain-granularity, ddf-iommu-production-revoke-teardown-hostile-smokes, ddf-real-dma-iommu-direct-path).
  • No-host-physical export discipline. The IOVA-export discipline (ddf-iommu-iova-export-discipline) and the host_physical_user_visible = false / iova_export = "disabled-future-only" posture (kernel/src/cap/dma_buffer.rs) guarantee a driver receives only a device-usable address (bounce handle or domain-scoped IOVA), never a raw host physical address.

The accepted contract is docs/dma-isolation-design.md (“Cloud DMA Backend” runtime-selection rule and the IOVA-export-discipline clause), and the S.11.2 hostile-smoke matrix is already enforced for both backends. The remaining slice-2 work is therefore to let the userspace driver allocate its vring through the granted DMAPool (bounce, or IOVA-backed when the probe verifies usable hardware), learn the device-usable address for each ring, and program those addresses into the queue-address registers over the slice-1 writable window – under the landed fail-closed / scrub / quiesce / revoke discipline. The networking-proposal Phase C prerequisites table and S.11.2 are satisfied by the landed track, not deferred to it.

Bounded Slice Chain

  1. Userspace status + feature handshake to FEATURES_OK over a writable selected-write common-config window (the next register in the accepted notify-doorbell selected-write discipline). [DONE 2026-06-02.] The cap::devicemmio_grant_source_prod source stages the virtio-net common-config window as a writable selected-write DeviceMmio grant (stage_virtio_net_common_config); the userspace shim drives the handshake over DeviceMmio.read32/write32, the write admission (device_manager::stub::write_devicemmio_u32) admits only the four handshake registers (range-checked + read-back-asserted) and refuses queue-address writes; the unimplemented Nic stub is in schema/capos.capnp. Proof make run-cloud-prod-nic-driver-userspace-features-ok. Task record: docs/tasks/done/2026-06-02/cloud-prod-nic-driver-userspace-features-ok-local-proof.md.
  2. Userspace-ownable vring + device-usable address export – reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export, S.11.2 already enforced); the work is wiring it to the driver’s vring. [DONE 2026-06-03.] Under cloud_virtio_net_userspace_ownable_vring_proof (implies slice 1) the userspace shim co-receives the writable common-config DeviceMmio grant and a bounce-buffer DMAPool grant on the same virtio-net function; it allocates its descriptor / available / used ring pages, learns each buffer’s opaque device-usable handle from DMABuffer.info (deviceIova, scope bounce-handle), and programs queue_desc / queue_driver / queue_device over the slice-1 window. device_manager::stub::write_devicemmio_u32 (admit_virtio_queue_address_write) resolves each handle against the live DMAPool grant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle), and read-back-asserts; queue-address reads (0x20..0x38) are refused so the host-physical address is never exposed, and out-of-grant / host-physical / stale-generation writes fail closed. queue_enable stays fail-closed. Proof make run-cloud-prod-nic-driver-userspace-ownable-vring. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-ownable-vring-local-proof.md.
  3. Userspace queue-program + DRIVER_OK over the (now device-addressable) vring. [DONE 2026-06-03.] Under cloud_virtio_net_userspace_queue_enable_driver_ok_proof (implies slice 2) the userspace shim completes device bring-up: after slice 2’s queue-address programming it writes queue_enable = 1 (0x1c) for its programmed TX queue and sets DRIVER_OK over the already-writable device-status register. device_manager::stub::write_devicemmio_u32 admits the queue_enable write only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): it reads the active queue_desc/queue_driver/queue_device back kernel-side and requires each to currently hold the host-physical address of a live granted DMABuffer (a freed buffer’s stale address cannot arm a use-after-free DMA target), and requires the active queue_size to fit every split-ring structure inside one granted bounce page; an enable of an unprogrammed, freed, or oversized queue fails closed, and the enable is read-back-asserted. Once enabled, the queue’s vring base registers are immutable – a queue-address repoint is refused so the driver cannot mutate the vring under a running device. The DRIVER_OK device-status write is kernel-asserted: the kernel re-reads device-status and fails closed unless the device latched the ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK byte exactly (rejecting FAILED and DEVICE_NEEDS_RESET). Queue-address reads stay refused; no host-physical is exposed; no new DMA isolation backend. Proof make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-queue-enable-driver-ok-local-proof.md. 4a. Userspace RX queue 0 bring-up + buffer-identity binding + ring-buffer pinning, then the first real RX DMA from the shim-owned vring. Split into two landed/ready sub-slices because the real-DMA hybrid bring-up is large:
    • 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own vring (slices 1-3 brought up only the TX queue; the queue_enable admission is queue-agnostic). device_manager::stub retains each programmed queue’s vring physes + originating DMABuffer handle identity on ProductionDeviceRecord (admit_virtio_queue_address_write), binds queue_enable to that identity (a freed buffer’s stale handle, or a freed-then-reallocated frame at the same host-physical address, fails closed with devicemmio-queue-enable-identity-mismatch), and pins the ring buffers against freeBuffer / process-teardown release while the queue is enabled (dmabuffer-pinned-enabled-vring), releasing only on disable/reset with quiesce. This completes the vring buffer-lifetime binding slice 3 left point-in-time at the bring-up boundary. No device DMA. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-bringup-and-buffer-pinning-local-proof.md.
    • 4a-ii [DONE 2026-06-03]. The first real RX DMA from the shim vring: the shim also brings up TX queue 1 over its own vring, posts one device-writable RX receive buffer on queue 0 (DMABuffer.submitDescriptor), and rings the production DeviceMmio.notifyDoorbell @5 (the previously Err(stale_handle) provider_notify_doorbell_write_for_cap, now live; cap::devicemmio_grant_source_prod maps the notify region kernel-side and captures the per-queue notify slot offsets). The kernel (cap::virtio_net_userspace_rx_dma_proof) drives the RX publish over the shim’s retained RX physes + a kernel-half SLIRP TX ARP stimulus over the shim’s retained TX physes + one real device->host RX DMA (used_len > 0, observed EtherType 0x0806), latches the used-ring index (int_injected = 0, no Interrupt cap), and resets the device – quiescing the queues and releasing the ring-buffer pins – WITHOUT the Nic cap. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup (extended). The deterministic freed-then-reallocated-frame identity negative is split to a follow-up: the next-fit frame allocator (capos-lib FrameBitmap, free_frame does not rewind next_hint) never returns a just-freed frame on the next allocation, so a deterministic same-phys realloc – needed to reach the slice-4a identity gate rather than the slice-3 phys gate – requires an allocator reuse seam. The data path is cooperative-shim-safe (kernel authors the descriptor + avail inside the drive window, re-validates the posted buffer live + unmapped before publishing, resets on every post-publish path); the HOSTILE-shim residuals (kernel-exclusive/unmappable enabled vring rings + reset-failure payload quarantine) are closed in 4a-iv; the identity negative is closed in 4a-iii. Task records: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-local-proof.md; follow-ups docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md, docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md.
    • 4a-iv [DONE 2026-06-03]. HOSTILE-shim hardening over the 4a-ii data path. The shim owns its vring ring buffers (granted bounce DMABuffers it can DMABuffer.map), so the kernel cannot trust ring contents it did not author and lock. The closed gaps: (1) a programmed/enabled vring ring buffer is made kernel-exclusivevalidate_dmabuffer_map_admission refuses a NEW DMABuffer.map of it while the queue is enabled, selected_queue_identity_bound refuses queue_enable while a ring buffer still carries a live user mapping (a kept pre-enable VMA) and refuses arming a queue whose ring buffers are not pairwise-distinct pages – both within the queue and across the device’s other enabled queues (an aliased desc/driver/device, or rx.desc == tx.desc, would let the kernel-authored ring writes corrupt a descriptor into a non-bounce DMA target), and record_dmabuffer_user_mapping refuses to record a mapping on a programmed/enabled ring (an in-flight SMP map), and the RX-DMA submit/drive admission refuses to publish an enabled ring buffer as the RX DMA payload (the shim cannot point the device at its own ring page); (2) at queue_enable the kernel WIPES the queue’s descriptor table slot 0, available ring, AND used ring (virtio_net_userspace_rx_dma_proof::sanitize_enabled_queue_rings), so a shim that pre-wrote avail.idx / a tampered descriptor / a spoofed used.idx while the queue was disabled cannot pre-publish it into the enabled window – the device sees an empty queue with no pre-staged completion until the kernel-authored drive publishes and a real device DMA advances used.idx; (3) a per-bdf RX-DMA payload drive pin / reset-failure quarantine (device_manager::stub begin_rx_dma_drive_pin / clear_rx_dma_drive_pin / mark_rx_dma_payload_quarantine_permanent, consulted by the map admission + the record path + the freeBuffer/teardown detach) is set atomically with the live+unmapped re-validation under the device-table lock for the drive duration, cleared after a confirmed reset, and promoted to a permanent quarantine on the catastrophic reset-failure path (never downgraded by a later drive). The smoke (make run-cloud-prod-nic-driver-userspace-rx-bringup, extended) proves a still-mapped ring blocks queue_enable, the post-enable map refusal on the descriptor + driver rings, and that a hostile pre-enable descriptor/avail.idx tamper does NOT survive (RX DMA still completes with used_id=0, ARP EtherType). The drive pin’s SMP map/free race and the reset-failure quarantine are structural fail-closed hardening: the single-CPU cloudboot proof cannot reach the race and QEMU virtio reset always succeeds, so they are not separately QEMU-observable. Residual [CLOSED 2026-06-03 in 4a-v]: two SMP-only, bounce-confined races between DMABuffer.map and a device-authority state transition – the cap-side map_page_into_user installed the user PTE before the manager-side record (a manager-rejected map left a transient provisional PTE), and the queue_enable no-live-mapping check was not atomic with the retained enabled flag flip (a concurrent map could record a mapping in the enable window) – were not reachable by the single-CPU proof and are closed by slice 4a-v below. Predecessor task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md.
    • 4a-iii [DONE 2026-06-03]. The deterministic freed-then-reallocated-frame queue_enable identity negative carved out of 4a-ii. A proof-only one-shot bounce-frame reuse seam (mem::frame::proof_try_alloc_specific_frame_zeroed consumed by a device_dma reuse hint armed on each bounce free, both gated behind cloud_virtio_net_userspace_rx_bringup_proof, never compiled into production) makes a same-host-physical DMABuffer realloc reachable from the userspace harness despite the production next-fit FrameBitmap. The smoke programs a transient ring buffer into rx queue_desc, frees it, relands a fresh-handle buffer on the same frame, and proves queue_enable fails closed on the slice-4a identity gate (authority_result = devicemmio-queue-enable-identity-mismatch) – the recorded phys is live again so the slice-3 phys gate passes, and the marker flips identity_realloc_negative=enforced. The seam does not relax the next-fit policy or the identity gate. Proof make run-cloud-prod-nic-driver-userspace-rx-bringup (extended). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md.
    • 4a-v [DONE 2026-06-03]. Closes the two SMP-only, bounce-confined DMABuffer.map-vs-device-authority races split out of 4a-iv. (1) kernel/src/cap/dma_buffer.rs::map_page_into_user now takes the manager-side mapping record (record_dmabuffer_user_mapping, which acquires the device-table lock) BEFORE installing the user PTE, with the caller address-space lock held across the manager call (lock order address-space -> device-table -> quarantine, no reverse nesting): a manager-rejected map returns before any PTE is installed, so a concurrent SMP thread can never touch a transient provisional mapping at the deterministic auto-picked base. (2) kernel/src/device_manager/stub.rs::try_enable_selected_queue folds the queue_enable no-live-mapping identity check and a retained enable_pinned commit pin into one device-table critical section, set before the MMIO arm write and rolled back on a readback mismatch, so a concurrent map cannot record a mapping in the enable window. The commit pin is kept distinct from the device-armed enabled bit (set after the MMIO write): map admission / freeBuffer pinning consult enabled || enable_pinned, while the RX-DMA drive gates on enabled alone, so a concurrent drive cannot run – and reset-clear the retained state – inside the arm window. A per-queue enable_in_progress claim serializes concurrent queue_enable transitions (enable vs enable, enable vs disable) so a racing queue_enable = 0 can never clear an in-flight enable’s commit pin, and the drive’s reset cleanup skips clearing while a transition is in flight so a stale drive cannot clear a newer enable’s pin. The two transitions now serialize on the device-table lock: one fails closed. The interleavings are not single-CPU-reachable and the live kernel statics are not host-testable, so the proof is an exhaustive Loom interleaving model (capos-config/tests/dmabuffer_map_enable_loom.rs, run via cargo test-dmabuffer-map-enable-loom) that asserts no schedule arms a queue with a live user PTE on its ring buffer or installs a PTE without an accepted record; the single-CPU make run-cloud-prod-nic-driver-userspace-rx-bringup regression continues to pass unchanged. Residual [CLOSED 2026-06-03]: the follow-up fenced the drive’s post-reset cleanup on a per-queue transition generation. RetainedVringQueue::generation is bumped on each completed transition (mark_selected_queue_armed enable arm + finish_selected_queue_disable disable finish); the drive reads it (retained_queue_generation) IMMEDIATELY before reset_device – the reset boundary, not the far-earlier enabled sample – and the cleanup (mark_retained_vring_queue_disabled_if_epoch) clears the pins only while the queue is still in the epoch the reset quiesced. A disable + re-enable that fully completes AFTER the reset advances the generation, so the cleanup becomes a no-op instead of clearing the freshly armed epoch’s pins; one that completed BEFORE the reset is included in the captured generation, so its pins clear (no stale over-pin). The only residual is a transition completing in the tiny read->reset-MMIO gap whose arm loses the race: a fail-closed over-pin, never an under-pin. The Loom model gained fenced_stale_drive_cannot_clear_a_completed_re_enables_pins, which fails with the fence removed (GENERATION_FENCE = false). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-queue-enable-drive-reset-epoch-fence-local-proof.md. Predecessor (4a-v): docs/tasks/done/2026-06-03/dmabuffer-map-record-before-pte-install-ordering-local-proof.md. 4b. Nic-cap driver process, coupled TX/RX round-trip [DONE 2026-06-03]. Implements the slice-1 Nic interface stub as a live CapObject (kernel/src/cap/nic_grant_source_prod.rs, granted via the new nic KernelCapSource registered in capos-config; client NicClient in capos-rt). Over the SAME shim-brought-up device (RX queue 0 + TX queue 1 enabled by slices 1-3 + 4a), the cap drives the shim’s retained vring physes through virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive, nic_quiesce}: receive() internally drives the coupled ARP-TX-stimulus + RX-poll and returns the received frame inline plus the observed EtherType; transmit() stages a frame into a manager-owned TX bounce page over the retained TX vring and rings the doorbell; macAddress()/linkStatus() read the kernel-mapped virtio-net device-config region. Frames cross the cap boundary as inline Data copied through manager-owned kernel bounce pages (host_physical_user_visible = 0; no host-physical / device-handle exposure); the device is left live for the cap’s lifetime and quiesced once on cap release (nic_quiesce: reset + queues-cleared + release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, no Interrupt cap). The clean independent TX/RX split is deferred to slice 6; userspace IRQ ownership to slice 5. Proof make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip boots the device from userspace, round-trips two sequential frames through the typed Nic cap (observed EtherType 0x0806 over QEMU SLIRP), and emits one cloudboot-evidence: nic-driver-userspace-nic-cap-roundtrip <token> marker with roundtrips=2. The same proof releases the parent DMAPool cap and one pinned ring DMABuffer cap before Nic release, then shows Nic quiesce replaying the blocked buffer detach and the pending parent pool detach completing after the remaining ring buffers are freed. Depends on 4a-ii (the shim-owned-vring data path). Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-nic-cap-roundtrip-local-proof.md.
  4. Userspace IRQ ownership [DONE 2026-06-03]. The userspace NIC driver’s Interrupt cap for its device RX route becomes real, replacing slice-4b’s kernel-latched used-ring polled completion (int_injected = 0). Under cloud_virtio_net_userspace_irq_ownership_proof (implies slice 4b) a new Interrupt grant source (kernel/src/cap/virtio_net_userspace_irq_ownership_proof.rs, replacing the admission-only interrupt_grant_source_prod source via the KernelCapSource::Interrupt arm) programs the staged virtio-net function’s RX MSI-X route (entry 0) mask-first through the landed always-built cap::interrupt_programmed::program_attach_arm_unmask and issues a cap whose: wait blocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch; delivery_count advances, so int_injected flips from 0 – slice 4b had no Interrupt cap on the data path at all). The wake is a bounded kernel-injected dispatch through the route’s real deferred-EOI machinery, not yet a device-autonomous MSI-X raise causally tied to a specific frame; Nic.receive still reads the frame bytes from the used ring, so this slice delivers IRQ-lifecycle ownership (the driver drives the real wait/acknowledge/ mask/unmask), not interrupt-coalesced RX completion. acknowledge retires exactly one deferred LAPIC EOI through device_interrupt::acknowledge_deferred_lapic_eoi_for_route (hardwareDispatchAckDelta = 1); and mask/unmask toggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2) plus the manager-attached route state, through pci::set_msix_table_entry_mask + device_interrupt::{mask,unmask}_device_manager_attached_route. The route is torn down (interrupt_programmed::teardown) on cap release. The driver holds this Interrupt cap alongside the slice-4b Nic/DeviceMmio/DMAPool caps on the same function; it brings the device up from userspace, then drives the owned RX-interrupt lifecycle and reads the completed frame back through Nic.receive. The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted, waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the slice-2/3 vring grants, and the buffer-identity / ring-buffer pinning are reused unchanged (host_physical_user_visible = 0, queue-address reads still refused). No new Interrupt interface or method (the existing wait/acknowledge/mask/ unmask become real for this route). Proof make run-cloud-prod-nic-driver-userspace-irq-ownership emits one cloudboot-evidence: nic-driver-userspace-irq-ownership <token> marker. Task record: docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-irq-ownership-local-proof.md.
  5. Clean TX/RX split [DONE 2026-06-03]. Decouples the coupled receive path into independent TX and RX submission. Under cloud_virtio_net_userspace_clean_tx_rx_split_proof (implies slice 5) the Nic cap’s receive @1 dispatches to the new virtio_net_userspace_rx_dma_proof::nic_receive_independent instead of the coupled nic_receive: it posts a manager-owned RX receive buffer on the retained RX vring, waits on the driver’s OWNED RX interrupt route (the slice-5 device_interrupt::wait_kernel_injected_dispatch dispatch slot, resolved through virtio_net_userspace_irq_ownership_proof::owned_rx_route), retires the deferred LAPIC EOI, and reads the completed frame from the RX used ring – with no internal ARP-TX self-stimulus (it never submits to the TX vring). The RX frame is driven by an external stimulus: the consumer’s preceding independent Nic.transmit of a real broadcast ARP request, which QEMU SLIRP answers; the inbound reply is held in the host net queue until the RX buffer is posted. Nic.transmit stays independent (submits the caller’s frame and rings the TX doorbell with no RX poll; surfaces rx_polls=0). The wake stays the bounded kernel-injected dispatch slice 5 owns (waiter_wake=kernel-injected-dispatch, device_autonomous_raise=not-attempted). Reuses the landed owned-vring / owned-IRQ / DMA-isolation unchanged: no new selected-write register, no new MSI-X surface, no new Nic/Interrupt method (make generated-code-check green), no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). The driver does an independent transmit then a separate independent receive, neither performing the other’s submission (tx_independent=ok, rx_independent=ok, receive_self_stimulus=removed). Proof make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split emits one cloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token> marker. Task record: cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof.
  6. Network-stack process + smoltcp relocation – a second userspace process holding the Nic cap and a bounded time source, running smoltcp, implementing the socket caps while preserving the cap/network.rs contract. Slice 6 is now done, so this slice is decomposed into bounded increments rather than attempted as one step:
    • 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A userspace process runs a minimal smoltcp Interface over a phy::Device adapter backed by the landed independent Nic.transmit/Nic.receive (slice 6), clocked by a Timer cap, and drives one observable Ethernet exchange through SLIRP: smoltcp – not hand-rolled frame code – ARPs the gateway out through Nic.transmit, consumes the reply in through Nic.receive, and emits the queued IPv4/UDP datagram, so the neighbour cache observably advances. No socket caps, no cap/network.rs relocation, virtio_stub.rs unchanged. Proof make run-cloud-prod-network-stack-process-smoltcp-skeleton. Implementation note: the landed Nic cap is not yet self-sufficient – its transmit/receive ride on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself (it also holds the DeviceMmio/DMAPool/Interrupt caps) before running smoltcp. Splitting the bring-up into a separate long-lived NIC-driver service, so the network-stack process holds only Nic + Timer + Console, is folded into the 7c contract-relocation increment; it does not change the proven smoltcp-substrate claim. Task record: cloud-prod-network-stack-process-smoltcp-skeleton-local-proof.
    • 7b (socket layer, DONE 2026-06-03): socket caps over the userspace smoltcp stack – a userspace UdpSocket cap layer (UdpSocketCapLayer) implements the UdpSocket schema’s sendTo/recvFrom semantics over the 7a Interface and proves one bounded UDP request/response: a DNS A query for example.com to SLIRP’s resolver at 10.0.2.3:53 via sendTo, then the decoded response via recvFrom. smoltcp drives every frame through the Nic cap (ARP reply + DNS reply both fetched through Nic.receive, host_physical_user_visible = 0 preserved); Timer clocks the poll. The socket layer is in-process – it does not yet serve the socket interfaces as inter-process transferable capabilities, and it does not touch cap/network.rs (virtio_stub.rs stays fail-closed). Proof make run-cloud-prod-network-stack-smoltcp-socket-caps (one cloudboot-evidence: network-stack-smoltcp-socket-caps <token> marker). Task record: cloud-prod-network-stack-smoltcp-socket-caps-local-proof. Serving the socket interfaces as inter-process caps (a NetworkManager-like broker) and TcpListener/TcpSocket are folded into the 7c contract relocation.
    • 7c (contract relocation): preserve the cap/network.rs contract behind the userspace network-stack process so the production L4 entry points (virtio_stub.rs) stop returning DeviceUnavailable for the armed manifest. This is the body of the whole-slice record (cloud-prod-userspace-network-stack-smoltcp-local-proof), decomposed because it bundles three independently-large pieces:
      • 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b userspace UdpSocketCapLayer as a real inter-process transferable capability. A network-stack server process brings the device up, builds the userspace smoltcp UdpSocket layer, and serves the UdpSocket schema (sendTo/recvFrom/close) over an exported Endpoint; a separate client process re-interprets the served cap as a UdpSocket and drives one bounded DNS A query/response through the production UdpSocketClient, with smoltcp still moving every frame through the Nic cap (host_physical_user_visible = 0). Proof make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc. This is the prerequisite for the kernel contract relocation. cap/network.rs / virtio_stub.rs are unchanged. Task record: cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof.
      • 7c-ii (cap/network.rs relocation, DONE 2026-06-07): route the armed Phase C manifest’s production L4 socket entry points to the userspace network-stack service so the local proof no longer reaches virtio_stub.rs DeviceUnavailable. This is itself decomposed (see “7c-ii Mechanism and Decomposition” below) into 7c-ii(a) – serve TcpListener/TcpSocket as inter-process caps with accept-returns-a-result-cap, an architecture-agnostic prerequisite (cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof) – and 7c-ii(b) – the local serve-from-userspace proof. In 7c-ii(b) (DONE 2026-06-07), the network-stack process boots the non-qemu cloudboot kernel, spawns an application client with only Console plus a userspace-served TcpListenAuthority, returns a userspace-served TcpListener from listen, returns a TcpSocket result cap from accept, and completes one hostfwd TCP request/response through recv/send. The selected architecture remains serve-from-userspace: applications consume the existing typed socket interfaces from a userspace network-stack process instead of extending the legacy kernel-routed socket owner.
      • 7c-iii (TcpListener/TcpSocket, DONE 2026-06-04): the userspace smoltcp stack now serves a real TcpListener/TcpSocket round trip over the Nic cap, using the sustained-receive Nic.receivePoll @4 (slice 7d) for the multi-frame TCP exchange. A single cloudboot service brings the device up from userspace, runs a smoltcp TCP socket listening on port 8080 driven by the non-resetting receivePoll @4 pump, and – against an external host TCP client over a QEMU hostfwd relay – completes one bounded TCP handshake + request/response (asserting the received request equals the expected probe and echoing it back). smoltcp, not hand-rolled frame code, moves every frame; the device stays armed across the SYN/SYN-ACK/ACK/request/response/FIN exchange with no per-frame reset. host_physical_user_visible = 0; queue-address reads refused; the bounce RX pool quiesces + scrubs on teardown. This increment proved the generic TCP socket substrate in-process; 7c-ii(a) later served TcpListener/TcpSocket inter-process with accept-returns-a-result-cap. Neither increment changes cap/network.rs / virtio_stub.rs; that final production-manifest wiring is 7c-ii(b). Proof make run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip. Task record: cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof. Built on the prerequisite cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof.
  7. Kernel smoltcp / virtio-net removal (Phase C exit)done 2026-06-08. The kernel no longer depends on smoltcp, and the qemu-only cap/network.rs socket entry points now fail closed instead of reaching an in-kernel TCP/UDP runtime. The retained virtio-net code is a lower-layer QEMU fixture for PCI/MMIO/virtqueue, ARP, ICMP, and descriptor-generation proofs; it is not the production cloud socket owner. Task record: cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.

Each slice is parented to the appropriate predecessor; slices 7-8 deliver the L4 socket reachability the scope decision sequenced after the cloud milestone.

7c-ii Mechanism and Decomposition

7c-ii is “preserve the cap/network.rs contract behind the userspace network-stack process.” Before the serve-from-userspace path landed, the non-qemu cloud kernel could still grant production L4 methods (NetworkManager.createTcpListener, TcpListenAuthority.listen, TcpListener.accept, TcpSocket.send/recv, the UDP methods) through kernel-side CapObjects in kernel/src/cap/network.rs that call crate::virtio::*. In that build, crate::virtio resolves to kernel/src/virtio_stub.rs: creation entry points return NetworkError::DeviceUnavailable, while existing-handle operations fail closed with invalid-handle errors. Before Phase C exit cleanup, the working smoltcp runtime existed only in the cfg(qemu) kernel/src/virtio.rs; it has now been removed, and the qemu socket entry points match the same fail-closed shape. Relocating the contract means production L4 methods are satisfied by the userspace network-stack service instead of the stub, and non-qemu bootstrap grants for the legacy kernel NetworkManager and TcpListenAuthority sources now fail closed before those CapObjects are minted.

Mechanism constraint. A literal reading – the kernel keeps serving the L4 caps and forwards each call to the userspace service – requires the kernel to originate a capability Call to a userspace-served Endpoint and complete the original caller once the service returns. That kernel-as-client-of-a-userspace- service path does not exist today: the ring/endpoint machinery (kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, kernel/src/cap/transfer.rs) only dispatches SQEs from a userspace process’s ring; the kernel never enqueues a Call against a userspace endpoint nor parks waiting for the Return. Building that inversion is a new kernel IPC subsystem, not a drive-by, and it adds kernel surface that the Phase C exit deliberately avoids by retiring the kernel L4 owner instead.

7c-ii(b) Architecture Decision: Serve From Userspace

On 2026-06-07 the operator selected serve-from-userspace for the final production-manifest wiring. The armed Phase C manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service. The legacy kernel cap/network.rs / virtio_stub.rs socket path is now fenced to cfg(qemu) fixture manifests and stale negative paths: non-qemu manifests that request kernel network_manager or tcp_listen_authority are rejected during bootstrap, so missing served authority does not fall back to the old kernel socket owner.

The rejected alternative for this stage is kernel-brokered forwarding: keeping the kernel as the socket cap server while it calls into a registered userspace network-stack service. That route would add a new kernel-originated Endpoint/deferred-completion subsystem. The selected direction keeps the microkernel boundary cleaner by moving behavior to userspace where that does not compromise security.

Grounding for this decision:

  • kernel/src/cap/network.rs implements the current kernel-served NetworkManager, TcpListenAuthority, TcpListener, TcpSocket, and UdpSocket objects, including result-cap transfer for listener and socket creation.
  • kernel/src/virtio_stub.rs remains the non-qemu negative-result endpoint for stale kernel networking call sites, but bootstrap no longer grants its NetworkManager or TcpListenAuthority callers in production manifests.
  • kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs, and kernel/src/cap/transfer.rs implement userspace-originated Endpoint calls, receive/return, and capability transfer. They do not currently implement a kernel-originated call into a userspace endpoint.
  • The 7c-i and 7c-ii(a) task records prove userspace-served socket caps over the existing Endpoint and RETURN result-cap path: cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof and cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof.
  • docs/research/capability-systems-survey.md and docs/research/spritely-captp-ocapn.md reinforce the local capOS rule used here: explicit object references are authority, and forwarding/proxying should make authority flow and lifetime explicit rather than hiding it behind ambient names.

Current constraints that both options must preserve:

  • Existing caller contract. NetworkManager.createTcpListener, TcpListenAuthority.listen, TcpListener.accept, TcpSocket.send/recv, and the UDP methods keep the same typed Cap’n Proto surfaces unless a later implementation task explicitly changes schema and regenerated bindings.
  • Manifest authority. Only the armed Phase C manifest receives the new production L4 path. Missing registration, missing served cap, stale service identity, or an unarmed manifest fails closed instead of falling back to a broad kernel escape hatch; the legacy kernel socket sources are qemu-only fixtures.
  • No new packet authority. The socket service consumes the existing Nic cap and the landed receivePoll @4 sustained-receive path. This decision does not add DMA, MMIO, IRQ, queue-address, host-physical, GCE, public ingress, TLS, or certificate authority.
  • Private/local Web UI critical path. The first Web UI proof remains local and then private GCE reachability. Public ingress, TLS, firewall/DNS exposure, and live-cloud runs stay gated by their separate task records.
AxisKernel-brokered forwarding (rejected for 7c-ii(b))Serve-from-userspace (selected)
Cap server identityThe kernel keeps serving cap/network.rs objects for the armed manifest and forwards each socket operation to a registered userspace network-stack service.The armed manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service, following the 7c-i / 7c-ii(a) serving pattern.
Kernel IPC deltaAdds a kernel-as-client forwarding path: service registration, endpoint identity validation, kernel-originated call construction, transfer handling, cancellation, and deferred completion back to the original caller.Adds no kernel-originated endpoint call path. The existing userspace caller-to-userspace endpoint path carries the socket methods and result caps.
ABI and IPC riskHigher. The existing ring/endpoint code accepts userspace SQEs and endpoint returns; it does not yet encode the kernel as an endpoint caller. The implementation must specify caller-session metadata, transfer rollback, service disappearance, cancellation, and result-cap insertion semantics for forwarded calls.Lower kernel ABI risk. Schema can remain unchanged if the served cap implements the current socket interfaces. Risk shifts to manifest grant wiring, service startup ordering, endpoint lifetime, and making stale or missing service authority fail closed.
Production-surface compatibilityPreserves the literal “kernel-routed cap/network.rs surface” for the armed manifest, which minimizes caller-visible routing change.Makes the production socket cap userspace-served for the armed manifest. Callers still use the same typed socket interfaces, but the authority source is the network-stack service rather than cap/network.rs.
Fail-closed behaviorRequires a bounded registration table and an explicit “no registered service” error path. Forwarding must not silently reach virtio_stub.rs except as a deliberate unarmed-manifest failure mode.Uses manifest-level grant selection and service liveness checks. If the served cap is absent, stale, or not granted, the non-qemu manifest fails closed; virtio_stub.rs remains only a stale-call/fixture negative path.
Validation burdenMust prove forwarding with at least one socket operation, result-cap transfer through the forwarded path, service crash/cancel cleanup, missing-service failure, and no kernel-thread parking. Before Phase C exit, make run-net covered the old qemu-only socket path.Must prove the armed manifest gets the served network cap, completes the socket round trip through the userspace service, rejects an unarmed or missing-service manifest, and preserves the existing inter-process socket-cap proofs. Phase C exit keeps only lower-layer QEMU virtio-net fixture coverage after the old kernel L4 owner is removed.
Phase C exit interactionLeaves a new kernel forwarding subsystem after kernel smoltcp and virtio-net hot-path removal. Slice 8 would have had to decide whether that subsystem was permanent generic IPC infrastructure or Phase C-specific scaffolding.Aligns with the Phase C exit direction that kernel networking becomes routing/grant setup while L4 service behavior lives out of kernel. Slice 8 removed kernel smoltcp/virtio-net L4 ownership and stale socket paths.

Follow-up task-state changes after the selection:

This decision unblocks 7c-ii(b) implementation. It preserves the fact that 7c-ii(a) was sequenced first because serving TcpListener/TcpSocket as inter-process caps is useful under either architecture, and it now becomes the direct serving substrate for the selected path.

Result-cap transfer needs no new ABI. The RETURN transfer-descriptor ABI that lets a served method return a freshly-minted capability already exists: a CAP_OP_RETURN carries xfer_cap_count transfer descriptors, the kernel inserts them into the caller’s table and publishes CAP_CQE_TRANSFER_RESULT_CAPS (kernel/src/cap/ring.rs::dispatch_return -> insert_prepared_transfer_caps -> write_endpoint_return_result), and the client decodes the returned cap with capos-rt’s CompletedCall::result_cap. The 7c-i UDP server already serves caps over an exported Endpoint through exactly this RETURN path, so the userspace-Endpoint-server side is proven; the kernel’s own cap/network.rs::handle_accept (via insert_socket_result_cap) and the telnet-gateway demo (listener.accept_wait; both since retired) proved the complementary accept-returns-a-TcpSocket-cap shape and its client-consume side. What 7c-ii(a) adds is wiring – a userspace Endpoint server returning a TcpSocket result-cap from TcpListener.accept – not a new ABI. Its hard part is the smoltcp-pump-while-serving interleaving for a blocking multi-frame accept, which is why 7c-iii deferred inter-process TCP serving.

Downstream self-hosted Web UI tasks consume slice 7 without making the Phase C exit cleanup a blocker for first GCE operator proof:

Network usability and post-smoltcp follow-ups are decomposed in Network Usability and Post-smoltcp. They do not change the Phase C critical path above: the local DHCP/IPv4 configuration proof is now done for the first GCE Web UI path, while the system DnsResolver cap, POSIX getaddrinfo, ping/ping6 tools, packet tracing, socket readiness policy, and transport tuning/status are follow-on usability or diagnostics lanes.

Slices 7a-7c are smoltcp relocation: they run the selected smoltcp 0.13.0 build in userspace and preserve the socket contract, adding no new transport mechanic. Transport policy/status — read-only transport status, keepalive/ timeout policy inputs, and the deferred congestion-control evaluation — is a distinct control-plane lane decomposed in the backlog under network-transport-policy-status-decomposition, not part of the relocation slices. The userspace relocation track now has landed UDP and TCP substrate proofs, and the TCP build still runs with CongestionControl::None by build configuration; selecting Reno/CUBIC is a build-feature flip, and any custom TCP mechanic requires separate workload evidence and a reviewed task.

The Sustained-Receive Nic ABI (Former Prerequisite For 7c-iii)

7c-iii (TcpListener/TcpSocket) was blocked on the shape of the landed Nic.receive. This section records the precise constraint, the sustained-receive primitive that lifted it while keeping the settled DMA isolation intact, and the ABI decision.

Status: landed. The Nic.receivePoll @4 method and the kernel-owned bounce RX pool primitive designed below shipped in slice 7d (cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof): cap::virtio_net_userspace_rx_dma_proof::nic_receive_poll arms a pool of NIC_RX_POOL_SIZE manager-owned bounce RX buffers and recycles them individually (copy-out + scrub + slot-generation bump + re-post) with no per-frame device reset; the receivePoll @4 dispatch arm lives in cap::nic_grant_source_prod. The multi-frame proof (make run-cloud-prod-nic-driver-userspace-sustained-receive-pool) drains more than one frame with at least one non-resetting framePresent = false poll and keeps the DMA-isolation assertions green (host_physical_user_visible = 0, queue-address reads refused, quiesce + scrub at teardown). receive @1 is unchanged. The rest of this section is the as-built design record.

The constraint, precisely (cite the real symbols)

The production Nic.receive @1 dispatches (under the frontier slice-6 cloud_virtio_net_userspace_clean_tx_rx_split_proof feature) to virtio_net_userspace_rx_dma_proof::nic_receive_independent (kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs:1437, dispatched from the receive @1 arm in kernel/src/cap/nic_grant_source_prod.rs:309). Each call:

  1. allocates one fresh kernel bounce frame (frame::alloc_frame_zeroed),
  2. authors one RX descriptor + avail entry pointing at it and rings the RX doorbell,
  3. waits on the driver’s owned RX interrupt route (device_interrupt::wait_kernel_injected_dispatch) and retires the deferred LAPIC EOI,
  4. polls the RX used ring for one completion, copies the frame out, and
  5. frees that bounce frame (frame::try_free_frame).

Two properties make this single-frame primitive unable to serve TCP:

  • No pool stays armed between calls. Exactly one RX buffer is posted per call and freed when the call returns; between receive calls the device has no posted RX buffer to master into. A frame that arrives outside a call has nowhere to land.
  • No non-resetting “no frame yet”. On a successful frame the independent path keeps the device live (it does not reset – success arm at virtio_net_userspace_rx_dma_proof.rs:1535), but an empty/timeout poll (RxDmaFailure::UsedRingPollExhausted) takes the error arm and quiesces the device (nic_quiesce_device: reset_device + assert queues cleared + release pins). The predecessor proof path drive_rx_dma (virtio_net_userspace_rx_dma_proof.rs:331) resets on every outcome; the independent path narrowed that to reset-on-empty-poll, but the reset-on-empty remains. So a speculative “is there a frame?” poll with nothing waiting tears the device down.

smoltcp drives the opposite shape: its poll() loop calls the phy::Device RX token speculatively and frequently, expecting “no frame yet” to be a cheap, side-effect-free answer against a device that stays armed with multiple posted buffers, so the asynchronous, multi-frame TCP exchange (SYN-ACK, data segments, ACKs, retransmits arriving whenever the peer chooses, sometimes several in quick succession) can be drained as frames arrive. That is why 7c-iii needed the since-landed sustained-receive ABI before the remaining 7c-ii(b) final-wiring task could stay on hold solely for the operator architecture decision.

The reset-on-empty is not a bug: it is part of the settled DMA-isolation model (docs/dma-isolation-design.md). Frames cross through manager-owned bounce pages; userspace never sees a host-physical or device-usable address; and after a buffer is reclaimed the device must be proven to have stopped mastering it (“if in-flight DMA cannot be proven stopped, revocation escalates to device reset”, docs/dma-isolation-design.md DMAPool Invariants -> Reset). The single-frame path proves that the crude way – reset the whole device. The design problem is to keep the isolation guarantee without resetting the device on every empty poll or every reclaimed buffer.

The sustained-receive primitive

Keep the device armed with a kernel-owned bounce RX pool of N buffers and recycle buffers individually instead of resetting the device:

  • Arm. At first use (or at cap setup) the kernel allocates N manager-owned bounce RX buffers from the driver’s granted DMAPool and posts all N to the RX vring avail ring. The device masters only into these kernel-owned bounce pages; userspace still receives no host-physical or device-usable address (host_physical_user_visible = 0), exactly as the single-frame path proved.
  • Drain one arrived frame (per poll). The kernel reads the RX used ring. If used.idx advanced, a frame landed in bounce slot k; the kernel treats slot k as device-written untrusted input (docs/dma-isolation-design.md “Receive buffers are treated as device-written untrusted input until validated by the driver or stack”), copies the frame bytes out into the inline Data reply bounded by the posted buffer length, then recycles slot k.
  • No frame yet. If used.idx did not advance, the call returns “no frame” with no reset and the device stays armed. This is the cheap speculative poll smoltcp needs.
  • Teardown. on_release (and any unprovable-in-flight-DMA error) still quiesces: reset_device, assert both queues cleared, scrub the whole pool, release the enabled-vring pins, and drop the pool – identical to the existing nic_quiesce discipline. Reset remains the escalation path; it is simply no longer the per-frame path.

The per-buffer invariant that replaces “reset before reclaim”: a bounce slot is re-exposed to the device only after its copy-out completes and its slot ownership generation is bumped, with the slot scrubbed before the re-post. This is the production handle-epoch slot identity (slot + slot_generation, docs/dma-isolation-design.md Production Handle Epoch Invariants and DMAPool Invariants: “Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse”) applied at buffer-recycle granularity instead of device-reset granularity:

  1. the device wrote slot k and signalled completion via the used ring (it is no longer mastering that slot – the per-buffer analogue of “in-flight DMA is proven stopped”);
  2. the kernel copies the bytes out;
  3. the kernel scrubs slot k (residual-state rule, docs/dma-isolation-design.md Residual state) and bumps its slot generation, so a stale descriptor, free, or completion for the prior occupant fails closed;
  4. only then does the kernel re-post slot k to the avail ring (re-arm).

As built, the pool buffers are kernel-private frame::alloc_frame_zeroed pages (never a userspace DMABuffer handle), so the slice does not go through the single-frame device_dma begin_rx_dma_drive_pin drive-pin code path (that pin guards a userspace-submitted DMABuffer’s live-unmapped re-validation, which has no analogue for a manager-private frame). Instead the slice applies the same per-buffer slot-identity discipline modeled on the production handle-epoch slot identity (slot + slot_generation, docs/dma-isolation-design.md) as kernel bookkeeping local to the pool: the device masters only into kernel-owned pages, each slot is scrubbed and its generation bumped before re-exposure, and teardown still quiesces (reset) before reclaim. No new isolation backend, no new IOVA-export rule, no host-physical or device-usable address exported (host_physical_user_visible = 0).

ABI decision: extend Nic with a non-resetting poll receive (option a)

Chosen: option (a) – add a non-resetting poll-receive method to the Nic schema and keep receive @1 as the legacy single-shot:

interface Nic {
    transmit  @0 (frame :Data) -> (result, reason, sideEffect);
    receive   @1 () -> (frame :Data, observedEthertype :UInt16,
                        result, reason, sideEffect);   # legacy single-shot
    macAddress @2 () -> (addr :Data, result, reason, sideEffect);
    linkStatus @3 () -> (up :Bool, result, reason, sideEffect);
    # Sustained, non-resetting receive over the armed kernel-owned bounce RX
    # pool. Returns the next arrived frame, or framePresent = false with no
    # device reset when none has arrived. The device stays armed.
    receivePoll @4 () -> (frame :Data, observedEthertype :UInt16,
                          framePresent :Bool,
                          result :Text, reason :Text, sideEffect :Text);
}

How receivePoll @4 reports its two outcomes without a reset:

OutcomeframePresentframeresult / reason / sideEffect
frame arrivedtrueframe bytes inlineok / frame-received / buffer-recycled (slot copied-out, generation bumped, re-posted)
no frame yetfalseemptyok / no-frame / device-armed (no reset; pool still posted)
fail closedfalseemptyfailed / <reason> / device quiesced (escalation only, not the empty-poll path)

receive @1 semantics are unchanged (it still resets on empty poll), so the 7b UDP request/response proof stays green; only the new method is non-resetting.

Why not option (b) (in-stack sustained-receive loop against a kernel primitive). Option (b) – the network-stack process drives a bounded sustained-receive loop against a kernel primitive directly – was rejected because it still requires the identical new kernel RX-pool machinery (the design above) and drives it outside the typed Nic cap boundary, fragmenting the device-facing authority the whole track funnels through one Nic cap (slices 4b-7). Option (a) keeps the network-stack process holding exactly the Nic cap it already holds, maps the new method one-to-one onto the smoltcp phy::Device RX token (“give me the next arrived frame or nothing”), and is the direct extension of the inline-Data receive ABI the track already accepted. The cost of (a) is bounded and already named in the abi hazard: one new schema method, a make generated-code-check regeneration of the checked-in capnp bindings, and updating every exhaustive Nic method match (receivePoll @4 arm in kernel/src/cap/nic_grant_source_prod.rs; the NicClient in capos-rt). A receiveBatch returning List(Data) was considered and deferred: the smoltcp RX token consumes one frame at a time, so single-frame poll is the clean match and batching is a separately-justified later optimization.

Design Grounding

  • docs/proposals/network-reachable-datapath-scope-decision.md – the parent scope decision (Option A) that opened this track.
  • docs/proposals/networking-proposal.md, Part 3 (Phase C) – the accepted decomposition, its prerequisites table and exit criteria, and the Nic draft this doc adopts (inline Data).
  • docs/dma-isolation-design.md S.11.2 – the DMA-isolation invariants and the driver-transition gate, already satisfied by the landed bounce / IOMMU-IOVA track that slice 2 reuses.
  • kernel/src/cap/device_mmio.rs (notify_doorbell), kernel/src/cap/dma_buffer.rs (export labels), kernel/src/cap/virtio_net_polled_provider.rs (kernel-owned vring) – the primitives whose current limits define the cap-surface gap.

Real-Filesystem Decision: Role-Split, Not One Format

Decision

capOS does not adopt a single general-purpose on-disk filesystem. It adopts a role-split in which each storage role uses the format that fits it, behind the same capability interfaces:

  • (A) capOS-managed data and state stays capnp-native. Evolve the existing CAPOSWF1 writable-filesystem and CAPOSST1 persistent-store fixed layouts (kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs); do not replace them with a general-purpose format. These already have a crash-consistency proof in tree (make run-storage-writable-recovery), so a format swap would discard a tested durability story for no consumer benefit.
  • (B) Host-populated and interop images gain READ-ONLY FAT32. Add a read-only FAT32 Directory/File backer over the existing BlockDevice, using the fatfs no_std crate. FAT32 is the one standard interop format with a maintained no_std read crate and zero licensing risk (the FAT long-name patents have expired; fatfs is MIT). It is already structurally part of the boot path – the EFI System Partition Limine reads is FAT32 (docs/backlog/hardware-boot-storage.md).
  • (C) Host tooling consolidates onto one capnp image tool. Retire the per-format tools/mkstorage-*.py byte-offset scripts (each hand-encodes a fixed layout at literal offsets) in favor of one schema-driven image tool, so the on-disk layout has a single typed source of truth instead of N parallel offset hazards.

Why the Capability Layer Is Unchanged

The Directory, File, and Store interfaces in schema/capos.capnp are the contract; the on-disk format lives below them as another CapObject backer, so adding FAT32 adds no schema surface and no new caller-visible behavior. The interfaces already model every operation a format backer must answer:

These kernel backers (readonly_fs.rs, writable_fs.rs, persistent_store.rs, and the RAM file/directory/store/namespace caps) are proof/fixture surface, not production storage routes – they are gated behind the qemu feature (with storage_fat_read / cloud_*_over_nvme_proof variants) and fail closed in the default production kernel. Production storage is userspace-served by the demos/storage-fs-service, demos/storage-persist-service, and demos/store-service services; see Kernel Storage Cap Backers Are Fixtures. The role-split below still governs which on-disk format sits beneath the cap interfaces in those proofs and in any future userspace format backer.

  • Directory: open @0, list @1, mkdir @2, remove @3, sub @4, create @5, rename @6 (schema/capos.capnp:1824).
  • File: read @0, write @1, stat @2, truncate @3, sync @4, close @5 (schema/capos.capnp:1793).
  • Store: put @0, get @1, has @2, delete @3 (schema/capos.capnp:1857).

A read-only backer answers the read/list/open/stat methods and fails closed on every mutation, exactly as readonly_fs.rs does today (kernel/src/cap/readonly_fs.rs:618 rejects mkdir/remove/sub/create/ rename). Attenuation is structural, not a rights bitmask: a read-only File is a wrapper that rejects write/truncate/sync, per the schema comment at schema/capos.capnp:1798.

Known caveat (partially lifted): stat/info timestamps were originally stubbed to zero in every filesystem backer. The Slice 4 timestamp increments lift this for the CAPOSWF1 writable filesystem only – it now persists real created/modified timestamps in the node record, carries the corresponding ClockProvenance label from the same WallClock source, and returns the timestamp values from File.stat (proof make run-storage-writable). The read-only CAPOSRO1 and persistent_store CAPOSST1 backers still expose zero/unknown timestamp state, and FAT32 read can surface real FAT directory-entry timestamps later; those remain named Slice 4 follow-ups.

Why Not ext4 / exFAT / littlefs / FAT-Write

  • ext4-read: deferred under an explicit trigger. capOS reads no real third-party filesystem today and does not need to for boot: Limine reads the FAT32 ESP, the kernel image is include_bytes! or read from ISO 9660 (kernel/src/iso/), and the cloud boot disk is a capOS-authored GPT + FAT-ESP, never a provider ext4 root. That collapses the usual “must read the provider’s ext4 root” argument. ext4-read is deferred behind a single explicit trigger: capOS must read a disk it did not format. Until that exists, ext4’s large read-only parser surface buys nothing.
  • ext4-write: rejected. It would be the first writable real-disk format and has no crash-consistency story in tree; landing it without a recovery proof regresses the durability bar CAPOSWF1 already meets.
  • exFAT: rejected. Patent surface, no role advantage over FAT32 for the host-interop slot.
  • littlefs / SimpleFS: rejected. FFI plus vendoring cost with no winning role – managed state is already served by the capnp-native layouts, and host-interop wants a format the host actually writes (FAT32).
  • FAT-write: rejected for now. No crash-consistency story; it would be the first writable format landing without a recovery proof. FAT32 stays read-only in this decision.

Decision Matrix

Axes: host-interop fit; no_std read/write implementation cost; crash-consistency story; capability/capnp fit; cloud-disk-read need today; licensing; available crates.

FormatHost-interopno_std read / write costCrash-consistencycapnp fitCloud-disk-read needLicensingCrates
FAT32 (read-only)High (host writes it; ESP already FAT32)Read: low (fatfs) / write: out of scopen/a (read-only)Backer below Directory/Filen/a (capOS authors its disks)Clean (FAT patents expired; fatfs MIT)fatfs no_std
exFATMediumHigh / Highn/aSamen/aPatent surfaceNone no_std mature
ext4-readLow (no consumer today)High (large parser) / —n/a (read-only)SameNone today (trigger only)CleanNone mature no_std
ext4-writeLowVery high / very highNone in treeSameNoneCleanNone mature no_std
littlefs / SimpleFSLowMedium (FFI+vendor) / mediumHas its own storySameNoneCleanFFI/vendor
capnp-native (CAPOSWF1/CAPOSST1)None (capOS-only)Already in treeProven (run-storage-writable-recovery)Nativen/aCleanIn tree

Phased Plan

  • Slice 0 (this doc). Record the role-split decision and the matrix.
  • Slice 1 (landed 2026-06-02 20:59 UTC). Vendored fatfs (with VENDORED_FROM.md, vendor/fatfs-no_std/) and added a read-only FAT32 Directory/File backer over virtio-blk: kernel/src/cap/fat_fs.rs, a BlockStorage adapter over the virtio-blk BlockDevice driving the vendored fatfs read path. Host image built with real mkfs.fat + mcopy (2 files, one multi-cluster). Smoke make run-storage-fat-read reads the multi-cluster file back through Directory.open -> File.read and asserts the bytes plus the fail-closed mutations. Grant-source realization deviation: the task text proposed a new fat_fs_root KernelCapSource, but KernelCapSource is a schema/capos.capnp enum (and capos-config decode) outside the task’s write_scope. The backer is instead selected under a new storage_fat_read kernel feature on the existing read_only_fs_root source – mirroring how that source already selects its Virtio vs NVMe backend – so it needs no new KernelCapSource and no schema change, keeping the conflict surface disjoint from the in-flight NVMe graduation (which edits readonly_fs/writable_fs/persistent_store). Provenance map: FAT32 (read-only backer). Task record: cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof.
  • Slice 2 (landed 2026-06-03 01:44 UTC). FAT32 read over the NVMe BlockDevice arm. Its prerequisite – the NVMe read-arm graduation (cloud-prod-nvme-storage-graduate-readarm-local-proof) – had landed, so the slice stacks on an always-built read arm rather than a per-proof feature: it added an Nvme BlockSource variant to fat_fs.rs (deferred mount via FatMount, mirroring readonly_fs’s NVMe arm) and proves a host-authored mkfs.fat image (the pre-populated NVMe medium content, no manager seed) read back over the graduated NVMe read arm behind the unchanged Directory/File cap contract. Selected by a new non-qemu cloud_fat_read_over_nvme_proof feature on the existing read_only_fs_root source (no new KernelCapSource, no schema change); its cap-waiter Interrupt route + provider-fat-read-over-nvme marker come from kernel/src/cap/fat_read_over_nvme_proof.rs. Because the FAT cluster-chain walk issues many single reads per boot, the proof raises the I/O queue depth to 64. Proof: make run-cloud-provider-fat-read-over-nvme. Task record: cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof.
  • Slice 3 (first increment landed 2026-06-03 03:36 UTC; second increment landed 2026-06-03 04:08 UTC; third increment landed 2026-06-03 05:47 UTC; fourth increment landed 2026-06-03 08:25 UTC; seeded installable writable increment landed 2026-06-06 13:38 UTC at ac0c5e2d; final fixture retirement; CAPOSST1 + empty/seeded co-located CAPOSWF1 + CAPOSRO1 + NVMe-writable CAPOSWF1). The host capnp image tool retired the hand-encoded capnp-layout Python fixtures one layout at a time. The first increment ported the CAPOSST1 persistent-Store image producer from the retired byte-offset script tools/mkstore-image.py to a typed Rust host tool (tools/mkstore-image/, a standalone host crate built on the host target via cargo test-mkstore-image, like tools/mkmanifest/). Later increments added --writable, --readonly-fs, --writable-nvme, and seeded --writable modes for the empty co-located CAPOSST1+CAPOSWF1 image, CAPOSRO1 read-only filesystem image, fixed-size (NVME_NAMESPACE_BLOCKS = 32768-block / 16 MiB) NVMe-writable CAPOSWF1 namespace image, and installable-system seeded writable variants. The kernel CAPOSST1/CAPOSWF1/CAPOSRO1 layouts (including NVME_NAMESPACE_BLOCKS), the Store/Directory/File contracts, and the disk bytes the kernel reads are all unchanged: the earlier migration proved byte identity against the retired Python outputs, and cargo test-mkstore-image now pins the maintained Rust outputs with golden byte checks. The re-pointed reboot/recovery/read-only proofs stay green reading the tool-produced image. The host-authored FAT image path (tools/mkstorage-fat-read-image.py) stays on real mkfs.fat/mcopy tooling — it is not a hand-rolled capnp byte-offset layout, so it is not a target for the typed capnp image tool. The Python capnp-layout builders have been retired; the Rust tool is the maintained capnp-native fixture path.
  • Slice 4 (decomposed; FAT and capnp-native increments landed in part). capnp-native enhancements: real stat timestamps and store compaction on the managed layouts. The first bounded increment landed – the CAPOSWF1 writable filesystem now persists created/modified timestamps in the node record’s reserved trailing bytes (no field moved, record stays 128 bytes, format version unchanged) and returns them from File.stat, sourced from the WallClock timebase, with the on-disk layout and the forced-poweroff recovery proof held byte-stable (cloud-prod-fs-capnp-native-stat-timestamps-local-proof, proofs make run-storage-writable / make run-storage-writable-recovery). The provenance increment threads the same WallClock source into the writable backer and uses the node-record provenance bytes to carry the ClockProvenance label alongside created/modified; File.stat remains schema-stable and the local proof records the stored labels through the storage smoke log. The FAT increment now surfaces valid FAT directory-entry created/modified values from the host-authored read-only image through the same schema-stable File.stat fields over both virtio-blk and NVMe. The proof logs distinguish metadata_provenance=fat-directory-entry from CAPOSWF1’s WallClock provenance and keep FAT’s timezone-free/two-second-modified-time limits explicit. The second bounded increment landed CAPOSST1 persistent-Store compaction: when a new put would exhaust the entry table or data cursor and tombstones exist, the kernel rewrites live entries through a shadow generation before recommitting the canonical front generation; make run-storage-persist proves pre-compaction write, delete/tombstone, compaction-triggered write, reboot, post-reboot reads, and tombstone absence (storage-caposst1-store-compaction-local-proof). Remaining follow-ups: timestamps and timestamp provenance on the other managed/read-only layouts (CAPOSST1 Store, CAPOSRO1).
  • Slice 5 (deferred). ext4-read, only once the explicit trigger (“must read a disk capOS did not format”) materializes.

Relationship to the NVMe Graduation

The NVMe BlockDevice graduation and real-FS work are stacked, not competing:

  • The graduation sits below BlockDevice – it moves the NVMe read/write/flush arms into always-built production behind fail-closed runtime probes (cloud-prod-nvme-storage-graduate-readarm-local-proof).
  • Real-FS sits above BlockDevice – it adds new CapObject backers (fat_fs.rs) that read through whatever BlockDevice provides.

Slice 1 deliberately reads over virtio-blk and adds a new file, so its conflict surface is disjoint from the graduation’s edits to the existing storage modules. Slice 2 is the join point, sequenced after the graduation landed: it consumes the always-built NVMe read arm (it does not modify it) by adding the Nvme BlockSource arm to the same fat_fs.rs.

Design Grounding

  • kernel/src/cap/readonly_fs.rs – the read-only Directory/File over BlockSource pattern Slice 1 mirrors, including the fail-closed mutation arm.
  • kernel/src/cap/writable_fs.rs, kernel/src/cap/persistent_store.rs – the capnp-native managed layouts (CAPOSWF1/CAPOSST1) the decision evolves rather than replaces.
  • schema/capos.capnp – the Directory/File/Store contract the format backers serve.
  • docs/backlog/hardware-boot-storage.md – the storage track and the FAT32 ESP/GPT boot-disk facts that collapse the ext4 argument.

Proposal: capos-service

Renamed from libcapos-service to capos-service to keep the planned Rust framework crate name distinct from the C-substrate staticlib (libcapos.a, built from the libcapos/ crate). The two layers are unrelated: libcapos is the C ABI for C consumers, and capos-service is the Rust framework userspace services link against.

Define a userspace service framework above capos-rt for long-running capOS services. The library should provide common lifecycle, endpoint, readiness, shutdown, context, metrics, and budgeting mechanics without adding a generic kernel Service capability or a kernel-level phase machine.

Current State

Slice 1 is implemented. capos-service/ is a standalone no_std crate, not a root workspace member, and depends on capos-rt without modifying the runtime. It exposes a minimal ServiceMain/ServiceRuntime framework with ordered initialize, dependency-wait, ready, run, drain, shutdown, and cleanup phases. The first converted proof was demos/telnet-gateway: the gateway performed CapSet validation and scoped listener setup through the lifecycle framework and printed a capos-service readiness marker. That demo is since removed with the kernel socket owner (make run-telnet is retired); the crate currently has no in-tree consumer, and compile coverage comes from make capos-service-check until the next service adopts the framework.

The initial crate deliberately does not add metrics, resource-budget hooks, endpoint serve-loop helpers, graceful handoff, or generic shutdown authority. Those remain later slices, grounded by Resource Accounting and Quotas and the error-boundary rules in Error Handling.

The immediate target is terminal/networking lifecycle: byte-stream terminal hosting, Telnet/TLS/SSH gateway plumbing, listener accept loops, shell launch, proxying, cleanup, and observable shutdown. HTTP/fetch services come later.

Problem

Current services duplicate the same shape:

  • discover bootstrap caps;
  • wait for dependencies;
  • mark readiness through log output or implicit behavior;
  • run accept or endpoint receive loops;
  • spawn children or proxy byte streams;
  • release result caps and temporary state;
  • log or count failures;
  • shut down after EOF, error, process exit, or supervisor request.

Duplicating that lifecycle is tolerable for proofs, but it is a poor foundation for production gateway, storage, agent, monitoring, and network services. Repeated hand-rolled loops are also where capability leaks, stuck children, incorrect close ordering, and hidden unbounded work appear.

Layering Decision

The stack remains:

schema/capos.capnp
  stable authority-bearing interfaces

capos-rt
  raw runtime and transport:
  bootstrap, CapSet, ring client, typed handles, completion matching,
  release flushing, exception decoding

capos-service
  generic userspace service container:
  lifecycle, endpoint loops, readiness, shutdown, background tasks,
  metrics, context, resource hooks

domain libraries
  HTTP/fetch, terminal host, storage, supervisor, agent tools

init/supervisors
  compose services by passing capabilities, not global names

capos-service is not a new authority source. It wraps and narrows capabilities the process already holds. The kernel still sees ordinary typed capability calls and ordinary process lifecycle.

Core Surface

Initial framework pieces:

  • Service lifecycle: initialize, dependency wait, ready, run, drain, shutdown, and final cleanup.
  • Endpoint serve loops: generated or handwritten helpers for RECV, decode, dispatch, RETURN, exception return, cancellation, and release.
  • Readiness handles: typed local handles or service-exported readiness caps, not global service names.
  • Shutdown and drain: cancellable waits, child/process-handle cleanup, listener stop, in-flight request drain, bounded force-close.
  • Background tasks: timers, periodic health checks, metrics export, and discovery loops with explicit cancellation.
  • Request/session context: owned context object per request or session containing caller-session metadata, derived policy, resource reservations, transfer state, timing, and audit correlation.
  • Metrics hooks: bounded counters and summaries; no unbounded per-user, per-cap-id, or per-method labels by default.
  • Resource budgeting: reservation/donation hooks that call into the relevant ledger owner; the framework records what was reserved and releases it on every exit path.
  • Error boundary: preserve the error-handling split from error-handling-proposal.md: CQE status for transport/kernel dispatch failure, CapException for capability infrastructure failure, and schema result unions for normal domain outcomes.
  • Graceful handoff hooks: transfer or drain listeners, endpoint loops, child handles, background tasks, and in-flight request state during upgrade or supervisor-directed replacement. Handoff must be explicit; silent cloning of authority or abandoning in-flight work is a bug.

First Target: Terminal And Networking

The first useful slice should be:

  1. TerminalSessionFromByteStream / byte-stream terminal host.
  2. Lifecycle wrapper around accept, session minting, proxying, and cleanup.
  3. Request/session context and metrics hooks.
  4. Network service container for listener-backed services.
  5. HTTP/fetch lifecycle only after terminal/networking proves the cleanup and authority model.

This ordering deliberately exercises the hard lifecycle edges before adding HTTP convenience: authenticated session creation, shell spawn, bidirectional byte proxying, EOF/close/error ordering, repeated connect/disconnect, and release of terminal/session/process result caps.

Authority Rules

  • The framework must not accept ambient service names, raw global handles, or stringly typed service discovery.
  • Hooks receive narrow capabilities, not ambient process authority.
  • Request/session context is lifecycle-owned and cannot outlive the request/session that created it.
  • Background tasks are budgeted, cancellable, and observable during shutdown.
  • Retry policy must encode side-effect safety through idempotency, operation ids, or a domain-specific no-retry rule.
  • Pool keys for reusable resources include every authority and identity field that changes policy: target, protocol, TLS identity, cap/object epoch, caller/session reference, namespace, tenant, and transformation policy.
  • Cache keys must include tenant, session, and authority dimensions where those dimensions affect disclosure or correctness.
  • Protocol parsers must drain or close before stream reuse.
  • Readiness means the service can actually accept authorized work; config parse success is not enough.
  • Shutdown must either drain, cancel, or explicitly transfer all in-flight work.

Non-Goals

  • No generic kernel Service capability.
  • No kernel callback registry or phase machine.
  • No plugin ABI that passes phase_id and bytes through a single generic cap.
  • No global service discovery namespace.
  • No HTTP-first framework that delays terminal/networking lifecycle cleanup.
  • No replacement for capos-rt transport primitives.

Implementation Sequence

  1. Implemented: draft shared ServiceMain/ServiceRuntime shape for one process and convert the plaintext Telnet gateway to prove the lifecycle wrapper without changing its QEMU behavior.
  2. Factor byte-stream terminal host lifecycle around TerminalSessionFromByteStream.
  3. Convert another focused terminal or gateway proof only after the byte-stream terminal host split is ready.
  4. Add request/session context and bounded metrics hooks.
  5. Add readiness and shutdown/drain helpers.
  6. Add endpoint serve-loop helpers that preserve typed schema authority.
  7. Add resource reservation/donation hooks.
  8. Consider HTTP/fetch domain library only after terminal/networking proofs pass.

Verification

Initial proof gates:

make docs
make run-terminal
make run-telnet or qemu-telnet-harness
focused close/reconnect proof
hidden password behavior remains byte-identical
child shell receives no raw network/spawn/listener authority
gateway cleanup releases terminal/session/process handles on EOF/error/shutdown

Later endpoint-helper gates should add targeted tests for exception return, result-cap release, cancellation, and resource rollback.

  • Service Architecture defines the capability-based service composition, authority-at-spawn, and service graph policy that capos-service consumers must respect; the framework wraps capabilities granted through that model rather than minting new authority.
  • Cloud Deployment describes the cloud VM surface (provider storage/NIC drivers, cloud clocking, instance bootstrap) that future capos-service listener and gateway services will run on top of once the userspace DeviceMmio/DMAPool/Interrupt authority gate exists.
  • Pingora research records the framework precedent and rejects importing Pingora’s HTTP proxy model into the kernel.
  • Telnet over TLS Shell and SSH Shell Gateway define the terminal factory and remote-ingress boundaries.
  • Error Handling defines the three error layers that generated clients and service helpers must preserve.
  • Resource Accounting and Quotas defines the ledger vocabulary for budgeting/donation hooks.

Proposal: Capability-Based Binaries, Language Support, and Compatibility Adapters

How userspace binaries receive, use, and compose capabilities, from the native Rust runtime through future language runtimes and compatibility adapters.

Current State

The init binary (init/src/main.rs) and smoke services are no_std Rust binaries over capos-rt. The runtime owns _start, fixed heap initialization, CapSet parsing, exit/cap_enter syscall wrappers, typed clients, result-cap adoption, queued release flushing, and panic output. Init reads the BootPackage manifest, validates the metadata-only service graph, spawns child services through ProcessSpawner, waits on ProcessHandles, and exits. The former raw bootstrap syscall and demo-support runtime shims are historical; demo support now keeps only low-level transport helpers for intentionally malformed SQE/CQE smokes.

Userspace now has a checked-in targets/x86_64-unknown-capos.json custom target that exposes target_os = "capos" while preserving the current static ELF, soft-float, no_std baseline. The kernel remains on the repository default x86_64-unknown-none target. init, demos, shell, and the capos-rt smoke binary build through custom-target Cargo aliases, and checked-in CUE manifests embed userspace from target/x86_64-unknown-capos/release paths. The remaining future work is hardening this target contract into a broader toolchain and packaging interface rather than treating it as a probe.

The kernel-side roadmap provides the capability ring (SQ/CQ shared memory plus cap_enter, implemented), scheduling, and IPC. This proposal covers the userspace half: what binaries look like, how they are built, and how existing software can be adapted to a system with no ambient authority.

Part 1: Native Userspace Runtime (capos-rt)

The Historical Problem

Before capos-rt, every userspace binary had to:

  • Define _start and a panic handler
  • Set up an allocator
  • Construct raw syscall wrappers
  • Manually serialize/deserialize capnp messages
  • Know the syscall ABI (register layout, method IDs)

That was acceptable for one proof-of-concept binary. It does not scale to dozens of services, and the current tree has moved those mechanics into capos-rt.

Solution: A Userspace Runtime Crate

capos-rt is a no_std + alloc Rust crate that every native capOS binary depends on. It provides:

1. Entry point and allocator setup.

#![allow(unused)]
fn main() {
use capos_rt::{Console, ConsoleClient, Runtime};

fn service_main(mut runtime: Runtime) -> i64 {
    let console = match runtime.capset().get_typed::<Console>(b"console") {
        Ok(cap) => cap,
        Err(_) => return 1,
    };
    let mut ring = match runtime.ring_client() {
        Ok(ring) => ring,
        Err(_) => return 2,
    };
    let mut client = ConsoleClient::new(console);
    match client.write_line_wait(&mut ring, "Hello from capOS", u64::MAX) {
        Ok(()) => 0,
        Err(_) => 3,
    }
}

capos_rt::entry_point!(service_main);
}

2. Syscall layer. Raw syscall asm wrapped in safe Rust functions. The entire syscall surface is 2 calls – new operations are SQE opcodes, not new syscalls:

  • sys_exit(code) – terminate the current thread; the process exits when this was its last live thread (syscall 1)
  • sys_cap_enter(min_complete, timeout_ns) – flush pending SQEs, then wait until N completions are available or the timeout expires (syscall 2)

The accepted in-process threading contract preserves this two-syscall surface: thread exit is available through both the raw terminal syscall and the typed ThreadControl.exitThread capability call.

Capability invocations go through the per-process SQ/CQ ring. capos-rt provides helpers for writing SQEs and reading CQEs:

#![allow(unused)]
fn main() {
/// Submit a CALL SQE to the capability ring and wait for the CQE.
pub fn cap_call(
    ring: &mut CapRing,
    cap_id: u32,
    method_id: u16,
    params: &[u8],
    result_buf: &mut [u8],
) -> Result<usize, CapError> {
    ring.push_call_sqe(cap_id, method_id, params);
    sys_cap_enter(1, u64::MAX);
    ring.pop_cqe(result_buf)
}
}

3. Cap’n Proto integration. The current runtime uses handwritten typed clients over schema-defined method ids and message shapes. Shared generated schema bindings live through capos-config; broad generated client bindings for capos-rt remain future work. The runtime owns transport lifetime and completion matching, while each typed client owns its interface-specific message encoding.

4. CapSet – the initial capability environment.

At spawn time, the kernel writes the process’s initial capabilities into the read-only CapSet page and passes its address to _start. capos-rt parses this into a typed lookup surface over name, local CapId, and interface id.

#![allow(unused)]
fn main() {
struct CapEntry {
    cap_id: u32,        // authority-bearing slot in the process CapTable
    interface_id: u64,  // Cap'n Proto interface TYPE_ID for type checking
}

impl CapSet {
    /// Get a typed capability by manifest name.
    pub fn get_typed<T: CapabilityType>(
        &self,
        name: &[u8],
    ) -> Result<Capability<T>, CapSetError> { ... }

    /// Iterate manifest-order entries for diagnostics and shell inspection.
    pub fn iter(&self) -> impl Iterator<Item = CapSetEntryRef> { ... }
}
}

interface_id is not a handle. It is metadata carrying the Cap’n Proto TYPE_ID for the interface expected by the typed client. The handle is cap_id. A typed client constructor must check that entry.interface_id == T::TYPE_ID, then store the local CapId. Normal CALL SQEs do not need to repeat the interface ID because each capability table entry exposes one public interface. The ring SQE keeps fixed-size reserved padding for ABI stability, not a required interface field for the system transport.

This matters for the system transport because several capabilities can expose the same interface while representing different authority: a serial console, a log-buffer console, and a console proxy all have the Console TYPE_ID, but different CapId values.

Crate Structure

capos-rt/
  Cargo.toml          # no_std + alloc, depends on capnp
  build.rs            # userspace linker arguments
  src/
    lib.rs            # type markers, owned handles, entry_point! macro
    entry.rs          # _start, Runtime, bootstrap validation
    syscall.rs        # raw asm syscall wrappers
    capset.rs         # CapSet lookup and iteration helpers
    client.rs         # handwritten typed clients
    ring.rs           # single-owner ring client and completion matching
    alloc.rs          # userspace heap allocator setup

capos-rt is NOT a workspace member (same as init/ – needs different target/linker handling from the kernel). It’s a path dependency for userspace crates.

Init On The Current Runtime

init/src/main.rs is already a capos-rt user. Its init_main(Runtime) entry is registered with capos_rt::entry_point!, obtains typed bootstrap caps from the runtime CapSet, reads the BootPackage manifest, validates the service graph, resolves spawn grants, launches children through ProcessSpawnerClient, waits on ProcessHandleClient, and reports failures through the Console client.

Part 2: Capability-Based Binary Model

Binary Format

ELF64, same as now. The kernel’s ELF loader (kernel/src/elf.rs) already handles PT_LOAD segments. No changes to the binary format itself.

What changed from the early prototype to the current runtime baseline is the ABI contract between kernel and binary:

AspectHistorical prototypeCurrent capos-rt baseline
Entry pointcrate-local _start(), no argsruntime-owned _start(ring_addr, pid, capset_addr)
Syscall ABIad-hoc (rax=0 write, rax=1 exit)SQ/CQ ring + sys_cap_enter + sys_exit
Capability accessnoneread-only CapSet page validated by capos-rt
SerializationnoneCap’n Proto messages encoded by typed clients
Allocatornone or crate-localruntime-owned fixed heap

Initial Capability Passing

The kernel communicates bootstrap state through _start arguments and fixed userspace mappings. The implemented shape is:

  • ring_addr: the process capability ring, expected to equal RING_VADDR.
  • pid: the process identifier for diagnostics/runtime bookkeeping.
  • capset_addr: read-only bootstrap CapSet page populated from the manifest and spawn grants.

Earlier options considered:

Option A: Well-known page. Kernel maps a read-only page at a fixed virtual address (e.g., 0x1000) containing a capnp-serialized InitialCaps message:

struct InitialCaps {
    entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
    name @0 :Text;
    id @1 :UInt32;
    interfaceId @2 :UInt64;
}

Option B: Register convention. Pass pointer and length in rdi/rsi at entry. Simpler, but the data still needs to live somewhere in user memory.

Option C: Stack. Push the cap descriptor onto the user stack before iretq. Similar to how Linux passes auxv to _start.

Option A is cleanest – the page is always there, no calling-convention dependency, and it naturally extends to passing additional boot info later.

Service Binary Lifecycle

1. Kernel loads ELF, creates address space, populates cap table
2. Kernel maps InitialCaps page at well-known address
3. Kernel enters userspace at _start

4. capos-rt _start:
   a. Initialize heap allocator
   b. Parse InitialCaps page into CapSet
   c. Call user's main(CapSet)

5. User main:
   a. Extract needed caps from CapSet
   b. Do work (invoke caps, serve requests)
   c. Optionally export caps to parent once ProcessHandle export lookup exists

6. On return from main (or sys_exit):
   a. Kernel destroys process
   b. All caps in process's cap table are dropped
   c. Parent's ProcessHandle receives exit notification

Part 3: Language Support Roadmap

The current manual status page for this subject is Programming Languages. This proposal owns the longer roadmap and should not be read as implemented support for every language listed below.

Implemented Baseline: Rust (no_std + alloc)

Rust is the only implemented booted language path. Native services use #![no_std], alloc, capos-rt, static ELF binaries, and the targets/x86_64-unknown-capos.json userspace target. This fits the current kernel because it does not require a libc, dynamic linker, process environment, global filesystem, or ambient socket namespace.

Rust remains the default implementation language for core capOS services until the runtime, schema, and packaging contracts are stable. That is a project priority, not a rule that every future service must be written in Rust.

Future: Rust std

Rust std support is not implemented. It requires an operating-system backend for filesystem, networking, threads, time, standard I/O, process, environment, and synchronization APIs. On capOS those APIs must get authority from granted capabilities such as Directory, File, TcpSocket, Timer, ThreadSpawner, ThreadControl, ParkSpace, StdIO, and ProcessSpawner.

The project has not selected whether Rust std should be implemented directly over native capOS capabilities, through a POSIX compatibility adapter, or in a hybrid form. Until that decision is made, native no_std + alloc Rust over capos-rt remains the supported Rust path.

C via libcapos

The C substrate is in tree at Phase 0. The libcapos/ crate compiles to libcapos.a, a thin Rust staticlib that exposes the capos-rt syscall, ring CALL, CapSet lookup, and global allocator under an extern "C" ABI. C binaries link statically against the archive, share the userspace ELF layout used by Rust demos, and run inside the existing capos-rt _start chain. make run-c-hello boots a C main() that calls Console.writeLine, Timer.now, EntropySource.fill, and VirtualMemory wrappers through libcapos and exits cleanly. make run-c-pipe boots a second native C smoke that creates a kernel pipe through the typed ProcessSpawner.createPipe wrapper, writes and reads a marker through typed Pipe wrappers, closes the writer, observes EOF, and exits cleanly.

The current substrate is intentionally narrow: capability primitives, hand-written typed wrappers (capos_console_write_line, capos_timer_now, capos_entropy_fill, the capos_virtual_memory_{map,unmap,protect} trio, capos_process_spawner_create_pipe, and capos_pipe_{read,write,close}), raw syscalls, and the heap shim. The Pipe wrapper is a typed bridge over the existing transferred-result-cap path; it does not make capos_cap_call() a general transfer ABI, which still refuses transfer-bearing completions with CAPOS_E_TRANSFER_NOT_SUPPORTED. Anything POSIX-shaped (errno, fd table, open/read/write, signals, fork/exec, sockets) belongs in the separate libcapos-posix layer above libcapos. Generated typed wrappers for the remaining capabilities (NetworkManager, Endpoint, etc.), a stable C ABI for cap-transfer (today the v0 surface refuses transfer-bearing completions with CAPOS_E_TRANSFER_NOT_SUPPORTED), and per-thread runtime routing are also future work. Until that routing or a POSIX pthread layer lands, libcapos v0 is fail-closed for C-created capOS threads: capos_cap_call rejects bootstrap ThreadSpawner capabilities with CAPOS_E_THREADING_UNSUPPORTED, and concurrent or re-entrant runtime borrows return CAPOS_E_RUNTIME_BUSY.

The target libcapos shape is a static library providing:

#include <capos.h>

// Ring-based capability invocation (synchronous wrapper around SQ/CQ ring)
int cap_call(cap_ring_t *ring, uint32_t cap_id, uint16_t method_id,
             const void *params, size_t params_len,
             void *result, size_t result_len);

// Typed wrappers (generated from .capnp schema)
int console_write(cap_t console, const void *data, size_t len);
int console_write_line(cap_t console, const char *text);

// CapSet access
cap_t capset_get(const char *name);
uint64_t capset_interface_id(const char *name);

// Syscalls (the entire syscall surface -- 2 calls total)
_Noreturn void sys_exit(int code);                   // terminate current thread
uint32_t sys_cap_enter(uint32_t min_complete,        // flush SQEs + wait
                       uint64_t timeout_ns);

Implementation: libcapos is Rust compiled to a static .a with a C ABI (#[no_mangle] extern "C"). The capnp message construction happens in Rust behind the C API. This avoids requiring a C capnp implementation.

C binaries would link against libcapos.a and use the same static userspace ELF model as Rust binaries. Startup, allocator setup, CapSet access, and ring submission should be owned by libcapos, not repeated in every C program.

Future: C++

C++ support waits on the C substrate and explicit ABI decisions: exceptions, RTTI, TLS, allocator behavior, unwind policy, static initialization, and the scope of any standard-library subset. A freestanding arena/container subset is plausible earlier than hosted C++.

The previously inspected pg83/std library remains a later experiment, not a shortcut to full C++ support. Its low-level arena/container pieces are relevant; its hosted/POSIX assumptions still require the same capOS adapter work as other C++ libraries.

Future: Go (GOOS=capos)

Go is the next high-priority runtime after regular Rust. It needs in-process threading, futex-like wait/wake, TLS/runtime metadata support, GC integration, and a network poller mapped to capOS capabilities. See Go Runtime for the dedicated plan.

Go has higher priority than C++ because it unlocks CUE and a large practical tooling/runtime ecosystem. Go via WASI may be useful for CPU-bound CUE evaluation before native Go exists, but it is not a substitute for native Go network services or full runtime behavior.

Future: Python

Python is not implemented on booted capOS. It has three plausible paths:

  1. Native CPython through a POSIX compatibility adapter. This depends on the C/libc substrate plus file, stdio, timer, networking, and process adapters. It is the likely path for trusted system scripts and Python tools that need capOS storage or networking.
  2. MicroPython through the native C substrate. This is a smaller early scripting option with less runtime surface than CPython.
  3. WASI or Emscripten-hosted Python. This is useful for sandboxed or compute-oriented Python. It still runs a Python interpreter; WebAssembly is the sandbox and host ABI, not a way to avoid Python runtime work.

As of this review, upstream CPython support helps only the WebAssembly path: PEP 11 lists wasm32-unknown-wasip1 as Tier 2 and wasm32-unknown-emscripten as Tier 3, and PEP 776 records Emscripten support for Python 3.14. Those facts do not provide native capOS bindings for files, sockets, threads, process launch, or capabilities.

Future: Lua

Lua is a future capability-scoped scripting runner. The dedicated Lua Scripting proposal defines capos-lua as an ordinary userspace process with exact grants, curated standard libraries, unforgeable capability userdata, and no raw CapIds exposed to scripts. Upstream PUC Lua is a C implementation, so the native path waits on the C/libcapos substrate unless the project uses a pure-Rust Lua-like VM as a bootstrap proof.

Future: JavaScript / TypeScript

JavaScript support means running an engine as an ordinary capOS process. A small QuickJS-style native runner is the likely first experiment after C support. V8 or SpiderMonkey are much larger C++ runtime ports. TypeScript is normally compiled before execution and should not imply a kernel or base-system TypeScript compiler.

Partially landed: WASI and WebAssembly

The WASI host adapter Phase W.4 closed 2026-05-07 20:09 UTC (docs/proposals/wasi-host-adapter-proposal.md, docs/proposals/wasi-host-adapter-proposal.md). Languages that compile to WASI Preview 1 can now run on capOS through the wasm-host process (capos-wasm/, vendored wasmi 1.0.9), with imports backed by granted capOS capabilities. The current Preview 1 surface covers stdout/stderr writes, manifest-granted argv, bounded manifest-granted environment entries through initConfig.init.wasiEnv, monotonic clock time/resolution, no-op sched_yield, stdio fd metadata, stdio seek refusal as ERRNO_SPIPE, clean shutdown, and random_get when the manifest grants EntropySource. The regression smokes are make run-wasi-hello-rust (Rust wasm32-wasip1 payload), make run-wasi-hello-c (C wasm32-wasi payload), make run-wasi-cli-args, make run-wasi-env, make run-wasi-random, make run-wasi-random-ungranted, and make run-wasi-stdio-fd. Filesystem (W.5), sockets (W.6), and Preview 2 / Component Model (W.7+) remain future phases; make run-wasi-preview1-refusals keeps proving representative blocked storage/socket imports return ERRNO_NOSYS = 52 without authority.

Important distinction: WASI works differently for compiled vs. interpreted languages:

  • Compiled languages (Rust, C) compile directly to .wasm — no interpreter in the loop. WASI is a clean, efficient execution path.
  • Interpreted languages (Python, JS, Lua) still need their interpreter (CPython, QuickJS, etc.) — it’s just compiled to .wasm instead of native code. The stack becomes: script → interpreter.wasm → WASI runtime → kernel. You pay for a wasm sandbox layer on top of the interpreter you’d need anyway.

For interpreted languages, WASI sandboxing is valuable when running untrusted plugins or user-submitted scripts. For trusted system scripts, native CPython, QuickJS, or Lua over a POSIX or capability-native adapter may be simpler and faster once the native C substrate exists.

Future: Managed Runtimes

Languages with large managed runtimes such as Java and .NET need their runtime ported or a WASI-style host path. This is large effort and low priority.

Part 4: POSIX Compatibility Adapter

Status note: the full design lives in POSIX Adapter proposal and the implementation decomposition in POSIX Adapter, which are the canonical source for phase status. Phases P1.1 (libcapos C-substrate v0 + C hello smoke, closed 2026-05-05 13:28 UTC), P1.2 Phase A (UDP cap surface + capos-rt UdpSocketClient, closed 2026-05-05 18:02 UTC), P1.2 Phase B (kernel UDP path, libcapos-posix crate, dns.c vendoring, demo + manifest, closed 2026-05-05 21:21 UTC), and P1.3 (Pipe cap + recording-shim fork-for-exec + posix_spawn successor, closed 2026-05-07 09:55 UTC) have landed. The remaining open phase is the dash port successor (Task 4). The Namespace + File cap surface from Storage and Naming proposal has landed far enough for the v0 smoke; current POSIX-adapter work is now dash vendoring/patching, the multi-translation-unit C build, and the run-posix-shell-smoke harness. The signal/time stub slice is closed by make run-posix-signal-time. The sketch below remains for context; the dedicated proposal and plan are the source of truth for FdTable shape, supported-function matrix, and open questions.

Why POSIX at All?

capOS is not POSIX and doesn’t want to be. But:

  1. Existing software. Most useful software assumes POSIX. A DNS resolver, an HTTP server, a database – all speak open()/read()/write()/socket(). Without an adapter, every piece of software must be rewritten.

  2. Developer familiarity. Programmers know POSIX. A compatibility adapter lowers the barrier to writing capOS software, even if native caps are better.

  3. Gradual migration. Port software first with POSIX-shaped APIs, then incrementally convert to native capabilities for tighter sandboxing.

The goal is not full POSIX compliance. It is a pragmatic adapter that maps selected POSIX concepts to capabilities so existing software can run with bounded modification while preserving capability-based authority.

Architecture: libcapos-posix

Application (C/Rust, uses POSIX APIs)
  │
  │  open(), read(), write(), socket(), ...
  │
  v
libcapos-posix (POSIX-to-capability adapter)
  │
  │  Maps fds to caps, paths to granted directory/namespace lookups
  │
  v
libcapos (native capability invocation)
  │
  │  SQ/CQ ring + cap_enter syscall
  │
  v
Kernel (capability dispatch)

libcapos-posix is a static library that provides POSIX-like function signatures over granted capabilities. It is not an authority source and should not be described as “Linux compatibility.” A process without file/directory authority cannot open files; a process without socket authority cannot create sockets; a process without launcher or spawner authority cannot create children.

Current v0 surface (shipped as libcapos-posix.a alongside libcapos.a; see libcapos-posix/ and the canonical POSIX Adapter proposal):

  • Static-array fd table with a 32-fd cap (P1.2 Phase A decision §5).
  • Single-thread __errno_location() TLS cell (P1.2 Phase A decision §4).
  • socket(AF_INET, SOCK_DGRAM, 0) / sendto / recvfrom / close over the kernel UdpSocket capability (P1.2 Phase B).
  • pipe / read / write / dup / dup2 / close over the kernel Pipe capability via ProcessSpawner.createPipe (P1.3).
  • fork / execve / waitpid / _exit / posix_inherit_stdio via the recording-shim ProcessSpawner.spawn Move-grant path (P1.3 §6 decision: Variant A). fork() returns 0 unconditionally and opens a TLS recording window; dup2() / close() between fork and execve record into the window; execve() drains the recording into stdio_<N> spawn grants and returns the synthetic child pid (a deliberate v0 deviation from POSIX).
  • Direct posix_spawn / posix_spawn_file_actions_init / _destroy / _adddup2 / _addclose over the same Move-grant action-replay code path; argv / envp are accepted but ignored until a LaunchParameters surface lands.
  • open / read / write / close / lseek over the bootstrap root Directory and minted File caps; opendir / readdir / closedir over minted Directory caps.
  • Console/Terminal stdio adoption, focused printf / string / ctype helpers, manifest-backed getenv / setenv / putenv / unsetenv, and single-identity getpid / getuid / getgid stubs.
  • clock_gettime(CLOCK_MONOTONIC, ...) / gettimeofday(&tv, NULL) / time / nanosleep / sleep over the kernel Timer capability.
  • signal / sigaction store handlers without delivery; kill and raise fail closed until typed process-control authority exists.

C headers ship under libcapos-posix/include/capos/posix/ (errno.h, dirent.h, fcntl.h, signal.h, spawn.h, stdio.h, stdlib.h, string.h, sys/socket.h, sys/wait.h, time.h, unistd.h, and focused subsets such as ctype.h). libcapos-posix reuses libcapos’s installed Runtime through the renamed extern crate libcapos_::runtime::with(...) to avoid colliding with libcapos’s C-side capos_* exports.

Not yet implemented for the dash-port successor: file metadata/remove calls such as stat / fstat / access / unlink, TCP socket wrappers, select / poll / epoll, real asynchronous signal delivery, job control, chdir / cwd-relative path resolution, and broad FILE * stream semantics. These remain on the dash port successor track (Task 4 of docs/proposals/posix-adapter-proposal.md) or later typed-authority work.

File Descriptor Table

POSIX programs think in file descriptors. capOS has capabilities. The implemented v0 translation is a fixed 32-slot per-process fd table inside libcapos-posix. Slots may be backed by Console, UDP socket, Pipe, File, Directory, TerminalSession, or a moved-out sentinel used by the recording-shim execve() path.

Fd 0/1/2 are initialized only from explicit authority:

  • stdio_<N> Pipe grants seeded by a parent spawn action take precedence.
  • A bootstrap TerminalSession cap may adopt empty stdio slots when the program calls posix_inherit_stdio().
  • A bootstrap Console cap fills empty fd 1 and fd 2 for simple smokes.
  • Fd 0 stays closed unless the process received pipe or terminal input authority.

Path Resolution

POSIX open("/etc/config.toml", O_RDONLY) becomes:

  1. libcapos-posix looks up the bootstrap-granted root Directory cap named root.
  2. It rejects relative paths, .., and non-UTF-8 or oversized path segments.
  3. It walks intermediate components with Directory.sub().
  4. It opens the leaf with Directory.open() or Directory.sub().
  5. It installs a File or Directory fd slot with per-fd position / iteration state.

The future Namespace + Store resolver remains documented in the POSIX adapter proposal, but the shipped v0 dash-port proof uses the RAM-backed root Directory capability because that is the implemented kernel authority.

Supported POSIX Functions

Grouped by what capability backs them:

Console cap -> stdio:

POSIXcapOS translation
write(1, buf, len)console.write(buf[..len])
write(2, buf, len)console.write(buf[..len]) (or log cap)
read(0, buf, len)Pipe or TerminalSession-backed stdin when granted

Directory + File caps -> file I/O:

POSIXcapOS translation
open(path, flags)root Directory walk -> Directory.open() -> fd
read(fd, buf, len)File.read(offset, len) using per-fd position
write(fd, buf, len)File.write(offset, bytes) using per-fd position
close(fd)drop/release the backing cap slot
lseek(fd, off, whence)update per-fd file position
opendir/readdir/closedirDirectory.list() plus per-fd iteration

Pipe + ProcessSpawner caps -> subprocess I/O:

POSIXcapOS translation
pipe(fds)ProcessSpawner.createPipe() -> two Pipe-backed fds
fork() + execve()recording shim -> ProcessSpawner.spawn()
posix_spawn()direct action replay -> ProcessSpawner.spawn()
waitpid(pid, &status, 0)ProcessHandle.wait()

UdpSocket caps -> networking:

POSIXcapOS translation
socket(AF_INET, SOCK_DGRAM, 0)NetworkManager.createUdpSocket() -> fd
sendto / recvfromUdpSocket.sendTo() / UdpSocket.recvFrom()
close(fd)release the owned UdpSocket cap

Timer + local stubs:

POSIXcapOS translation
clock_gettime / gettimeofday / timeTimer.now()
nanosleep / sleepTimer.sleep()
signal / sigactionstore handler locally, never deliver
kill / raisevalidate signal number, then fail closed

Not supported or still partial:

POSIXWhy not
bare fork() state cloningNo address space cloning; only fork-for-exec is recorded
in-place exec() replacementSpawn creates a fresh process
real signal delivery / job controlNeeds typed process-control and terminal authority
chmod/chownNo permission bits. Authority is structural
mmap(MAP_SHARED)No shared memory yet (future: SharedMemory cap)
ioctlNo device files. Use typed capability methods
ptraceNo debugging interface yet
select/poll/epollRequires async cap invocation (Stage 5+). Initial version is blocking only

Process Creation Compatibility

capOS process creation is spawn-style, not fork/exec-style. A new process is a fresh ELF instance selected by ProcessSpawner, with an explicit initial CapSet assembled from granted capabilities. The parent address space is not cloned, and an existing process image is not replaced in place.

posix_spawn() is the compatibility primitive for subprocess creation. libcapos-posix (P1.3, closed 2026-05-07 09:55 UTC) maps it to ProcessSpawner.spawn(), translates posix_spawn_file_actions into fd-table setup and Move-grant stdio_<N> capability grants on the spawn ABI. argv / envp are accepted but ignored until a LaunchParameters surface lands. make run-posix-spawn-smoke is the end-to-end proof.

Full fork() is intentionally not a native kernel primitive. Supporting it would require copy-on-write address-space cloning, parent/child register return semantics, fd-table duplication, a per-capability inheritance policy, safe handling for outstanding SQEs/CQEs, and defined behavior for endpoint calls, timers, waits, and process handles that are in flight at the fork point. Threaded POSIX processes add another constraint: only the calling thread is cloned, while locks and async-signal-safe state must remain coherent in the child.

P1.3 also shipped a narrow recording-shim fork() for the common fork-for-exec pattern that does not require general address-space cloning. fork() returns 0 unconditionally and opens a TLS recording window; dup2() / close() between fork and execve record into the window without mutating the parent fd table; execve() drains the recording into Move-grant stdio_<N> spawn grants and returns the synthetic child pid as its own return value. The pseudo-child branch is still the parent process, so a failed execve() MUST NOT call _exit() – it must surface the error to the parent’s normal error path. The user pattern is pid_t child = fork(); if (child == 0) { dup2(); close(); child = execve(...); } /* parent flow */. Earlier iterations used x86_64 setjmp/longjmp to fake fork-return-twice; that was replaced because longjmp back into fork()’s already- returned stack frame was undefined behaviour. make run-posix-pipe-smoke is the end-to-end proof.

make run-posix-dns-smoke exercises socket(AF_INET, SOCK_DGRAM, 0) / sendto / recvfrom against the kernel UdpSocket capability through a hand-rolled DNS A query in demos/posix-dns-resolver/. The current smoke does not compile the vendored dns.c whole because the v0 libcapos-posix POSIX surface is narrower than dns.c expects (poll.h, netinet/in.h, arpa/inet.h, netdb.h, sys/select.h, sys/un.h); widening that surface is follow-on work on the dash port track.

Security Model

The POSIX compatibility adapter does not weaken capability security. Every POSIX call translates to a capability invocation on caps the process was actually granted:

  • open("/etc/passwd") fails if the process lacks a bootstrap root Directory cap or that directory tree does not contain etc/passwd – not because of permission bits, but because no granted authority resolves the path.
  • socket(AF_INET, SOCK_DGRAM, 0) fails if the process was not granted a NetworkManager cap; TCP stream wrappers remain future work.
  • fork() only opens the recording window for the supported fork-for-exec pattern; bare address-space cloning remains unsupported.

A POSIX binary on capOS is more constrained than on Linux, not less. The compatibility adapter provides familiar function signatures, not familiar authority.

Building POSIX-Compatible Binaries

my-app/
  Cargo.toml        # depends on capos-posix (which depends on capos-rt)
  src/main.rs       # uses libc-style APIs

Or for C:

#include <capos/posix/fcntl.h>       // open, O_RDONLY
#include <capos/posix/sys/socket.h>  // socket, sendto, recvfrom
#include <capos/posix/unistd.h>      // read, write, close

int main() {
    // Works -- stdout is mapped to Console cap
    write(1, "hello\n", 6);

    // Works -- if the process was granted a root Directory cap
    int fd = open("/config.toml", O_RDONLY);
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);

    // Works -- if NetworkManager cap was granted; TCP is not in v0
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    close(sock);
}

The linker pulls in libcapos-posix.a -> libcapos.a -> startup code. Same ELF output, same kernel loader.

musl as a Base (Optional, Later)

For broader C compatibility (printf, string functions, math), libcapos-posix can be layered under musl libc. musl has a clean syscall interface – all system calls go through a single __syscall() function. Replacing that function with capability-based dispatch gives you full libc on top of capOS capabilities:

// musl's syscall entry point -- we replace this
long __syscall(long n, ...) {
    switch (n) {
        case SYS_write: return capos_write(fd, buf, len);
        case SYS_open:  return capos_open(path, flags, mode);
        case SYS_socket: return capos_socket(domain, type, protocol);
        // ...
        default: return -ENOSYS;
    }
}

This is the same approach Fuchsia uses with fdio + musl, and Redox OS uses with relibc. It works and it gives you printf, fopen, getaddrinfo, and most of the C standard library.

Priority: after native capos-rt and libcapos are stable. musl integration is a significant engineering effort and should only be done when there’s actual software to port.

Part 5: WASI Host Adapter

Note: the full design lives in WASI Host Adapter proposal and the implementation decomposition in WASI Host Adapter. The sketch below remains for context; the dedicated proposal is the source of truth for runtime selection (wasmi for v0; wasmtime / WAMR as W.7+ migration), capability-mapping surface, per-instance CapSet plumbing, phase decomposition, and open questions.

Why WASI Fits capOS Better Than POSIX

WASI (WebAssembly System Interface) was designed from the start as a capability-based system interface. Its concepts map almost directly to capOS:

WASI conceptcapOS equivalent
fd (pre-opened directory)Namespace cap
fd (socket)TcpSocket/UdpSocket cap
fd_write on stdoutConsole.write()
Pre-opened dirs at startupCapSet at spawn
No ambient filesystem accessNo ambient authority
path_open scoped to pre-opened dirnamespace.resolve() scoped to granted prefix

WASI programs already assume they get no ambient authority. A WASI binary compiled for capOS still needs a host adapter, but the security model is closer to capOS than POSIX because preopened handles are explicit.

Architecture: Wasm Runtime as a capOS Service

WASI binary (.wasm)
  │
  │  WASI syscalls (fd_read, fd_write, path_open, ...)
  │
  v
wasm-runtime process (Wasmtime/wasm-micro-runtime, native capOS binary)
  │
  │  Translates WASI calls to capability invocations
  │  Each wasm instance gets its own CapSet
  │
  v
libcapos (native capability invocation)
  │
  v
Kernel

The wasm runtime is itself a native capOS process. It receives caps from its parent and partitions them among the wasm modules it hosts. This gives you:

  • Language independence. Any language with a useful WASI target can be evaluated through the same host adapter.
  • Extra sandboxing. Wasm memory isolation combines with capOS capability scoping.
  • Less porting effort for software that already targets WASI, assuming its required imports are implemented by the host adapter.
  • Density. Multiple wasm modules in one process, each with different caps

WASI vs Native Performance

Wasm adds overhead: bounds-checked memory, indirect calls, and host-call marshalling. For foundational system services, native Rust remains the default choice until there is a concrete reason to choose otherwise. For application code and portable tools, the sandboxing and reuse may be worth the overhead.

WASI Implementation Phases

The current shipped state is owned by WASI Host Adapter and WASI Host Adapter proposal; the phase status summary below is a pointer, not the source of truth.

Phase W.0 (planning, closed): runtime decision recorded as wasmi for v0; WAMR / wasmtime are W.7+ migration candidates. The earlier “wasm-micro-runtime as a C binary via libcapos” sketch is superseded by wasmi-as-a-Rust-crate inside the standalone capos-wasm/ package. Cross-cutting Open Questions §1 (per-instance vs per-process) and §3 (poll_oneoff semantics over the capOS ring) resolved 2026-05-13 16:46 UTC: one wasm instance per capos-wasm process, and poll_oneoff stays ERRNO_NOSYS in v0 with subscription kinds extended one at a time through W.5/W.6 against a single blocking cap_enter.

Phase W.1 (host scaffold, closed 2026-05-05 19:12 UTC): capos-wasm/ standalone userspace crate over vendored wasmi 1.0.9 (vendor/wasmi-no_std/wasmi-1.0.9/); make capos-wasm-build.

Phase W.2 (Preview 1 stdout-only, closed 2026-05-07 10:53 UTC): wasm-host userspace binary, empty-instantiation smoke (make run-wasm-host), Preview 1 stdout-only import resolver (args_get / environ_get empty, clock_time_get(MONOTONIC), proc_exit, fd_write(1, …) / fd_write(2, …); everything else including random_get returns ERRNO_NOSYS), manifest-payload load path through an optional BootPackage cap, Rust hello, wasi (make run-wasi-hello-rust), and C hello, wasi (make run-wasi-hello-c).

Phase W.3 (per-instance argv grant, closed 2026-05-07 18:25 UTC): bounded initConfig.init.wasiArgs text grant on top of the existing manifest CapSet, validated against WASI_ARGS_MAX_COUNT = 32, WASI_ARGS_MAX_ARG_BYTES = 4096, and WASI_ARGS_MAX_TOTAL_BYTES = 8192. The wasm-host installs the bundle on HostState before instantiation, and Preview 1 args_get / args_sizes_get reflect it. make run-wasi-cli-args is the end-to-end proof. A 2026-05-13 follow-up adds the same bounded-text shape for initConfig.init.wasiEnv (WASI_ENV_MAX_COUNT = 32, WASI_ENV_MAX_ENTRY_BYTES = 4096, WASI_ENV_MAX_TOTAL_BYTES = 8192) with make run-wasi-env and make wasi-env-negative-check.

Phase W.4 (random_get production + clocks production-ready, closed 2026-05-07 20:09 UTC): Preview 1 random_get routed through the kernel EntropySource cap when the manifest grants it, chunked at the cap’s MAX_ENTROPY_FILL_BYTES = 64 ceiling and capped per Preview 1 invocation at RANDOM_GET_MAX_BYTES = 65_536 bytes; ungranted variant refuses with ERRNO_NOSYS = 52. make run-wasi-random and make run-wasi-random-ungranted are the granted/ungranted proofs. clock_time_get(CLOCKID_REALTIME) keeps returning ERRNO_NOSYS until a typed WallClock cap exists. A 2026-05-13 compatibility-import slice promotes authority-free Preview 1 imports (clock_res_get(MONOTONIC), sched_yield, stdio fd_fdstat_get metadata, stdio fd_seek returning ERRNO_SPIPE) through make run-wasi-stdio-fd. make run-wasi-preview1-refusals keeps representative blocked storage and socket imports failed closed with ERRNO_NOSYS = 52.

Phase W.5 (filesystem against Namespace / File / Store, blocked): waits on the storage cap surface from Storage and Naming proposal. Until then, make run-wasi-preview1-refusals is the refusal evidence.

Phase W.6 (sockets against TcpSocket / UdpSocket, blocked): waits on a userspace network stack process (or an interim Fetch / HttpEndpoint shim) from Networking proposal. Same refusal evidence as W.5 in the interim.

Phase W.7 (Preview 2 / Component Model + wasmtime migration, blocked): waits on the std-userspace decision (same blocker as the capnp-rpc remote-session rewrite). When it lands, WIT resources map to typed OwnedCapability<T> slots in the host adapter and the schema gains the Component Model resource bridging variants.

Phase W.8 (TinyGo / Go-on-WASI CUE evaluator, blocked): waits on the same std-userspace decision; native GOOS=capos remains the path for full Go runtime semantics.

Part 6: Putting It All Together – Porting Strategy

Spectrum of Integration

Most native                                              Most compatible
     |                                                          |
     v                                                          v
Native Rust    C with libcapos    POSIX adapter         WASI binary
(capos-rt)     (typed caps)       (libcapos-posix)      (wasm runtime)

- Best perf     - Good perf        - Familiar API        - Any language
- Full cap      - Full cap         - Auto sandboxing     - Auto sandboxing
  control         control            via cap scoping       via wasm + caps
- Most work     - Moderate work    - Less rewrite        - Less rewrite
  to write        to write           for existing C        for WASI targets

Example: Porting a DNS Resolver

Native Rust: Rewrite using capos-rt. Receives UdpSocket cap, serves DNS lookups as a DnsResolver capability. Other processes get a DnsResolver cap instead of calling getaddrinfo(). Clean, typed, minimal authority.

C with POSIX adapter: Take an existing DNS resolver (e.g., musl’s getaddrinfo implementation or a standalone resolver). Compile against libcapos-posix. Give it a UdpSocket cap and a Namespace cap for /etc/resolv.conf. It calls socket(), sendto(), recvfrom() – all translated to cap invocations. Works with minimal changes, but can’t export a typed DnsResolver cap (it speaks POSIX, not caps).

WASI: Compile a Rust DNS resolver to WASI. Run it in the wasm runtime. Same capability scoping, but through the wasm sandbox.

  1. Foundational services: native Rust by default. Drivers, network stack, store, and init are the foundation and should use capabilities natively unless a concrete reviewed reason justifies another runtime.

  2. First applications: native Rust. While the ecosystem is young, applications should use capos-rt directly. This validates the cap model.

  3. C compatibility: when porting specific software. Do not build the POSIX adapter speculatively. Build it when there is a specific C program to port (e.g., a DNS resolver, an HTTP server, a database). Let real porting needs drive which POSIX functions to implement.

  4. WASI: as the general-purpose application runtime. Once the native runtime is stable, the wasm runtime becomes the “run anything” answer. Lower priority than native Rust, but higher priority than full POSIX/musl compat, because WASI’s capability model is a natural fit.

Part 7: Schema Extensions

New schema types needed for the userspace runtime:

# Extend schema/capos.capnp

struct InitialCaps {
    entries @0 :List(InitialCapEntry);
}

struct InitialCapEntry {
    name @0 :Text;
    id @1 :UInt32;
    interfaceId @2 :UInt64;
}

interface ProcessSpawner {
    spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}

struct CapGrant {
    name @0 :Text;
    capId @1 :UInt32;
    interfaceId @2 :UInt64;
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
}

These definitions now live in schema/capos.capnp as the single source of truth. spawn() returns the ProcessHandle through the ring result-cap list; handleIndex identifies that transferred cap in the completion. The first slice passes a boot-package binaryName instead of raw ELF bytes so spawn requests stay inside the bounded ring parameter buffer; manifest-byte exposure and bulk-buffer spawning remain later work. kill, post-spawn grants, and exported-cap lookup are deferred until their lifecycle semantics are implemented.

Implementation Status And Future Phases

Implemented Baseline: capos-rt

  • capos-rt/ exists as a standalone no_std + alloc runtime crate.
  • capos-rt owns _start, heap initialization, panic output, raw syscall wrappers, bootstrap validation, CapSet parsing, the entry-point macro, the single-owner ring client, typed clients, result-cap adoption, and owned handle release.
  • init/, shell/, demos/, and the runtime smoke binary build for targets/x86_64-unknown-capos.json.
  • QEMU proofs cover typed Console calls, exception decoding, spawn/wait, runtime VirtualMemory, Timer, ThreadControl, ThreadSpawner, ThreadHandle, terminal sessions, and release behavior.

Deliverable: completed. See Userspace Runtime and Programming Languages for current validation.

Future Phase: broader generated/native clients

  • Add generated clients after the schema surface stabilizes.
  • Preserve the existing split where capos-rt owns transport lifetime and interface-specific wrappers own message encoding.
  • Establish the out-of-tree service-binary packaging pattern once the internal userspace target contract is stable.

Deliverable: ordinary native capOS services can depend on generated typed clients without copying runtime transport logic.

libcapos for C – Phase 0 closed

  • extern "C" API exposing capos_cap_call, capos_capset_get, capos_sys_exit, capos_sys_cap_enter, capos_console_write_line, capos_timer_now, capos_entropy_fill, capos_virtual_memory_*, capos_process_spawner_create_pipe, capos_pipe_read, capos_pipe_write, capos_pipe_close, and malloc/free/calloc/ realloc heap shims over the capos-rt global allocator.
  • Public header at libcapos/include/capos/capos.h.
  • Build system: make libcapos produces libcapos/target/x86_64-unknown-capos/release/libcapos.a; make c-hello and make c-pipe link native C smokes with clang + lld using the shared demos/linker.ld.
  • C “hello world” smoke at demos/c-hello/main.c calls Console.writeLine through capos_console_write_line, exercises Timer, EntropySource, and VirtualMemory typed wrappers, verifies capos_cap_call rejects a bootstrap ThreadSpawner cap locally, and exits cleanly. make run-c-hello boots system-c-hello.cue and the smoke greps for the [c-hello] hello from c-hello, entropy, VM, and ThreadSpawner rejection markers plus the kernel process N exited with code 0 line.
  • Native C pipe smoke at demos/c-pipe/main.c uses capos_process_spawner_create_pipe, writes and reads native-c-pipe-marker through typed Pipe wrappers, closes the write end, observes EOF, and exits cleanly. make run-c-pipe boots system-c-pipe.cue and checks the create, read, EOF, and clean-exit markers.

Deliverable: complete – C binary boots, calls Console.writeLine, and exits cleanly through capos_sys_exit.

Deferred to later libcapos phases: generated typed wrappers per interface, transferred result-cap propagation across the C ABI, per-thread routing of the runtime ring, and a libcapos-posix layer.

Future Phase: POSIX compatibility adapter

  • Implement FdTable and path resolution
  • Start with file I/O (open/read/write/close over Namespace + Store)
  • Add socket wrappers when networking is userspace
  • Optionally integrate musl for full libc

Deliverable: an existing C program (e.g., a simple HTTP server) runs on capOS with minimal source changes.

WASI runtime (partially landed)

The WASI host adapter is its own track owned by docs/proposals/wasi-host-adapter-proposal.md and docs/proposals/wasi-host-adapter-proposal.md. Phase decomposition:

  • W.1 (host scaffold; landed 2026-05-05 19:12 UTC): capos-wasm/ standalone crate over vendored wasmi 1.0.9 (vendor/wasmi-no_std/wasmi-1.0.9/), make capos-wasm-build.
  • W.2 (Preview 1 stdout-only; closed 2026-05-07 10:53 UTC): wasm-host userspace binary, make run-wasm-host empty-instantiation smoke, Preview 1 stdout-only import resolver, manifest-payload load path, Rust hello, wasi smoke (make run-wasi-hello-rust), and C hello, wasi smoke (make run-wasi-hello-c). Capabilities backing the host imports today: Console + Timer + BootPackage. v0 chose wasmi-as-Rust-crate over wasm-micro-runtime-as-C-binary; wasmtime / WAMR remain W.7+ migration candidates.
  • W.3 (per-instance CapSet plumbing + LaunchParameters) closed 2026-05-07 18:25 UTC.
  • W.4 (random_get against the in-tree EntropySource cap, plus clocks production-ready) closed 2026-05-07 20:09 UTC.
  • 2026-05-13 compatibility/refusal smokes: make run-wasi-stdio-fd proves promoted authority-free imports no longer return ERRNO_NOSYS; make run-wasi-preview1-refusals keeps storage and socket imports failed closed without authority.
  • W.5 (filesystem against Namespace/File/Store), W.6 (sockets against TcpSocket/UdpSocket), and W.7+ (Preview 2 / Component Model) remain future phases.

Deliverable status: hello.wasm runs on capOS today (both Rust and C payloads), argv and entropy grants are implemented, and authority-free stdio fd compatibility imports are covered by a direct smoke. Filesystem/socket phases are queued behind their authority surfaces.

Open Questions

  1. Allocator strategy. Should the userspace heap be a fixed-size region (simple, but limits memory), or should it grow by invoking a FrameAllocator cap (flexible, but every allocation might syscall)? Likely answer: fixed initial region + grow-on-demand via cap.

  2. Async I/O. The SQ/CQ ring is inherently asynchronous (submit SQEs, poll CQEs), but the initial capos-rt wrappers provide blocking convenience (submit one CALL SQE + cap_enter(1, MAX)). Real services need batched async patterns. Options:

    • Submit multiple SQEs, poll CQEs in an event loop (io_uring style)
    • Runtime green threads or tasks multiplexed through one ring dispatcher; the 7.1 threading contract keeps at most one blocked cap_enter waiter per process ring until a sharded or per-thread ring ABI exists
    • Userspace executor (like tokio) driving the ring
  3. Cap passing in the POSIX adapter. POSIX has SCM_RIGHTS for passing fds over Unix sockets. Should the POSIX adapter support something similar for passing caps? Or is this native-only?

  4. Dynamic linking. Currently all binaries are statically linked. Should capOS support shared libraries? Probably not initially – static linking is simpler and the binaries are small. Revisit if binary size becomes a concern.

  5. WASI component model integration. WASI preview 2 components have typed imports/exports that could map to capnp interfaces. Should the wasm runtime auto-generate capnp-to-WIT adapters from schemas? This would let wasm components participate natively in the capability graph.

  6. Build system. How are userspace binaries packed into the boot image? Currently the Makefile builds init/ separately. With multiple service binaries, need a more scalable approach (build manifest that lists all binaries, Makefile target that builds and packs them all).

Relationship to Other Proposals

  • Service architecture proposal – defines what services exist and how they compose. This proposal defines how those service binaries are built, what runtime they use, and how non-Rust software fits in.
  • Storage and naming proposal – the POSIX open()/read()/write() translation targets the Store and Namespace caps defined there.
  • Networking proposal – the POSIX socket translation targets the TcpSocket/UdpSocket caps from the network stack.

Proposal: Native Shell and POSIX Shell

How interactive operation should work on capOS without reintroducing ambient authority through a Unix-like command line.

Problem

capOS deliberately avoids global paths, inherited file descriptors, ambient network access, and process-wide privilege bits. A conventional shell assumes all of those. If capOS copied a Unix shell model directly, the shell would either be mostly useless or become an ambiently privileged escape hatch around the capability model.

The system needs two related, but distinct, shell layers:

  • Native shell: schema-aware capability REPL and scripting language.
  • POSIX shell: compatibility personality for existing programs and scripts.

Both must be ordinary userspace processes. Neither should receive special kernel privilege. The kernel and trusted capability-serving processes remain the enforcement boundary.

Model-driven interaction on top of the native shell is a separate concern and is defined in Language Models and Agent Runtime. The model runs as its own service with no session authority; the native shell (in “agent mode”) is the runner: it holds the session caps, exposes them to the model as typed tool descriptors with per-tool permission modes, executes tool calls on behalf of the model, streams results back, and keeps the user in the loop.

The first boot-to-shell milestone is text-only: local console login/setup and, later in the same family, a browser-hosted terminal gateway. Graphical shells, desktop UI, compositors, and GUI app launchers are a later tier. See Boot to Shell.

Design Principles

  • A shell starts with only the capabilities it was granted.
  • A shell command compiles to typed capability calls, not stringly syscalls.
  • Child processes receive explicit grants. There is no implicit inheritance of the shell’s full authority.
  • Elevation is a capability request mediated by a trusted broker, not a flag inside the shell.
  • Shell startup is a workload launch from a UserSession, service principal, or recovery profile. Session metadata informs policy and audit; it is not authority.
  • Default interactive cap sets are broker-issued session bundles, not hard-coded shell privileges.
  • POSIX behavior is an adapter over scoped Directory, File, socket factory, and process capabilities. It is not the native authority model.

User identity and policy sit above this shell model. A shell session may be associated with a human, service, guest, anonymous, or pseudonymous principal, but the session’s capabilities remain the authority. RBAC, ABAC, and mandatory policy decide which scoped caps a broker may grant; they do not create a kernel-side uid, role bit, or label check on ordinary capability calls. See User Identity and Policy.

Federated sessions (OIDC-authenticated principals, service accounts using OAuth2 workload identity) are one input shape for this model. OAuth scopes and OIDC claims from a session’s issuer feed AuthorityBroker as ABAC attributes. They never authorize capability calls directly, and raw bearer tokens never appear in shell state. The token-typed capabilities, OAuthClient, OidcIdentityProvider, and the broker-side token handling are defined in OIDC and OAuth2.

Layering

flowchart TD
    Input[Login, guest, anonymous, or service request] --> SessionMgr[SessionManager]
    SessionMgr --> Session[UserSession metadata cap]
    Session --> Broker[AuthorityBroker / PolicyEngine]
    Broker --> Bundle[Scoped session cap bundle]

    Bundle --> Native[Native shell]
    Bundle --> Posix[POSIX shell]

    Posix --> Compat[POSIX compatibility runtime]

    Native --> Ring[capos-rt capability transport]
    Compat --> Ring
    Ring --> Kernel[Kernel cap ring]
    Ring --> Services[Userspace services]

    Native --> Approval[Approval client cap]
    Approval --> Broker
    Broker --> Services
    Broker --> Audit[AuditLog]

The native shell is the primitive interactive surface. The POSIX shell is a compatibility consumer of capOS capabilities, not the model other shells are built on. A language-model service, when present, is invoked through a LanguageModel cap from the native shell running in “agent mode”; the shell is the tool runner, not the model. That flow is defined in Language Models and Agent Runtime and is not expanded in this diagram.

A shell may display a principal name, profile, role set, label, or POSIX UID, but those values are descriptive unless a trusted broker uses them to return a specific capability. Losing a home, logs, launcher, or approval cap cannot be repaired by presenting the same session ID back to the kernel.

Native Shell

The native shell is a typed capability graph operator. Its job is to inspect, invoke, pass, attenuate, release, and trace capabilities.

Current implementation status as of 2026-05-16 21:36 UTC: capos-shell is the standalone no_std crate at shell/ and ships the anonymous-first interactive flow. Focused shell/login manifests still launch it directly as initConfig.init; the default make run manifest now runs it as an init-started service under standalone init, together with the chat / adventure binaries and the remote-session CapSet gateway. On boot the shell mints an anonymous UserSession via SessionManager.anonymous() and receives an empty-allowlist anonymous bundle from AuthorityBroker. login and setup commands use CredentialStore/SessionManager/AuthorityBroker to verify or create the password, mint an operator session, request the operator shell bundle, and swap session/launcher in place. Login prompts for a username as well as a password through a username-aware SessionManager.login() request that carries method, selector, proof, and source metadata. A guest command mints a guest session via SessionManager.guest() and swaps to a broker-issued guest bundle (guest sessions require an explicit manifest seed; no broad authority is granted to guest profiles). Shell exit calls UserSession.logout() to clean up the session context. The default make run manifest includes the native shell, chat/adventure binaries, terminal, console, stdio, chat, adventure, creds, sessions, audit, broker, and system_info caps; its MOTD shows the concrete spawn / run commands for the adventure demo. The current command set is help, caps, binaries, motd, inspect <name>, session, login, setup, guest, spawn, blocking run, wait, and exit, with a launcher-backed binaries command that lists binaries available to the current session (anonymous and guest launcher policies return an empty list). The session-scoped TerminalSession substrate now exists behind make run-terminal, and the bounded SSH terminal-host proof can launch capos-shell over a socket-backed TerminalSession with a public-key UserSession through RestrictedShellLauncher. The generic call @cap.method(...) REPL, schema reflection, richer daily shell profiles, and the full OpenSSH gateway remain future work.

Example init or development session with explicit spawn authority:

capos:init> caps
log        Console
spawn      ProcessSpawner
boot       BootPackage
vm         VirtualMemory

capos:init> call @log.writeLine({ text: "hello" })
ok

capos:init> spawn "tls-smoke" with {
  log: @log
} -> $child
started pid 12

capos:init> wait $child
exit 0

Values

Native shell values should include:

  • @name: a named capability in the current shell context.
  • $name: a local value, result, promise, or process handle.
  • structured values: text, bytes, integers, booleans, lists, and structs.
  • result-cap values returned through the capOS transfer-result path.
  • trace values representing CQE and call-history slices.

The shell should preserve interface metadata with every capability value. A method call is valid only if the target cap exposes the method’s schema.

Commands

Initial commands should be small and explicit:

caps
binaries
inspect @log
methods @spawn
call @log.writeLine({ text: "boot complete" })
spawn "ipc-server" with { log: @log, ep: @serverEp } -> $server
wait $server
run "ipc-client" with { log: @log, ep: client @serverEp }
release @temporary
trace $server
bind scratch = @store.sub("scratch")
derive readonly = @home.sub("config").readOnly()

inspect should show the interface ID, label, transferability, revocation state when available, and callable methods. It should not imply that two caps with the same interface ID are the same authority.

The current prototype intentionally does not yet provide the generic call @cap.method(...) REPL. Until the schema registry and structured value parser exist, native-shell exposes only narrow typed commands and should make that gap visible through planning docs rather than accepting raw method IDs and opaque byte blobs.

Syntax

The syntax should be structured rather than shell-token based. A CUE-like or Cap’n-Proto-literal-like shape fits capOS better than POSIX word splitting:

spawn "net-stack" with {
  log: @log
  nic: @virtioNic
  timer: @timer
}

The shell can still provide abbreviations, but the executable representation should be an ActionPlan object with typed fields.

Composition

Native composition should pass typed caps or structured values, not inherited byte streams by default:

pipe @camera.frames()
  |> spawn "resize" with { input: $, width: 640, height: 480 }
  |> spawn "jpeg-encode" with { input: $, quality: 85 }
  |> call @photos.write({ name: "frame.jpg", data: $ })

If a byte stream is desired, it should be explicit through a ByteStream, File, or POSIX adapter capability. This keeps the “pipe” operator from silently turning every interface into untyped bytes.

Namespaces

There is no global root. A native shell may have a current Directory or Namespace capability, but that is just a default argument:

capos:user> ls @config
services
network

capos:user> cd @config.sub("services")
capos:@config/services> ls
logger
net-stack

The shell cannot traverse above a scoped directory or namespace unless it holds another capability that names that authority.

Session Context

A session-aware shell may hold a self or session cap for UserSession.info() and audit context. That cap is metadata. It can identify the principal, auth strength, expiry, quota profile, and audit identity, but it cannot widen the shell’s CapSet or authorize kernel operations by itself.

The launcher or supervisor starts the shell with a CapSet returned by AuthorityBroker(session, profile). For interactive work, that bundle should usually include scoped terminal, home, logs, launcher, status, and approval caps. For service accounts, guest sessions, anonymous workloads, and recovery mode, the broker returns different bundles under explicit policy profiles.

Shell-launched children inherit only the caps named in the spawn plan. A child may receive a UserSession or session badge for audit, per-client quotas, or service-side selection, but object access still comes from the scoped object caps passed to that child.

Interactive Command Surfaces

Application-specific interactions must stay out of the native shell command set. A chat client, adventure client, or other interactive application should run as an ordinary shell-spawned application or resident service session, not as a builtin such as chat or play adventure.

The near-term target is a prototype bridge, not the final app protocol: capos-shell launches clients with spawn or run, grants them explicit endpoint clients such as stdio: client @stdio, and services StdIO while waiting. That proves exact grants, process handles, child completion, and the terminal bridge without giving a child the shell’s move-only TerminalSession. Legacy badge N syntax is retired from normal client @... grants; delegated client endpoints preserve their service identity by default, and service object capabilities replace badged chat/adventure identity. Explicit selector fixtures remain only in low-level and hostile-path tests.

That StdIO bridge is intentionally limited. It is acceptable for focused QEMU smokes and textual compatibility, but it is the wrong long-term semantic boundary for capOS-native applications. If an adventure client receives a line from StdIO and parses go north, take key, or say hello internally, capOS has only moved string command parsing out of the shell and into the app. That is still weaker than typed capability invocation.

Native interactive applications should expose a command surface:

path=["go"], args={direction:"north"}
path=["take"], args={item:"brass-key"}
path=["say"], args={text:"hello there"}
path=["chat","join"], args={channel:"#lobby"}

The user may still type familiar command <args> forms. The shell or terminal host parses them through generic command metadata, including nested subcommands, argument kinds, completions, and redaction rules. The app receives a structured invocation and converts it to typed service calls. The shell does not hardcode application verbs, and the application does not parse unstructured terminal text for normal operations.

StdIO remains an explicit text I/O capability for transcript output, simple programs, POSIX compatibility, and test harnesses. It should not be the primary command interface for native chat/adventure-style applications. The focused design is in Interactive Command Surfaces.

Remote Session CapSet Clients

Not every remote interaction should become a shell session. A regular host application – CLI, native GUI, Tauri backend, webapp gateway, or service client – should be able to authenticate to capOS, receive a broker-issued remote view of its session CapSet, and call the capabilities it was granted over Cap’n Proto RPC. That path is a programmatic peer of the native shell: both consume a session bundle from AuthorityBroker, but only the shell adds command parsing, terminal state, and child-process workflow.

The remote client must not receive the kernel’s local CapSet page, local cap-table indexes, endpoint selectors, result-cap indexes, or global session identifiers. It receives typed RPC object references backed by a capOS per-session worker. Chat, Paperclips, Adventure, command sessions, and future service APIs should therefore be callable by generated clients without routing through capos-shell. The owning design is Remote Session CapSet Clients. That proposal also covers bidirectional UI composition for web/Tauri/GUI sessions: services can propose task-specific panes or command surfaces through explicit UI caps, but cannot take arbitrary control of the host UI.

Terminal Host Separation

The shell should not be the terminal host forever. The component that owns a UART, web socket, GUI pane, line editing, history, paste handling, resize state, and render policy can be a separate terminal host process. The shell then runs against a terminal entity and can be reused unchanged from local console, GUI, web, and scripted hosts.

TerminalSession remains the foreground text-session authority, but it is an interface between terminal host and shell, not proof that the shell implements the terminal. Shell-spawned applications should normally receive command sessions or explicit StdIO adapters, not the shell’s move-only TerminalSession.

Remote text transports follow the same rule. The Telnet Shell Demo in Networking is a demo-only plaintext terminal host: it accepts a host-loopback QEMU-forwarded TCP connection and gives the shell a socket-backed TerminalSession. The kernel-side socket terminal silently consumes IAC option negotiation in its line discipline, so no userspace pre-handoff recv is required. It must not turn the shell login path into a raw ByteStream, raw TcpSocket, or StdIO substitute, because password entry, echo policy, cancellation, and shell launch authority are defined at the TerminalSession boundary. The QEMU harness for that demo binds the host forward to 127.0.0.1:2323 only and runs caps to prove the child shell did not receive raw NetworkManager, ProcessSpawner, TCP, or unknown capability interfaces. The gateway itself remains a trusted demo bootstrap service until scoped listener and manifest-declared shell-launch grants exist; production remote CLI shell access waits for the SSH gateway layer. The SSH path is specified separately in SSH Shell Gateway: it keeps the same TerminalSession and broker-issued shell-bundle boundary, while adding SSH host authentication, encrypted transport, public-key user authentication, channel policy, and remote-session audit. Its initial schema stubs name the terminal construction and authority surfaces as SshTerminalFactory, TcpListenAuthority, and RestrictedShellLauncher; they now have focused QEMU proofs for scoped listen authority, public-key session minting, restricted shell launch, and a bounded plain-TCP terminal-host handoff. A focused development-only host-key proof grants an explicitly labeled non-production SshHostKey cap in QEMU that performs bounded fixture exchange-hash signing. The full runnable OpenSSH gateway still waits on encrypted transport, SSH packet/channel handling, persistent production key-management-backed signing, and the final run-ssh-shell host harness.

Agent Mode

Model-driven interaction is defined in Language Models and Agent Runtime. This proposal does not describe a separate “agent shell” process. The native shell, running in “agent mode”, is the tool runner: it holds the session cap bundle, exposes caps to a LanguageModel service as typed ToolDescriptor values with per-tool permission modes (auto / consent / stepUp / forbidden), executes the model’s tool calls against its own caps, streams results back into the conversation, and keeps the user in the loop through consent prompts, streaming, and interrupt. There is no separate PlannerAgent or ActionPlan pipeline.

Long-lived OpenClaw-like hosted agents, swarms, background tasks, external channel ingress, agent-maintained memory/wiki stores, and MCP/A2A-style interoperability are intentionally separate from the shell surface; see capOS-Hosted Agent Swarms. The shell can launch, inspect, approve, or cancel hosted tasks, but it should not own the hosted-agent control plane.

Approval and Authentication

Elevation belongs in a trusted broker service that the shell can consult but cannot impersonate.

Conceptual interfaces:

interface ApprovalClient {
  request @0 (
    reason :Text,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

enum ApprovalState {
  pending @0;
  approved @1;
  denied @2;
  expired @3;
  escalated @4;
}

interface ApprovalGrant {
  state @0 () -> (state :ApprovalState, reason :Text);
  claim @1 () -> (caps :List(GrantedCap));
  cancel @2 () -> ();
}

interface AuthorityBroker {
  request @0 (
    session :UserSession,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);
}

ActionPlan is the structured description of the work the request will perform. Free-form text it carries is for the approval UI; the broker decides authority from the typed step list, never from the summary string.

struct ActionPlan {
  # Brief, redactable, human-readable summary. Used by the approval UI;
  # not used as an authority input by the broker.
  summary @0 :Text;

  # Structured action steps. The broker decides whether each step is
  # representable for the bound session/profile; an unrepresentable step
  # fails the whole request.
  steps @1 :List(ActionStep);

  # True if any step modifies durable state, terminates a service,
  # releases storage, sends external traffic, or is otherwise hard to
  # reverse. Brokers may require step-up authentication and longer
  # review windows when this is set.
  destructive @2 :Bool;

  # Stable identifier the requester sets so it can correlate the resulting
  # grant or queue entry. Brokers must not interpret this as authority.
  requestId @3 :Data;
}

struct ActionStep {
  union {
    spawn :group {
      # Manifest entry name or trusted launcher alias. The broker
      # resolves the alias to a binary identity before grant.
      target @0 :Text;
      # Cap names the spawned process needs from the launcher's
      # advertised set. Each name maps to a concrete `CapRequest`
      # in the enclosing `ActionPlan.requestedCaps`.
      capNames @1 :List(Text);
    }
    serviceControl :group {
      service @2 :Text;
      verb    @3 :ServiceVerb;
    }
    storageOpen :group {
      namespace @4 :Text;
      path      @5 :Text;
      mode      @6 :StorageMode;
    }
    # Free-form structured payload describing a step the broker
    # recognises by name. Lets new step kinds land without re-issuing
    # the schema; brokers refuse unknown `kind` values.
    custom :group {
      kind    @7 :Text;
      payload @8 :Data;
    }
  }
}

enum ServiceVerb {
  start   @0;
  stop    @1;
  restart @2;
  reload  @3;
}

enum StorageMode {
  read       @0;
  readWrite  @1;
  append     @2;
}

CapRequest describes a single capability the plan needs. The broker matches each request against the principal’s role bundle and ABAC context; the response either narrows the request and mints the cap, or denies. There is no widening path.

struct CapRequest {
  # Capability interface name advertised by the broker
  # (`ServiceSupervisor`, `Directory`, `TcpProvider`, ...). The broker
  # refuses unknown interfaces.
  interface @0 :Text;

  # Identifier of the target object inside that interface. For
  # `ServiceSupervisor` this is the service name; for `Directory` it
  # is the namespace path; for `TcpProvider` it is an address-policy
  # selector. The broker validates the target against policy.
  target @1 :Text;

  # Per-cap maximum duration. The grant returns the lesser of this and
  # the plan-level `durationMs` after policy narrowing. Zero means
  # "use plan-level default".
  maxDurationMs @2 :UInt64;

  # Optional attenuation hints (subdirectory, method allow-list,
  # address filter). The broker may further narrow these but must
  # never widen them.
  attenuation @3 :Data;
}

GrantedCap is the same transport-level result-cap concept used by ProcessSpawner – a typed reference to an attenuated, leased capability the broker has minted. It is not a separate authority encoding; reading the granted cap is the only way to use the granted authority.

The native shell holds only a session-bound ApprovalClient. It does not submit arbitrary PrincipalInfo, role, UID, label values, or authentication proofs as authority. The ApprovalClient forwards the bound UserSession and typed request to AuthorityBroker. The broker or a consent service wrapping it holds powerful caps, drives any trusted consent or step-up authentication path, and mints attenuated temporary caps after policy and authentication checks.

The conceptual API intentionally has no authProof argument on the shell-visible path. If a proof is needed, it is collected by SessionManager, the broker, or a trusted approval UI and reflected back to the shell only as pending, approved, denied, expired, or escalated.

Approval Inbox

Synchronous approval is not always available. Step-up authentication, a dual-control destructive action, or a deferred review (for example a service-restart change-window) all need a durable queue: the request must be listable later, persistent across reconnects, and triageable in batch.

The broker exposes that queue through an ApprovalInbox cap minted into the session bundle of whoever may approve. The inbox is not a shell cap; the native shell uses ApprovalClient to submit requests, and a separate principal (a security operator, the same operator under step-up, or a multi-party reviewer set) holds the inbox cap that decides them. Remote workspaces (the CapSet UI) treat ApprovalInbox as the canonical pending-actions surface, which lets a browser session show “you have pending approvals” without granting the browser any of the requested authority.

interface ApprovalInbox {
  # List entries currently awaiting decision. Bounded; the broker
  # enforces a per-inbox visible-window cap and may return fewer than
  # `limit` rows. `truncated` distinguishes "broker capped this page"
  # from "no further rows".
  list @0 (
    cursor :Data,
    limit  :UInt32
  ) -> (
    entries    :List(ApprovalEntry),
    nextCursor :Data,
    truncated  :Bool
  );

  # Look up a specific entry by id. Useful when a UI deep-links to
  # an entry past the listed window.
  entry @1 (entryId :Data) -> (entry :ApprovalEntry);

  # Approve, deny, or escalate a single entry. `approve` returns the
  # `ApprovalGrant` minted by the broker; `deny` and `escalate`
  # transition the entry without minting caps. The decider's reason
  # text is bounded and recorded in audit.
  decide @2 (
    entryId  :Data,
    decision :ApprovalDecision,
    reason   :Text
  ) -> (grant :ApprovalGrant);

  # Bulk-decide entries that share shape (same requester principal,
  # same plan summary fingerprint, same destructive flag). The broker
  # rejects mixed shapes with an explicit diagnostic instead of
  # silently approving heterogeneous requests.
  batchDecide @3 (
    entryIds :List(Data),
    decision :ApprovalDecision,
    reason   :Text
  ) -> (grants :List(ApprovalGrant));

  # Subscribe to inbox change events. The listener cap is held by
  # the broker; logging out of the inbox session revokes the
  # subscription.
  watch @4 (listener :ApprovalListener) -> ();
}

enum ApprovalDecision {
  approve  @0;
  deny     @1;
  escalate @2;
}

struct ApprovalEntry {
  # Broker-minted opaque id, stable across reconnects.
  entryId       @0 :Data;
  # Opaque audit-only principal id of the requester.
  requesterId   @1 :Data;
  # Display name; not authoritative.
  requesterName @2 :Text;
  plan          @3 :ActionPlan;
  requestedCaps @4 :List(CapRequest);
  durationMs    @5 :UInt64;
  state         @6 :ApprovalState;
  # Last decider reason or denial detail; bounded.
  reason        @7 :Text;
  createdAtMs   @8 :UInt64;
  expiresAtMs   @9 :UInt64;
  escalation    @10 :EscalationInfo;
}

struct EscalationInfo {
  # Number of additional reviewers the broker has notified. Zero when
  # the entry has not been escalated.
  reviewerCount @0 :UInt32;
  # Role names of the additional reviewers; never principal ids.
  reviewerHints @1 :List(Text);
}

interface ApprovalListener {
  appended  @0 (entry :ApprovalEntry) -> ();
  decided   @1 (entryId :Data, state :ApprovalState) -> ();
  expired   @2 (entryId :Data) -> ();
}

The ApprovalClient itself does not change shape: a request that the broker cannot decide synchronously still returns an ApprovalGrant immediately, with state == pending and a stable handle. The broker adds an entry to the corresponding inbox; the requester polls or watches its grant; the inbox holder drives the decision. When the inbox holder calls decide(approve), the existing grant transitions to approved and claim returns the minted caps – the requester does not learn an entry id, and the inbox does not learn the requester’s ApprovalGrant cap. The two surfaces meet only at the broker.

Inbox entries are durable across reconnects because entryId is broker-minted and the inbox cap is session-bound rather than transport-bound. Closing a transport does not delete entries; re-presenting the same session-scoped inbox cap rebinds the listener without losing pending state. Entries expire on the broker timer at expiresAtMs and produce an expired listener event; expired entries remain visible to entry() for a bounded audit window defined by broker policy, after which they move to the audit log only.

Elevation Flow

User request (typed directly, or produced by agent-mode tool-use as an ActionPlan before invoking the broker):

restart the network stack

Requested action presented to the broker:

- stop service "net-stack"
- spawn "net-stack"
- grant: nic, timer, log
- wait for health check

Missing authority:
- ServiceSupervisor(net-stack)

Requested duration:
- 60 seconds

Broker decision:

  • Which UserSession and profile is this request bound to?
  • Is that principal/profile allowed to restart net-stack?
  • Is the requested binary allowed?
  • Are the requested grants narrower than policy permits?
  • Do mandatory confidentiality and integrity constraints allow the grant?
  • Is there fresh user presence?
  • Does this require step-up authentication?

If approved, the broker returns a narrow leased capability:

supervisor: ServiceSupervisor(service="net-stack", expires=60s)

It should not return broad ProcessSpawner, BootPackage, or DeviceManager authority when a scoped supervisor cap can do the job.

Authentication

Authentication proof should be consumed by the SessionManager or broker boundary, not exposed as a secret to the shell. Suitable mechanisms include:

  • password or PIN for medium-risk local actions.
  • hardware key or WebAuthn-style challenge for administrative actions.
  • TPM-backed local presence for device or boot-policy operations.
  • OIDC step-up: broker requests a fresh ID token from the session’s IdP with prompt=login, max_age, or stronger acr_values before returning a leased cap. The IdP and SessionManager drive the user interaction; the shell sees only pendingapproved/denied.
  • multi-party approval for destructive policy, storage, or recovery actions.

The shell should never receive raw tokens (including OAuth access or refresh tokens), private keys, recovery codes, or full environment dumps. When the broker must delegate outbound authority to a session — for example, “read from this company’s HR API” — it returns a wrapper capability that holds the AccessToken internally; the shell invokes the wrapper without seeing the bearer string.

Shell Hardening

The shell must treat files, logs, web pages, service output, model output, and CQE payloads as untrusted data. They are not instructions.

Required behavior:

  • show an executable typed plan before authority-changing actions.
  • keep elevated caps leased, narrow, and short-lived.
  • release temporary caps after the plan finishes or fails.
  • audit every approval request, grant, cap transfer, and release.
  • require exact targets for destructive actions.
  • refuse broad phrases such as “give it everything” unless a trusted policy explicitly allows a named emergency mode.
  • keep any model-derived context separate from secrets and authentication proofs; see the LLM/agent-runtime proposal for the model-service side.

The enforcement rule is simple: users and models may propose, explain, and request. Capabilities decide what can happen.

POSIX Shell

The POSIX shell is a compatibility layer for existing software and scripts. It should be useful, but it should not define native capOS administration.

The C-ABI substrate for porting POSIX programs (including a POSIX shell) is specified separately in POSIX Adapter. libcapos exposes the capability ring, CapSet, raw syscalls, and heap to C; libcapos-posix layers the POSIX shape (fd table, errno, pipe / read / write / dup / dup2, fork / execve / waitpid / _exit, posix_spawn and the file-action shims, clock_gettime, UDP socket calls, console-backed stdio) on top. Phases P1.1, P1.2, and P1.3 of that proposal are landed; the C-substrate, pipe cap, recording-shim fork-for-exec, direct posix_spawn path, and Console-backed stdio are proven by QEMU smokes (make run-c-hello, make run-posix-dns-smoke, make run-posix-pipe-smoke, make run-posix-stdio-smoke). The POSIX shell port itself depends on Namespace and File caps, which are tracked in that proposal as gating work after the current phases close.

Mapping

POSIX concepts map onto granted capabilities:

POSIX conceptcapOS backing
/synthetic root built from granted Directory or FileServer caps
cwdcurrent scoped Directory cap
fdlocal handle to File, ByteStream, pipe, terminal, or socket cap
pipeByteStream pair or userspace pipe service
PATHsearch inside the synthetic root or a command registry cap
execProcessSpawner or restricted launcher cap
socketssocket factory caps such as TcpProvider or HttpEndpoint
uid, gid, user, groupsynthetic POSIX profile derived from session metadata
$HOMEpath alias backed by a granted home directory or namespace cap
/etc/passwd, /etc/groupprofile service view, scoped to the compatibility environment
env varsdata only; never authority by themselves

If a POSIX process has no network cap, connect() fails. If it has no directory mounted at /etc, opening /etc/resolv.conf fails. If it has no device cap, /dev is empty or synthetic.

A POSIX shell is launched with both a CapSet and compatibility profile metadata. The profile controls what legacy APIs report. The CapSet controls what the process can actually do.

Compatibility Limits

Exact Unix semantics should not be promised early.

  • Prefer posix_spawn over full fork for the first implementation.
  • fork with arbitrary shared process state can be emulated later if needed.
  • setuid cannot grant caps. At most it asks a compatibility broker to replace the POSIX profile or launch a new process with a different broker-issued cap bundle.
  • Mode bits and ownership metadata do not create authority.
  • chmod can modify filesystem metadata exposed by a filesystem service, but it cannot grant caps outside that service’s policy.
  • /proc is a debugging service view, not kernel ambient introspection.
  • Device files exist only when a capability-backed adapter deliberately exposes them.

This is enough for many build tools and CLI programs without making POSIX the security model.

POSIX Session Caps

A normal POSIX shell session might receive:

terminal      TerminalSession
session       UserSession metadata
profile       POSIX profile view
root          Directory or FileServer synthetic root
launcher      restricted ProcessSpawner/command launcher
pipeFactory   ByteStream factory
clock         Timer

Optional caps:

tcp           scoped socket provider
home          writable user Directory
tmp           temporary Directory
proc          read-only process inspection tree

Administrative caps still require broker-mediated approval.

Recovery Shell

A recovery shell is a separate policy profile, not the normal interactive shell with hidden extra privileges. It may receive a larger cap set, but only after strong local authentication and with full audit logging. Guest and anonymous profiles must not fall into recovery authority by omission.

Possible recovery bundle:

console
boot package read
system status read
service supervisor for critical services
read-only storage inspection
scoped repair caps
approval client

Destructive recovery operations should still go through exact-target approval. The recovery shell should be local-only unless a separate remote recovery policy explicitly grants network access.

Required Interfaces

This proposal implies several service interfaces beyond the current smoke-test surface:

  • UserSession / SessionManager: principal/session metadata, audit context, and guest or anonymous profile creation (user identity proposal).
  • TerminalSession: session-scoped interactive terminal I/O. The first boundary is line-oriented write, writeLine, and bounded readLine with per-call echo control and submitted/cancelled/closed outcomes; resize and paste framing can layer on later.
  • StdIO: explicit text I/O capability serviced by the shell, a test harness, a web gateway, or another UI adapter. It has named stdout, stderr, and status streams plus line, block, and hidden read modes; it does not imply inherited POSIX file descriptors and should not be the semantic command interface for native interactive applications.
  • CommandSession: generic interactive command surface for native applications. It describes command paths, nested subcommands, argument shapes, completions, prompts, redaction metadata, render events, and typed invocation results.
  • TerminalHost / terminal entity: process and session object owning raw terminal transport, line discipline, presentation state, history, resize, and GUI/web framing while granting a foreground session to the shell.
  • SchemaRegistry: maps interface IDs to method names and parameter schemas.
  • CommandRegistry: optional registry of native command capabilities.
  • SystemStatus: read-only process and service status.
  • LogReader: scoped log access.
  • ServiceSupervisor: restart/status authority for one service or subtree.
  • AuthorityBroker / ApprovalClient: session-bound base bundles, plan-specific leased grants, and policy/authentication mediation.
  • CredentialStore, ConsoleLogin, and WebShellGateway: boot-to-shell authentication services for password-verifier setup, passkey registration, federated OIDC login, and text terminal launch (boot-to-shell proposal).
  • OAuthClient, OidcIdentityProvider, TokenVerifier, WorkloadIdentityFederation: OAuth2/OIDC primitives for federated login, outbound service authentication, and inbound resource-server token validation (OIDC and OAuth2 proposal).
  • SshGateway, SshHostKey, AuthorizedKeyStore, SshTerminalFactory, TcpListenAuthority, and RestrictedShellLauncher: production remote CLI terminal ingress, SSH host-key proof, public-key login mapping, scoped TCP listen authority, shell-only launch authority, and SSH-backed TerminalSession launch. The current development host-key proof exposes non-production public metadata and performs bounded fixture signing in QEMU; production host keys still require persistent key management (SSH shell proposal).
  • AuditLog: append-only record of plans, approvals, grants, and releases.
  • POSIXProfile / compatibility broker: synthetic UID/GID, names, $HOME, cwd, and profile replacement without treating POSIX metadata as authority.
  • ByteStream / pipe factory: explicit byte-stream composition for POSIX and selected native pipelines.

These should be ordinary capabilities. A shell only sees the subset it has been granted.

Implementation Plan

  1. Native serial shell

    • Built on capos-rt.
    • Lists initial CapSet entries.
    • Invokes typed methods on the capabilities it was actually granted, including TerminalSession for ordinary interactive sessions.
    • When launched with a restricted launcher or other scoped spawn authority, spawns and waits on exact-grant children without assuming broad BootPackage or ProcessSpawner access.
    • Provides caps, inspect, call, spawn, run, wait, release, and trace.
    • Runs interactive applications as ordinary spawned commands or resident command sessions. StdIO requests may be serviced for text-stream programs, but native app commands should flow through structured command surfaces.
  2. Session-aware shell profile

    • Use the SessionManager -> UserSession metadata and AuthorityBroker(session, profile) -> cap bundle split.
    • Add self/session introspection without making identity metadata authoritative.
    • Start with guest, local-presence, and service-account profiles before durable account storage exists.
  3. Structured native scripting

    • Add typed variables, result-cap binding, and plan serialization.
    • Add schema registry support for method names and argument validation.
    • Add a generic command-surface parser so command <args> and nested subcommands compile to typed invocations without app-specific shell matches.
    • Add explicit byte-stream adapters for commands that need text streams.
  4. Approval broker

    • Define ActionPlan, ActionStep, CapRequest, ApprovalClient, ApprovalInbox, ApprovalEntry, and leased grant records.
    • Add local authentication and audit logging.
    • Make administrative native-shell operations request scoped caps through the broker instead of running from a permanently privileged shell.
    • Wire ApprovalInbox into the operator session bundle so deferred, stepped-up, and multi-party approvals have a durable triage surface instead of relying on synchronous return-from-request.
  5. Boot-to-shell integration

    • Add local console login/setup in front of the native shell.
    • Require a configured password verifier when one exists.
    • Enter setup mode when no console password verifier exists.
    • Treat guest as an explicit local profile and anonymous as a separate remote/programmatic profile, not as missing-password fallbacks.
    • Support passkey-only web terminal setup through local/bootstrap authority, not unauthenticated remote first use.
    • The local console login/setup half of this step is landed; the full boot-to-shell flow (durable multi-verifier accounts, passkey paths, federated OIDC login, web text shell gateway, production SSH shell gateway) is tracked in Boot to Shell.
  6. Agent mode (out of scope here)

    • Defined in Language Models and Agent Runtime: no separate “agent shell” process. The native shell, running in “agent mode”, is the tool runner: it gains a LanguageModel client cap plus a per-tool permission table (auto / consent / stepUp / forbidden), exposes its own session caps as typed ToolDescriptor values to the model service, executes the model’s tool calls against those caps, streams results back into the conversation, and keeps the user in the loop through consent prompts and interrupts. There is no PlannerAgent or static ActionPlan pipeline.
  7. POSIX shell

    • Implement after Directory/File, ByteStream, and restricted process launch exist.
    • Start with posix_spawn, fd table emulation, cwd, scoped root, pipes, and terminal I/O, plus synthetic POSIX profile metadata.
    • Add broader compatibility only as real workloads demand it.

Non-Goals

  • No global root namespace.
  • No shell-owned root/admin bit.
  • No model-visible secrets.
  • No default inheritance of all shell caps into children.
  • No authorization from PrincipalInfo, UID/GID, role, or label values alone.
  • No promise that POSIX scripts observe exact Unix behavior without a compatibility profile that grants the needed caps.

Open Questions

  • Should the native shell syntax be CUE-derived, Cap’n-Proto-literal-derived, or a smaller custom grammar?
  • How should schema reflection be packaged before a full runtime SchemaRegistry exists?
  • How should later TerminalSession extensions such as resize and paste framing fit without exposing raw transport authority to ordinary shells?
  • How should the broker fingerprint plans for ApprovalInbox.batchDecide shape-equivalence? A direct hash of ActionPlan.steps is enough for identical plans submitted by the same requester profile, but near-identical plans differing only in requestId or summary text must still batch; near-identical plans differing in step targets or attenuation must not. The broker design needs an explicit fingerprinting rule before batchDecide can be enabled.
  • How should audit logs be stored before persistent storage exists?
  • How should interactive terminal UX scale beyond the planned “one typed capability per command” native-shell surface? The current prototype only exposes narrow typed commands; the questions below apply to the proposed surface, not just what already runs. Several concrete pain points are open:
    • Cap management is manual. A shell user holds a CapSet and must inspect, name, attenuate, pass, and release caps explicitly per command. That is the right model for trust, but it is hostile for everyday work compared with a Unix prompt where $PWD, $PATH, open fds, and ambient credentials disappear from the user’s mind. The question is what affordances (named bindings, scoped session “workspaces”, broker-issued bundles bound to a task, auto-release on plan completion, undo/redo on cap moves, a visible “current authority” indicator) the shell should provide so the typical user is not hand-curating a cap graph for every line. None of this should re-introduce ambient authority; the goal is ergonomics over an already typed graph, not hiding it.
    • No agreed convention for passing parameters to programs. The manifest currently launches binaries with a named CapSet and no positional args, no argv, no environment block, and no structured parameter struct (see system.cue and SystemManifest in schema/capos.capnp); init’s ProcessSpawner-driven children inherit only the caps named in the spawn plan. Shell spawn ... with { ... } syntax is similarly cap-only. That is consistent, but it leaves “what does this program need to know besides its caps?” unanswered: where do free-form values (a chat channel name, an adventure save slot, a resize width) live? Options range from a typed LaunchParameters capnp struct passed through the spawn plan, to a convention that every program declares a parameter schema discovered via SchemaRegistry, to letting parameters always travel as fields on the first method call against a CommandSession/service cap rather than at launch time. The proposal should pick a single shape and describe how the manifest, shell spawn/run, native applications, and POSIX argv adapters all map onto it.
    • No replacement for Unix pipes. The native composition example uses |> but defers byte-stream semantics to ByteStream/StdIO, which is a strictly weaker pipe and not a data-processing model. Real workloads on Unix lean on text streams precisely because they are cheap and structured-enough; capOS can do better with typed records. The open question is whether to standardize a higher-level data-processing primitive — for example, YTsaurus-style map/reduce operators where each stage declares input and output schemas (RecordStream<T>?), the runtime negotiates a wire format (capnp records, framed JSON, columnar, raw bytes) at the boundary, and the shell’s |> becomes a pipeline planner rather than a byte pump. That would give native shell pipelines first-class typed composition without making every interface look like ByteStream. The question is whether this belongs in shell scope, in a separate data-processing proposal, or as a RecordStream capability in the schema registry that the shell merely consumes.
    • No story for ordinary shell programming constructs. The proposed surface is one typed call per line plus |>; the prototype is even narrower. Real interactive and scripted use needs conditionals (branch on a cap call result, on CapException kind, on a value field), loops (iterate a List, fold a RecordStream, retry-with-backoff against a Timer), local variables and assignment beyond the implicit $ from |>, user-defined functions/procedures that take typed parameters and capability arguments, early-return / break, and structured error handling that distinguishes transport-level CapException from application-level result variants. Each of these has capability-graph consequences that POSIX shells never had to face: does a function body close over the caller’s CapSet by reference or by an explicit captured set, are caps bound inside a loop iteration auto-released at the end of that iteration, does a try/recover block release leased broker grants on the failure path, can a function be saved and re-invoked across sessions (i.e. does it become a persistent ActionPlan template), and how does the shell present a partial failure mid-pipeline without leaving orphan caps. The proposal should decide whether the native shell language defines these constructs itself, borrows them from a host language (CUE, a small embedded Rust-like DSL, an existing scripting runtime exposed as a capability), or stays deliberately non-Turing-complete and forces non-trivial control flow into spawned programs that expose typed CommandSession interfaces back to the shell.
    • No environment-variable concept, and no clear replacement. Unix $VAR / export does three jobs at once: ambient configuration inherited by every child, a per-process key-value scratchpad, and a side channel for caller-supplied tweaks (PATH, LANG, TZ, HTTP_PROXY, XDG_*). capOS deliberately has none of this — the manifest passes only a CapSet, and the shell does not synthesize a process-wide string-keyed table. There is also no obvious immediate need: configuration that should be authoritative belongs in a Config capability, locale/timezone are policy state on a session or service cap, and per-invocation tweaks fit the still-undecided parameter-passing convention above. The open question is whether capOS ever needs an explicit environment-like primitive (e.g. a KeyValueScope capability bound to a session, an inheritable structured “ambient context” attached to a spawn plan, or a typed ConfigOverlay channel) for the cases where Unix would have used an environment variable, or whether each historical use case should instead be replaced by a dedicated capability (Locale, Clock, ProxyPolicy, XdgPaths, LogLevel) and the absence of an environment table treated as a feature rather than a gap. POSIX compatibility still has to expose getenv/environ, but that is a separate per-process synthetic view inside the POSIX profile, not a native-shell concept.

Proposal: Remote Session CapSet Clients

Let a regular host application connect to a capOS instance, authenticate through the same session machinery as shells and gateways, receive a broker-issued remote view of its CapSet, and invoke the granted capabilities over standard Cap’n Proto RPC. The first proof can be a Linux Rust CLI because it is easy to script, but the design is for host applications generally: native GUI apps, Tauri apps with Rust backends, server-side webapp gateways, desktop tools, and agent runners can all consume the same remote session CapSet model.

The important correction is that this is not a special “remote chat client” and not another shell transport. Chat, Paperclips, Adventure, system-info, command surfaces, and future service APIs should be ordinary capabilities in a remote session bundle. A shell is one possible client of that bundle; it is not the universal protocol.

Current State

The tree has several local interop and UI proofs:

  • demos/capnp-chat-interop runs inside capOS, accepts one scoped TCP connection, decodes a schema-framed Chat.send parameter message, calls the resident chat endpoint, returns a schema-framed result, and exits.
  • The host harness uses a Linux Python script plus the pinned capnp tool to encode/decode request and result messages.
  • demos/remote-session-capset-gateway runs inside capOS, listens through a manifest-scoped TcpListenAuthority on guest port 2327, authenticates a remote session through SessionManager, returns a broker-shaped remote CapSet view, calls session/system-info DTO operations, and proves wrong-interface, unknown-cap, and stale-session denials. It derives login source metadata from the accepted socket and a gateway-generated connection event id.
  • tools/remote-session-client is a regular Linux Rust client crate. Its library is UI-neutral so the same client logic can back a CLI harness, native GUI, Tauri backend, or trusted web gateway.
  • remote-session-ui is a trusted loopback web bridge in that crate. Its Rust backend holds the TCP connection and remote session state, serves a browser UI, and exposes only view models, call results, denial diagnostics, and redacted transcript rows to browser JavaScript. The focused make run-remote-session-capset-ui harness drives that UI against a gateway-only QEMU fixture.
  • remote-session-web-ui is a capOS-served browser UI backend. Default make run starts it on guest port 8080 with loopback host forwarding, and make run-remote-session-self-served-web-ui proves the full boot-resource UI bundle is served from the capOS-owned origin while preserving the same browser-safe view-model boundary. This remains local/QEMU evidence; the cloudboot L4, private GCE, and public ingress proofs are separate tasks.

Those proofs are useful because they show external Cap’n Proto data can cross the QEMU TCP boundary and reach capOS-hosted services through narrowed listener caps. The remote-session proof is the first target-shaped slice, but it is not the final RPC API. It still lacks:

  • standard capnp-rpc message transport;
  • live typed RPC proxy objects rather than DTO-mediated gateway operations;
  • live endpoint-backed proxy objects beyond the current authenticated per-session DTO worker slices for Chat.send, Adventure status/look/inventory/go(direction), and the Paperclips Path B bridge-internal initial/command/status/projects synthesis from cached serviceLaunch state;
  • Paperclips service-runner launch on the default make run manifest (Path B wires the gateway worker, bridge dispatch, UI launch slot, and the system-remote-session-paperclips.cue focused manifest now declares its AuthorityBroker launch policy, but default-manifest Paperclips launch wiring remains future work);
  • the on-wire Paperclips control-plane (Path C): extending RemoteGatewayRequest/RemoteGatewayResponse with paperclips arms so the bridge no longer synthesizes responses from cached launch state and the gateway worker drives PaperclipsGameClient over a real DTO arm rather than the manifest-static game endpoint fallback;
  • rich Adventure/Paperclips client controls and broader service-specific worker/client implementations beyond the current Chat, Adventure, and Path B Paperclips slices;
  • complete object lifetime and exception behavior;
  • broader revocation and object-drop propagation beyond the current kernel-backed DTO logout and connection-teardown path;
  • TLS/mTLS and expanded auth adapters beyond password, anonymous, and guest;
  • resource accounting for remote references, in-flight calls, and result sizes.

Goals

  • Support a normal host client built and run outside capOS. A Linux Rust CLI is the smallest harness; native GUI and Tauri/webapp-backed clients should not need a different capOS protocol.
  • Authenticate through capOS session/admission services, not through an application-specific service secret.
  • Support multiple admission methods: local password where policy enables it, public-key signatures, OIDC/OAuth browser or device flows, passkey/WebAuthn through the web gateway path, mTLS client identity, guest/anonymous profiles where explicitly enabled, and future service/workload credentials.
  • Return a live remote CapSet view whose entries are typed RPC client objects, not serialized local cap-table slots.
  • Let the client call any granted remote-proxyable capability by name and expected interface ID.
  • Let a host UI discover broker-approved service profiles, start allowed game server processes through a restricted service-runner, and attach the capabilities those processes export or receive without exposing local spawn authority.
  • Support bidirectional session UI composition: a host UI can call capOS capabilities, and capOS-side services or agents can propose bounded changes to the host session’s panes, command palette, visualizations, density, theme, and workflow-specific controls through explicit UI capabilities.
  • Keep local-only authority local: cap IDs, endpoint generations, receiver selectors, session-global identifiers, and kernel result-cap indexes never become portable remote authority.
  • Preserve session-bound invocation context. Remote calls run under the gateway/worker session created for that remote client.
  • Make logout, disconnect, transport breakage, session expiry, policy revocation, and object drop observable and fail closed.

Non-Goals

  • General network transparency across arbitrary capOS hosts.
  • OCapN compatibility or third-party handoffs.
  • Browser JavaScript receiving capOS capability objects directly. A webapp may be a front end, but a trusted server, gateway, or Tauri Rust backend holds the remote CapSet.
  • Letting capOS services execute arbitrary host UI code, inject unreviewed JavaScript/CSS, spoof trusted browser/desktop chrome, or persist UI changes outside the granted session UI scope.
  • Replacing SSH, WebShellGateway, native shell, or interactive command surfaces.
  • Exposing raw ProcessSpawner, raw process handles, endpoint owner caps, local cap ids, result-cap slots, raw network factories, broad storage roots, key material, or browser-held capOS capability objects as a default remote bundle. Process handles stay backend-local.
  • Treating a browser or webview as a capOS capability host. Browser code sees view models, launch forms, command descriptors, user events, diagnostics, and rendered results; the trusted Rust/backend side holds the remote session and any remote capability proxies.
  • Treating password authentication as the only or preferred remote path.
  • Serializing the kernel CapSet page or local cap table to the client.

UI Scope And Architecture

This section is the single-page synthesis future contributors should read before changing anything in tools/remote-session-client/ or the gateway. The detailed mechanics live in the rest of this proposal, the backlog (docs/backlog/remote-session-capset-client.md), and the plan (docs/backlog/remote-session-capset-client.md); this section captures what the UI is for, what it must hold, and how the pieces decompose.

Goal

A remote operator, after authenticating to a capOS gateway, can drive every remote-proxyable capability the broker grants their session – directly, with typed UI, without a shell, without webview-held capOS handles, without leaking session-id hex, cap slots, or process handles to the browser. The CapSet UI is not a shell, not a generic API explorer, and not a browser; it is a peer client of the same broker bundle a shell would consume, over TCP/RPC instead of the ring page, with a backend-held authority boundary and a typed UI on top.

What the UI is for

Grouped by intent, not by panel. Each item is constrained by the corresponding section later in this proposal.

  • Sign in to a remote capOS host. OS-style login surface with a visible username field, secondary endpoint/auth controls, no full persistent technical header. The gateway advertises the auth methods the system makes available (narrowed only by explicit manifest policy); disabled methods stay listed and clearly marked so the protocol is not password-shaped. The web UI’s username field is empty by default – the bridge does not pre-fill from CAPOS_REMOTE_SESSION_USER, host USER, or any other host-side identity hint, because a pre-fill leaks operator/account hints to anything observing the page before authentication. The CLI may take --user as an explicit operator override; the web UI does not. Denials surface with explicit codes, never as silent transport errors.
  • Understand who/what the operator is. Session view: principal, profile, auth method, auth strength, freshness/expiry, logout. Redacted session-id only. Lifecycle states observable: live / logged_out / future expired / revoked / recovery_only. Stale-call attempts must visibly fail closed. (See ## Invocation Context and docs/proposals/session-bound-invocation-context-proposal.md.)
  • Discover what was granted. CapSet view as the inspection surface (name, interface id, transfer policy, lease expiry, get-by-name+id); service catalog view as the task-oriented surface (broker- and launcher-advertised runnable profiles, required grants, exported descriptors, launch/probe/status). See ## Service Catalog And Game Server Launch.
  • Use what was granted. For every cap the broker bundles, the UI must offer at least a generic invocable form – not just inspection. Service-specific rich clients (Adventure rich client, real Chat panel, Paperclips client, future agent-shell-services) layer on top of the same backend-held caps. Where a service exposes a typed CommandSurface (see docs/proposals/interactive-command-surface-proposal.md), the UI renders typed buttons/inputs/selectors driven by that surface’s metadata rather than hand-coded controls. Where a service exposes text/audio/video surfaces, the UI consumes them through the Chat substrate (docs/proposals/chat-multimedia-substrate-proposal.md): listener caps for incoming text/audio/video, capnp -> stream methods for outgoing media, capability-mediated peer/channel granting, and a WebRTC mapping for the browser-to-backend audio/video path. The CapSet UI never holds the listener caps directly; the trusted Rust backend owns them and emits redacted view-model events plus WebRTC handles for the browser.
  • Host a terminal panel when granted. The CapSet UI is not defined as a terminal emulator and works without one. But when the broker grants a TerminalSession cap – for a native shell, a POSIX shell, or any StdIO-based service that expects a terminal on the other side – the UI may host a terminal panel for that cap. The boundary stays: terminal bytes flow through a backend-held TerminalSession; the browser renders frames it receives, never opens a raw shell or holds a ProcessSpawner.
  • Surface agent-shell-exposed capabilities as first-class. The CapSet UI does not contain the LLM loop, model client, or tool-execution runner – those live in the agent shell process (see docs/proposals/llm-and-agent-proposal.md). But agent-shell-exposed services (e.g. “send message to running agent”, “approve queued action”, “audio stream to/from agent”) are services the broker can bundle. When bundled, the CapSet UI exposes them through the same per-session worker / typed view-model pattern as Chat or Adventure. Action-approval queues are the canonical capability-driven UI surface here – the policy engine asks, the operator sees a queue and approves/denies per item.
  • Launch services where policy allows. Service-runner launch flow: select profile → see required grants → side-effect-free probe → confirm → backend launches restricted server graph (e.g. adventure-server + NPC companions) → backend attaches/retains exported descriptors in the backend-held remote CapSet. Browser sees launch form, status, denials, descriptors – never raw ProcessSpawner or process handles.
  • Diagnose / audit. Low-level probes (denied-chat, stale-call, system MOTD, session-summary diff) live in a Diagnostics or Session panel, not interleaved with normal service use. Redacted transcript export in its own view; redaction status visible; raw authority material absent. UI smoke checks for forbidden markers (processhandle, capabilitymanager, capslot, …).
  • Bidirectional UI composition (later). A capOS service may, only when granted a RemoteUiSurface cap, propose bounded layout/theme/command/visualization patches and receive typed user events back. Cannot inject JS/CSS, spoof login chrome, persist UI state without a separate settings cap, or exceed quota/size bounds. See ## Bidirectional UI Composition.

Design invariants the UI must hold

The proposals don’t specify pixel layout; they specify a small number of hard invariants. Every UI design choice has to fit these:

  1. Authority boundary. Trusted Rust backend holds: TCP connection, remote session state, per-session worker proxies, capOS cap references, broker bundle policy, raw snapshots used to compute view models. Browser holds: view models, command descriptors, launch forms, redacted transcript rows, theme state.
  2. Session-bound invocation. Every post-auth call runs under the immutable SessionContext of the per-session worker. The browser cannot select identity by request field; the backend cannot construct a fresh SessionContext from request bytes. Logout, disconnect, expiry, revocation must break all session-bound proxies and fail closed before result bytes reach the caller.
  3. Privacy-preserving disclosure. Default endpoint metadata is opaque (scoped_ref + freshness). Subject fields (principal, profile, auth strength) appear in the UI only because the broker policy explicitly disclosed them for that service.
  4. Capability = invoke gate; UI surface = render gate. A button on the screen is not what authorizes a call. The cap held in the backend is. UI controls that aren’t currently invocable must say “planned / not remote-proxyable yet” rather than imply they work.
  5. Interface = permission. Method-level access lives in the schema, not in a per-cap rights bitmask. Narrowing what a remote client can do means a narrower wrapper cap from the broker – not a flag on the same cap.
  6. Side-effect-free probes are real. A probe response that says “supported / required grants accepted / message” did not spawn anything, allocate endpoint owners, or attach caps.
  7. Redaction is structural, not after-the-fact. Sensitive fields are dropped or redacted on the way into view models, not stripped from logs after the fact. Backend tests assert browser envelopes never contain raw session-id hex or password material.
  8. UI smoke fails if any visible button is unexercised. This prevents the UI from accumulating decorative controls.
  9. Theme/layout state is local UI state, not capOS state. Persistence requires an explicit settings cap.

Architecture decomposition

flowchart LR
  subgraph host[Host machine]
    subgraph browser[Browser / webview / Tauri webview]
      js[Browser JS - view models, forms, results, redacted transcript, theme state]
    end
    subgraph rust[Trusted Rust backend - tools/remote-session-client]
      bridge[HTTP bridge - /api/* endpoints]
      app[AppState - session VM, caps VM, snapshots, transcript, automation]
      tcp[Gateway TCP connection - schema-framed DTOs today, capnp-rpc planned]
      lib[remote-session-client lib - protocol, frame, session_diff, transcript]
    end
    cli[CLI binary - same lib backend]
  end

  subgraph capos[capOS guest in QEMU or future hardware]
    subgraph gw[Remote-session gateway process]
      tcplisten[TcpListenAuthority on guest port 2327]
      authflow[Auth flow - password, anonymous, future adapters]
      sm[SessionManager.login -> UserSession]
      broker[AuthorityBroker.remoteClientBundle]
    end
    subgraph workers[Per-session RPC workers]
      chatw[Chat worker - holds Chat client facet]
      advw[Adventure worker - holds Adventure endpoint]
      futurew[Future workers per service - terminal, agent, voice...]
    end
    subgraph services[Backing services]
      cs[chat-server]
      ad[adventure-server + NPCs]
      pc[paperclips-server - future]
    end
    kernel[Kernel - SessionManager, CapTable, Endpoints, ring, audit]
  end

  js -- HTTP JSON --> bridge
  bridge --> app --> lib --> tcp
  cli --> lib
  tcp -- TCP / DTO today / capnp-rpc planned --> tcplisten
  tcplisten --> authflow --> sm --> broker
  broker -- backend-held descriptors / caps --> app
  app -- worker spawn requests --> broker
  broker --> workers
  chatw --> cs
  advw --> ad
  workers <--> kernel

Key seams:

  • Gateway boundary (demos/remote-session-capset-gateway/): scoped TcpListenAuthority, SessionManager, AuthorityBroker, narrowly approved backend launch authority. No raw NetworkManager, raw ProcessSpawner, broad endpoint authority.

  • Per-session worker boundary (demos/remote-session-chat-worker/, demos/remote-session-adventure-worker/, future workers): each endpoint-backed remote method runs in a worker that holds the live session-bound caller context. Worker spawn is validated; logout/connection-close tears down workers; release flushing happens on shutdown.

  • Trusted Rust backend boundary (tools/remote-session-client/src/): the AppState keeps gateway: Option<GatewayConnection>, current_snapshot: RemoteSessionSnapshot (raw), and view-model fields (redacted). The HTTP bridge’s /api/* surface is the only path the browser has into capOS authority.

  • Browser boundary (tools/remote-session-client/ui/): pure client of /api/state view models, /api/call/* typed calls, /api/capset/*, /api/probe/*, /api/transcript/*. JS state is presentation: theme, active tab, login form values, click coverage report.

  • Transport evolution. Today: bespoke schema-framed Cap’n Proto DTOs, length-prefixed frames, request/response sequence numbers. Planned: standard capnp-rpc with live proxy objects, exception mapping, release/drop, promise pipelining. The backend boundary stays the same; the wire shape changes.

    Standard capnp-rpc (the capnp-rpc Rust crate, v0.25 at the time of writing) is std-only and requires a futures executor; the QEMU-side gateway is #![no_std] #![no_main] with a synchronous loop { accept; loop { recv_frame; handle; send_frame } } shape (demos/remote-session-capset-gateway/src/main.rs). The wire-level replacement is therefore gated on either bringing an async runtime to capOS userspace or shipping a sync-friendly capnp-rpc adapter. Until then, transport-lifetime / exception behavior carries the contract documented next, which the eventual rewrite must preserve.

    Runtime decision for the first proxy layer: use a temporary dual-stack. The Linux host backend now has a local capnp-rpc Chat facade/proxy layer because that side already has std and can run a futures executor. The facade translates backend-held typed proxy calls into the existing RemoteGatewayRequest / RemoteGatewayResponse DTO transport, so the guest gateway remains synchronous and #![no_std]. This proves host-backend proxy semantics, denial/disconnect mapping, and browser-safe view-model integration; it does not claim standard capnp-rpc framing or live RPC vats inside capOS. Gateway-wire replacement waits for the userspace runtime decision above, and the dual-stack must be removed after the reviewed guest-side RPC path carries live service traffic.

Transport lifetime and exception contract

The bespoke transport’s lifetime contract is what the future capnp-rpc proxy layer has to preserve. The host-side test module in tools/remote-session-client/src/bin/remote_session_ui.rs pins each rule end-to-end:

  • Connection close mid-call clears state, returns gatewayDisconnected. A TCP FIN observed during a request surfaces as 503 gatewayDisconnected with view.lastResult.code = "gatewayDisconnected", view.connected = false, session = null, empty caps / services / launchers, and a disconnect transcript row scoped to the operation that failed. Covered by authenticated_gateway_close_during_call_clears_view_with_reconnect_guidance, oversized_gateway_response_during_call_clears_view_with_reconnect_guidance, password_denial_then_closed_tcp_resets_before_retry, http_password_denial_then_closed_tcp_preserves_backend_error_and_clears_view.
  • Half-open transport (write succeeds, read stalls) times out cleanly. The bridge’s read_timeout (endpoint.io_timeout()) must fire and surface the same gatewayDisconnected shape; no hang or partial-state leak. Both the post-request stall case and the partial-frame-header stall case are covered: half_open_response_read_times_out_as_disconnect, partial_response_header_then_stall_treated_as_disconnect.
  • Protocol-level decode errors (sequence mismatch, malformed payload) yield 500 internal without tearing down the connection. This documents current behavior; the future capnp-rpc rewrite is expected to tighten this to a connection- level abort once the proxy layer is in place. Covered by response_with_wrong_seq_yields_internal_error, malformed_response_payload_yields_internal_error.
  • Immediate re-login after transport failure succeeds. No retry / cooldown gate; the recovered session must not echo the prior call’s failure as lastResult. Covered by immediate_relogin_after_mid_call_close_succeeds.
  • disconnect rows survive into the operator-visible exported transcript (GET /api/transcript/redacted) scoped to the operation that failed and free of stream-level metadata (peer addresses, frame sizes, raw os error strings, secrets). Covered by disconnect_recorded_in_exported_transcript_after_mid_call_close.
  • Gateway-side teardown calls kernel UserSession.logout on both the explicit-logout DTO path and the connection-close path. Verified by the QEMU-driven harness in tools/qemu-remote-session-capset-smoke.sh, which asserts that UserSession.logout cap call succeeded; remote session stale and connection teardown UserSession.logout cap call succeeded both appear during the multi-cycle interop run.
  • Post-logout calls fail closed. The bridge keeps the gateway socket alive after logout so a stale-call probe gets an explicit staleSession denial rather than a transport failure. Covered by repeated_stale_calls_after_logout_remain_fail_closed and the worker-targeted stale_chat_proxy_after_logout_returns_typed_denial.
  • Worker/proxy lifetime failures preserve the same split. Worker-targeted Chat.send transport loss and oversized worker responses clear backend gateway/session state and surface gatewayDisconnected with reconnect guidance, while post-logout worker calls remain typed staleSession denials on the still-open gateway socket. The backend-only capnp-rpc facade maps transport breakage to ErrorKind::Disconnected, and maps DTO denials or unexpected worker/proxy responses to Failed CapException-like errors rather than panics or silent broader authority. Covered by chat_worker_transport_breakage_clears_state_and_redacts_export, oversized_chat_worker_response_maps_to_disconnect_without_frame_leak, generated_chat_client_transport_breakage_maps_to_disconnected_exception, generated_chat_client_dto_denial_maps_to_failed_cap_exception_like_error, and generated_chat_client_unexpected_worker_response_maps_to_failed_exception.
  • Revoked leases are not yet separately observable. The current DTO surface carries leaseExpiresAtMs on cap entries, but it has no explicit revoke/lease-expired call path or denial code that can distinguish a revoked lease from staleSession or methodDenied. Tests must not fake this coverage; add it with the standard RPC object lifetime path or a reviewed DTO denial shape.
  • Redacted transcript export does not expose exception/lifetime internals. Worker-targeted disconnect, oversized response, and stale-session exports are asserted free of raw socket addresses, OS error strings, frame-size diagnostics, local cap ids, result-cap labels, proxy table positions, raw session-id hex, passwords, and host endpoint hints.

Resource and revocation bounds

Each per-session resource class has an explicit named ceiling and maps over-cap conditions to a typed denial diagnostic that reuses the transport-error envelope from above. Operators tuning these bounds should re-audit the per-session memory budget and the operator-multitool scenario before changing them; raw observed counters are not exposed to browser-facing view models.

ResourceConstantDefaultWhere enforcedDenial code
Outstanding worker calls per sessionMAX_OUTSTANDING_WORKER_CALLS_PER_SESSION4tools/remote-session-client/src/bin/remote_session_ui.rs::transact (gates Adventure / Chat-shaped requests before submission)tooManyWorkerCalls (HTTP 503)
Transcript ring per sessionTRANSCRIPT_ROWS_CAP (4096), TRANSCRIPT_DETAIL_BYTES_CAP (1 MiB)row + byte capsAppState::push_transcript / enforce_transcript_caps in the same filedrop-oldest plus a single audit "transcript truncated; ..." row
Backend cap holders per sessionMAX_BACKEND_CAP_HOLDERS_PER_SESSION (64), MAX_BACKEND_SERVICE_CATALOG_ENTRIES (64), MAX_BACKEND_LAUNCHER_CATALOG_ENTRIES (32)per-Vec entry capscapset_list / service_catalog / launcher_catalog in the same filetooManyCapHolders (mirrors transport-error envelope)
Browser-session owner slotone tentative or authenticated ownerfirst-wins bridge ownerlogin-route preflight reserves before gateway authentication; success finalizes on cookie rotation, failure releases the reservationsessionAlreadyInUse (HTTP 409)
Local HTTP request parserrequest line 8 KiB, header line 8 KiB, 96 headers, aggregate headers 32 KiB, body 64 KiB, fixed read/write timeoutloopback bridge input boundsread_http_request and handle_connection reject before route dispatch, JSON parsing, auth, or gateway I/OhttpLineTooLong, tooManyHeaders, headersTooLarge, requestBodyTooLarge, requestTimeout
Local HTTP handler slotsMAX_HTTP_HANDLER_THREADS (32)concurrent request handlersaccept loop acquires a bounded slot before spawning a handler threadhandlerLimitExceeded (HTTP 503)
Concurrent gateway logins per principalMAX_CONCURRENT_LOGINS_PER_PRINCIPAL (4), PRINCIPAL_TABLE_SLOTS (32)per-principal counter, distinct-principal table ceilingdemos/remote-session-capset-gateway/src/lib.rs::PrincipalLoginTable::try_admit, called from both password and anonymous login pathsserviceUnavailable with “per-principal concurrent-session cap reached…”

The bridge-side bounds are exercised by host tests in remote_session_ui.rs::tests (transcript_row_count_cap_drops_oldest_with_truncation_marker, transcript_byte_cap_drops_oldest_with_truncation_marker, transcript_at_exact_row_cap_does_not_truncate, capset_list_at_max_holders_bound_stores_all_entries, capset_list_over_max_holders_returns_typed_denial, service_catalog_at_max_entries_bound_stores_all_entries, service_catalog_over_max_entries_returns_typed_denial, launcher_catalog_at_max_entries_bound_stores_all_entries, launcher_catalog_over_max_entries_returns_typed_denial, outstanding_worker_calls_at_bound_still_allow_one_more_after_completion, outstanding_worker_calls_over_bound_returns_typed_denial, concurrent_first_wins_login_reservations_allow_one_post_login_owner, failed_login_reservation_releases_for_later_owner, http_parser_rejects_oversized_request_line_before_route_work, http_parser_rejects_oversized_header_line, http_parser_rejects_too_many_headers, http_parser_rejects_aggregate_headers_too_large, http_parser_rejects_oversized_body_from_content_length, http_parser_times_out_incomplete_request_line, handler_slots_bound_concurrent_request_threads). The gateway-side bound is exercised by host tests in demos/remote-session-capset-gateway/src/lib.rs::tests (admits_up_to_max_concurrent_logins_per_principal, rejects_over_cap_admission_with_typed_denial, release_reopens_a_slot_for_the_same_principal, distinct_principals_have_independent_counters, release_to_zero_drops_the_slot, release_unknown_principal_is_a_noop, table_full_admission_does_not_grow_past_slot_ceiling).

Two contracts the future capnp-rpc rewrite must preserve: fail-closed bound exhaustion never panics or leaks raw counters into browser envelopes (only typed denial codes plus a backend audit row); and operator-visible audit material (bound-exhausted transcript rows, drop-oldest truncation markers) is recorded backend-side through the existing redacted-transcript path, not surfaced through new untyped error channels.

Layer map for future iterations

LayerOwnerTodayHeading toward
Wiregateway ↔ backendlength-prefixed schema-framed DTOsstandard capnp-rpc over TCP, then TLS/mTLS
Authgatewaypassword, anonymous, guest; disabled methods advertised+ public key, OIDC (device-code + PKCE), passkey, mTLS, service credential
Bundlebrokershell-bundle-shaped wrapper for remotefirst-class remoteClientBundle profile shape
Workerper-sessionChat.send, Adventure status/look/inventory/gobroader Adventure verbs, real Chat panel, Paperclips worker, generalized lifecycle, terminal-session host, agent-shell services
Backend (Rust)trustedAppState, snapshot, view models, transcript, automation, first-wins BrowserSession ownership, local HTTP parser/handler bounds, per-session resource bounds (worker-calls, transcript rows + bytes, cap holders, gateway logins per principal)live RPC proxy state, RemoteUiHost cap holder
Browseruntrusted UIlogin + Services / CapSet / Diagnostics / Transcript / Session SPAricher service-specific clients, generic CommandSurface-driven forms, agent-shell mode, terminal panel for granted TerminalSession, RemoteUiSurface rendering
Host packagingtrustedCLI, make remote-session-ui, make remote-session-tauri check/dev wrapperdistributable Tauri package sharing the same Rust backend

Self-served capOS web UI boundary

The first self-served browser UI is a capOS-hosted application service, not the host remote-session-ui development bridge moved into the guest. A new capOS userspace service, remote-session-web-ui, owns the HTTP listener, serves the UI bundle, runs the authenticated web-session backend, holds the remote session CapSet/proxy state, and projects browser-safe view models.

Static assets are boot-package resources. The implementation should reuse the reviewed host UI asset source or a smaller reviewed subset, but the served copy is an immutable, fixed-name bundle embedded in the capOS boot package and granted by manifest resource name with a pinned digest or equivalent build-time integrity label. remote-session-web-ui serves only that bundle and a small generated bootstrap document; it does not expose a host directory, capOS storage root, asset traversal, or development hot-reload path.

The first listener surface is HTTP/1.1 on a manifest-scoped TcpListenAuthority for a dedicated UI port such as guest port 8080. HTTP serves static assets plus same-origin JSON API routes. WebSocket, server-sent events, and terminal/media streaming remain later extensions that need separate route-level bounds; the first proof should avoid them so the authority and validation surface is small.

The manifest grants for remote-session-web-ui are narrow: scoped TcpListenAuthority for the UI port, SessionManager, AuthorityBroker, the immutable UI asset bundle, and the same restricted remote-client service-runner/backend-launch authority needed to expose approved service descriptors. It must not receive raw NetworkManager, raw socket factories, broad storage roots, raw ProcessSpawner, shell launcher authority, endpoint owner caps, or arbitrary endpoint creation authority.

The service is the trusted backend and holds remote CapSet/proxy state server-side. Browser JavaScript receives only view models, launch forms, user-event commands, typed results, denial diagnostics, and redacted transcript rows. It never receives raw capOS caps, raw ProcessSpawner, process handles, endpoint owner authority, local cap IDs, result-cap slots, session-global identifiers, remote CapSet handles, host usernames, host environment variables, host paths, or QEMU-forwarding identity hints.

Login remains session-manager shaped. The browser submits credentials or guest/anonymous intent to the capOS-served JSON endpoint; the service derives source metadata from its accepted socket and service-generated event id, asks SessionManager for a UserSession, asks AuthorityBroker for the remote-client bundle, and only then exposes disclosed session/service fields as browser-safe models. The browser cannot select principal, profile, worker session context, or backend cap holder by replaying a request field.

Gate 1B is now an evidence ladder rather than a single proof name. The landed local/QEMU layer is:

  • remote-session-self-served-web-ui: a focused manifest boots remote-session-web-ui, browser automation loads assets from the capOS-owned origin, logs in, calls at least one granted capability through the service-held backend state, proves logout/stale failure stays closed, and checks forbidden authority markers are absent from browser-visible envelopes and transcripts.
  • remote-session-self-served-web-ui-default-run: default make run starts the capOS-served UI on guest port 8080 and forwards it to a loopback host port for local operator use.
  • remote-session-self-served-full-ui-bundle: the capOS service now serves the reviewed fixed-name boot-resource bundle, including the operator workspace assets and /bundle/manifest.json, with explicit content types, no directory traversal, and digest-pinned build evidence.

Those proofs do not close the selected GCE Web UI path by themselves. The local service proof cloud-prod-remote-session-web-ui-l4-local-proof is done: it runs remote-session-web-ui through the non-qemu cloudboot socket path using the Phase C userspace network stack and configured IPv4 route, not the older QEMU-only kernel socket fixture or the host remote-session-ui bridge. After that, cloud-gce-private-self-hosted-webui-proof proves private GCE reachability over the live NIC without public IP or public firewall exposure. cloud-gce-public-self-hosted-webui-ingress-tls is the later public operator-access task; it remains on hold for explicit public-ingress/TLS authorization even though the ingress policy design is recorded.

Rollback is manifest/build-target selection: remove the focused target and the remote-session-web-ui listener/asset grants while keeping the host-served make remote-session-ui bridge and the remote-session CapSet gateway unchanged.

Architecture

flowchart TD
    Client[Host app: CLI, GUI, Tauri, or web gateway] -->|TCP/TLS + capnp-rpc| Gateway[RemoteSessionGateway]
    Gateway --> Auth[Auth adapters]
    Auth --> Sessions[SessionManager]
    Gateway --> Broker[AuthorityBroker]
    Broker --> Worker[Per-session RPC worker]
    Broker --> Catalog[Remote service catalog]
    Catalog --> Runner[Restricted service runner]
    Runner --> GameServers[Game server processes]
    Worker --> RemoteCapSet[RemoteCapSet]
    RemoteCapSet --> Proxies[Remote capability proxies]
    GameServers --> Proxies
    Proxies --> LocalCaps[capOS capabilities]
    Worker --> Audit[AuditLog]

The remote listener is a trusted gateway. In the final RPC shape it accepts the transport, performs or delegates authentication, obtains a UserSession, asks the broker for a remote-client bundle, and hosts a per-session RPC vat. That vat exports a RemoteSession object and remote proxy objects for capabilities in the broker-issued bundle. During the temporary dual-stack period, the guest side still accepts DTO frames and the Linux host backend hosts the first local proxy facade over those DTO calls.

For the first implementation the per-session worker may be an ordinary capOS service process. That shape matches the session-bound invariant: one workload process has one immutable session context. A single long-lived gateway may handle pre-auth connection state, but post-auth capability invocation should run inside a worker whose session context is the authenticated remote session, or through an equivalently reviewable dispatch path that cannot mix unrelated user sessions as ambient authority.

Bootstrap Interfaces

The DTO surface below is now pinned in schema/capos.capnp: RemoteAuthStart, RemoteAuthStep, RemoteServiceGrantRequirement, RemoteServiceExport, RemoteServiceProfile, plus the RemoteSessionGateway, RemoteAuthFlow, RemoteSession, RemoteCapSet, RemoteServiceCatalog, and RemoteServiceRunner interfaces. Round-trip coverage for the new structs lives in capos-config/tests/remote_capnp_rpc_dto_roundtrip.rs. The transport that consumes them is still gated on the userspace async-runtime decision (capnp-rpc v0.25 is std-only and needs a futures executor). The first proxy slice is host-backend-only and dual-stack: it uses capnp-rpc locally in the trusted Linux backend for Chat while translating to the legacy RemoteGatewayRequest/RemoteGatewayResponse DTO union on the gateway wire. The schema and generated bindings do not change for that slice, and browser JavaScript still receives only view models, typed results, typed denials, and redacted transcript rows.

enum RemoteAuthKind {
  password @0;
  publicKey @1;
  oidcDeviceCode @2;
  oidcAuthorizationCodePkce @3;
  passkey @4;
  mtlsClientCert @5;
  guest @6;
  anonymous @7;
  serviceCredential @8;
}

struct RemoteAuthMethod {
  kind @0 :RemoteAuthKind;
  label @1 :Text;
  profileHints @2 :List(Text);
  interactive @3 :Bool;
  enabled @4 :Bool;
}

struct RemoteAuthStart {
  kind @0 :RemoteAuthKind;
  selector @1 :LoginSelector;
  requestedProfile @2 :Text;
  clientNonce @3 :Data;
  # Source metadata is intentionally not a client-supplied field.
  # The gateway derives LoginSourceMetadata from the accepted socket
  # and its own connection event id before calling
  # SessionManager.login. A client-supplied source field would let
  # remote callers forge audit metadata downstream services depend on.
}

struct RemoteAuthStep {
  prompt @0 :Text;
  redaction @1 :Bool;
  url @2 :Text;
  userCode @3 :Text;
  challenge @4 :Data;
  expiresAtMs @5 :UInt64;
}

interface RemoteSessionGateway {
  authMethods @0 () -> (methods :List(RemoteAuthMethod));
  start @1 (request :RemoteAuthStart) -> (flow :RemoteAuthFlow);
  guest @2 (requestedProfile :Text) -> (session :RemoteSession);
  anonymous @3 (requestedProfile :Text) -> (session :RemoteSession);
}

interface RemoteAuthFlow {
  next @0 (response :Data) -> (step :RemoteAuthStep, done :Bool,
      session :RemoteSession);
  cancel @1 () -> ();
}

struct RemoteCapEntry {
  name @0 :Text;
  interfaceId @1 :UInt64;
  transferPolicy @2 :Text;
  leaseExpiresAtMs @3 :UInt64;
}

interface RemoteSession {
  info @0 () -> (info :SessionInfo);
  capSet @1 () -> (caps :RemoteCapSet);
  renew @2 (proof :Data, requestedDurationMs :UInt64)
      -> (session :RemoteSession);
  logout @3 () -> ();
}

interface RemoteCapSet {
  list @0 () -> (entries :List(RemoteCapEntry));
  get @1 (name :Text, expectedInterfaceId :UInt64) -> (cap :AnyPointer);
}

struct RemoteServiceGrantRequirement {
  name @0 :Text;
  interfaceId @1 :UInt64;
  transferMode @2 :Text;
  holder @3 :Text;  # backendHeld, serviceOwned, or clientFacet
}

struct RemoteServiceExport {
  name @0 :Text;
  interfaceId @1 :UInt64;
  transferPolicy @2 :Text;
}

struct RemoteServiceProfile {
  id @0 :Text;
  label @1 :Text;
  processGraph @2 :List(Text);
  requirements @3 :List(RemoteServiceGrantRequirement);
  exports @4 :List(RemoteServiceExport);
  state @5 :Text;  # unavailable, attachable, startable, running
}

struct RemoteServiceLaunchRequest {
  profileId @0 :Text;
  grantNames @1 :List(Text);
}

struct RemoteServiceCatalogEntry {
  id @0 :Text;
  label @1 :Text;
  summary @2 :Text;
  capName @3 :Text;
  transportInterfaceId @4 :UInt64;
  schemaInterface @5 :Text;
  proxyStatus @6 :Text;
  methods @7 :List(Text);
  notes @8 :List(Text);
}

struct RemoteServiceLaunchStatus {
  profileId @0 :Text;
  status @1 :Text;         # notLaunched, unsupported, denied, ready, running
  launchSupported @2 :Bool;
  message @3 :Text;
  acceptedGrantNames @4 :List(Text);
  exportedServices @5 :List(RemoteServiceCatalogEntry);
}

interface RemoteServiceCatalog {
  list @0 () -> (profiles :List(RemoteServiceProfile));
}

interface RemoteServiceRunner {
  probe @0 (request :RemoteServiceLaunchRequest)
      -> (status :RemoteServiceLaunchStatus);
  start @1 (request :RemoteServiceLaunchRequest)
      -> (status :RemoteServiceLaunchStatus);
  attach @2 (profileId :Text) -> (status :RemoteServiceLaunchStatus);
}

The AnyPointer result is proposal shorthand for an ordinary Cap’n Proto capability pointer whose expected interface ID was already checked by the gateway. Generated client helpers should immediately cast it to the requested typed client. The remote client does not receive a numeric local capId, endpoint selector, result-cap index, or session identifier it can replay somewhere else.

The catalog and runner sketches are also proposal-level. They describe the remote-facing contract, not the internal implementation. The completed launch DTO/probe slice uses a serviceLaunch request/response arm for the side-effect-free probe: RemoteServiceLaunchRequest carries only a profile id plus explicit grant names, and RemoteServiceLaunchStatus reports status such as notLaunched, unsupported, denied, ready, or running, launch support, accepted grant names, and exported or planned service descriptors. The current Adventure slice makes that serviceLaunch path a real restricted backend launch for the default make run manifest, so Adventure may report running and launchSupported=true after the approved server graph starts. Paperclips remains a future launch profile. A capOS service runner may use local spawn authority, BootPackage data, or broker-held service caps inside capOS, but the remote client and browser/webview code receive only service descriptors, launch requests, status results, denials, and remote capability descriptors. Raw ProcessSpawner, process handles, endpoint owner caps, local cap ids, and result-cap slots are not exposed.

Service Catalog And Game Server Launch

The default make run story and the focused game proofs are intentionally different:

  • system.cue imports cue/defaults/defaults.cue, boot-launches standalone init, and lets init start chat-server, remote-session-capset-gateway, remote-session-web-ui, and the foreground shell. The default binary set includes Adventure server/NPC/client binaries and the terminal Paperclips binary. make run forwards guest port 8080 to a loopback host port and prints remote self-served UI: tcp 127.0.0.1 <port> -> guest :8080; the make run-default-web-ui target proves the capOS-served endpoint with browser automation. Adventure is not boot-started automatically; the current remote-session serviceLaunch slice starts adventure-server plus simple NPC companions through a restricted backend runner when requested. Paperclips landed in Path A + Path B as described below; the default make run manifest reports launchSupported=false / status=missingBinary for the paperclips launcher until Path C (the kernel-side AuthorityBroker allowlist extension and the on-wire DTO arm) lands.
  • The default remote-session gateway is narrow. It has console, scoped TCP-listen authority for guest port 2327, SessionManager, and AuthorityBroker, plus narrowly approved backend launch authority for the Adventure profile; it does not expose raw ProcessSpawner, raw network manager/socket authority, endpoint owner handles, process handles, local cap ids, result-cap slots, or game service endpoint owner caps. The remote-session-web-ui service separately receives scoped TCP-listen authority for guest port 8080, SessionManager, AuthorityBroker, and console.
  • make run-adventure uses system-adventure.cue to start chat-server, adventure-server, Adventure NPC companion processes, an adventure-scenario-test, and the shell. The Adventure server exports an Adventure endpoint, consumes a client facet of Chat, and owns room/player state keyed by live caller-session references.
  • make run-paperclips uses system-paperclips.cue to start paperclips-server and paperclips-proof-server services exporting PaperclipsGame endpoints. The terminal client is then launched with explicit StdIO, game endpoint, timer, and optional proof_accelerator grants. The server owns generated content, game state, timer cadence, command specs, status snapshots, project entries, unlock checks, and game-rule mutation.

The remote UI should not treat those terminal transcripts as the product boundary. The staged path is:

  1. The broker advertises a remote service catalog for the authenticated session. The catalog is derived from manifest/default profiles and policy, and includes only services the remote profile may inspect, attach to, or start.
  2. The launch DTO/probe slice is complete. It defines a remote-safe launch request, status, and probe contract for cataloged profiles. It can report unsupported launch state, accepted grant names, a message, planned exported service descriptors, and denial status without side effects: no process starts and no new capabilities are attached.
  3. The current Adventure slice implements the restricted service-runner path for the default manifest. It starts the Adventure server plus simple NPC companion processes with explicit named grants, then returns launch status and remote descriptors for exported or broker-held caps. Process handles stay backend-local.
  4. The trusted Rust backend attaches those descriptors to the backend-held RemoteCapSet and drives typed calls. Browser JavaScript or a Tauri webview receives view models, launch/status forms, service descriptors, denials, and results, not raw capOS handles. The implemented DTO worker slices cover Chat.send plus Adventure status, look, inventory, and first mutable bounded go(direction).
  5. The first UI panels can be generic: service list, start/attach, status, read-only Adventure controls, a bounded movement control, transcript, and denial details. Purpose-built Adventure and Paperclips clients can layer richer rendering and broader mutable game actions over the same service-runner and remote CapSet backend later; Paperclips does not have default-manifest remote launch support yet.

Operator commands should stay explicit:

make run
cargo run --manifest-path tools/remote-session-client/Cargo.toml \
  --target x86_64-unknown-linux-gnu \
  --bin remote-session-client -- --host 127.0.0.1 --port <printed-port>
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui

Add --launch-adventure to the CLI command to start the default-manifest Adventure graph through the restricted serviceLaunch path and require a running status. Add --adventure-status after --launch-adventure to require read-only Adventure status, look, and inventory responses through the session-bound worker path. Add --adventure-go east after --launch-adventure to require the first bounded mutable Adventure go(direction) response through that same worker path.

The Tauri wrapper runs from this repository and reuses the same backend boundary by loading the loopback remote-session-ui surface in a desktop webview:

CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri

That target checks Tauri CLI and Linux build prerequisites, reports dependency/scaffold status, and runs a deterministic wrapper check by default. Set CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev. Missing host Tauri packages fail with explicit diagnostics and point operators back to make remote-session-ui. The webview receives the same browser-safe view models, events, denials, typed results, and redacted transcript rows as the trusted local web bridge; the backend keeps the remote session and caps.

Bidirectional UI Composition

A conventional GUI program opens a window and owns the controls inside it. A remote capOS session does not need to be that limited. The host app can expose a session-scoped UI host capability to capOS, and capOS-side services or agents can use that capability to propose a better interface for the current task:

  • Paperclips can ask for counters, project controls, and status charts instead of printing lines.
  • Chat can ask for a channel list, unread badges, and a message pane.
  • Adventure can ask for a map pane, inventory slots, command buttons, and room transcript.
  • A diagnostics agent can open log, metric, and trace panes side by side, highlight the relevant capability calls, and change density for a debugging session.
  • A teaching or accessibility agent can request larger type, simplified controls, or a guided task layout for a particular session.

The authority is explicit and separate from service authority. Holding Chat does not let a service rewrite the user’s UI. Holding RemoteUiHost or a narrow UiSurface facet lets the service propose bounded UI changes for the current remote session. The host app remains the compositor and policy enforcer.

Conceptual shape:

enum UiPatchKind {
  openSurface @0;
  closeSurface @1;
  updateModel @2;
  setLayoutHint @3;
  setThemeHint @4;
  addCommand @5;
  removeCommand @6;
}

struct UiSurfaceSpec {
  surfaceId @0 :Data;
  title @1 :Text;
  kind @2 :Text;
  safetyClass @3 :Text;
  modelSchema @4 :UInt64;
}

struct UiPatch {
  kind @0 :UiPatchKind;
  surfaceId @1 :Data;
  payload @2 :Data;
  expiresAtMs @3 :UInt64;
}

struct UiEvent {
  surfaceId @0 :Data;
  command @1 :Text;
  payload @2 :Data;
  userInitiated @3 :Bool;
}

interface RemoteUiHost {
  open @0 (spec :UiSurfaceSpec) -> (surface :RemoteUiSurface);
  theme @1 (scope :Text, hints :Data) -> ();
}

interface RemoteUiSurface {
  apply @0 (patch :UiPatch) -> ();
  poll @1 (maxEvents :UInt16) -> (events :List(UiEvent));
  close @2 () -> ();
}

The payloads above should become typed structs before implementation. They are shown as Data only to keep the sketch short. The important boundary is that UI updates are declarative patches and typed view models, not arbitrary host code. The host validates the requested surface kind, model schema, command set, theme tokens, data size, update rate, and safety class before rendering anything.

This is still a remote CapSet client model:

host UI holds RemoteSession + RemoteCapSet
host UI grants a narrow RemoteUiHost/RemoteUiSurface cap to a trusted worker
capOS service or agent sends declarative UI patches through that cap
host UI renders and sends typed user events back
service effects still require ordinary service caps

The direction is therefore bidirectional but not symmetric. The host app can call capOS service caps. capOS can shape the session UI only through UI caps the host granted. Neither side gains ambient authority over the other.

Safety rules:

  • Host chrome, login prompts, origin indicators, permission prompts, and emergency reset controls are reserved. capOS-rendered surfaces cannot spoof them.
  • UI patches are session-scoped. Persistent layout/theme changes require an explicit profile/settings cap or user confirmation.
  • Theme and look/feel changes use bounded tokens or validated design-system variables, not raw CSS injection.
  • UI command descriptors are data; executing a command still calls a typed capability under the current session policy.
  • The user can close, reset, or pin surfaces against agent rearrangement.
  • UI updates are quota-bound and auditable when they materially affect workflow, consent, disclosure, or action execution.
  • Browser front ends keep raw capOS caps server-side or in a Tauri/native Rust backend. Browser JavaScript receives rendered state and sends user events; it does not hold RemoteCapSet entries.

This is the broader version of the WebShell idea. A web shell can be more than a terminal emulator: it can be a session workspace whose composition is negotiated by the capabilities present in the session. The terminal remains one surface in that workspace, not the only surface.

Authentication And Admission

Authentication adapters all produce the same output: a UserSession plus profile inputs for the broker. They differ only in how the proof is obtained and verified.

  • Password: maps to the existing SessionManager.login(method, selector, proof, source) path when remote password login is enabled by policy. It must use the existing credential failure/backoff/audit rules and must not be the only supported remote method.
  • Public key: maps to SessionManager.sshPublicKey or a generalized signature-auth method. SSH userauth and raw remote RPC public-key auth can share account/key records, but the transcript bytes must be domain-separated by protocol and channel binding.
  • OIDC/OAuth: device-code flow fits headless or CLI clients; authorization code + PKCE fits browser-assisted clients. The OAuth/OIDC service verifies ID tokens and maps external subjects through the user-identity admission model before SessionManager mints a session.
  • Passkey/WebAuthn: belongs behind the web-authenticator path. A remote native client may open a browser or use a platform authenticator, but raw authenticator secrets never become capOS app data.
  • mTLS client certificate: TLS client-auth can identify a principal or pseudonymous subject through certificate policy. Certificate identity is an admission input; the resulting CapSet still comes from the broker.
  • Guest and anonymous: explicit policy profiles. They are not fallbacks for missing credentials and should receive short leases and narrow bundles. Guest admission is currently surfaced through the bridge as an explicit AuthMode::Guest option (/api/login/guest, CLI --guest); the gateway enforces the requestedProfile == "guest" and principal.kind == Guest invariants before broker dispatch via the validate_guest_admission helper, and refuses with RemoteErrorCode::AuthenticationDenied and the redacted "guest login denied" message regardless of which policy branch fired. When the manifest has no guest seed account the gateway returns RemoteErrorCode::DisabledAuthMethod so the bridge can distinguish a manifest-disabled method from a credential failure. Guest sessions surface only the configured display name ("Remote Guest") and principal_kind enum label to the bridge; the seeded principal id bytes are never disclosed through the bridge transcript or API envelope.
  • Service/workload credentials: future non-human clients can authenticate with OAuth client credentials, token exchange, mTLS, or signed workload assertions. They receive service-profile bundles, not human shell bundles.

Every method must record source metadata and protocol/channel binding appropriate to its transport. A successful proof selects a principal and session; it does not directly grant service authority.

Remote CapSet Semantics

A local process starts with a read-only CapSet page plus local cap-table entries. A remote client instead receives a live RemoteCapSet object:

  • list returns names, interface IDs, display metadata, and lease summaries.
  • get returns a typed RPC capability pointer only if the name exists and the expected interface ID matches.
  • The returned object is a proxy owned by the remote-session worker.
  • Dropping the remote object releases the worker’s hold edge when no other remote references remain.
  • Logout, expiry, revocation, disconnect, or worker shutdown breaks all session-bound proxy objects. The current DTO gateway implements kernel-backed explicit logout and owned-session connection teardown; full live proxy object-drop/revocation behavior remains future work.

This is still an actual session bundle. It is not a copy of the kernel’s local CapSet ABI. The remote representation exists because a Linux process has no capOS ring page, no capOS CapSet mapping, and no local cap table.

Invocation Context

Remote capability calls should look like ordinary calls to the target service:

remote client call
  -> capnp-rpc message
  -> per-session worker proxy
  -> local capOS capability call
  -> target service sees the worker's live session context

The remote client cannot choose service-visible subject identity. Request fields are ordinary data. If a service needs subject details, it uses the existing subject-disclosure policy: explicit request plus a matching service-scoped disclosure grant. By default it receives only the opaque service-scoped caller-session reference used by the session-bound invocation model.

Error And Lifetime Model

The remote path keeps the existing error split:

  • Cap’n Proto RPC transport errors and broken connections become RPC exceptions or disconnected promises.
  • Proxy/worker infrastructure failures become CapException-like capability exceptions.
  • Domain outcomes remain schema result fields or unions.
  • A missing cap name, interface mismatch, denied profile, stale session, or revoked lease is an observable denial, not a silent fallback to a broader service.

Open promises must fail when the remote session logs out or the connection is closed. The worker must release local caps on every close path.

Relationship To Shells And Gateways

Remote session CapSet clients are a peer of shell transports:

  • Native shell: a local capOS process that uses its local CapSet and ring. It can later expose a schema-aware REPL over the same capabilities a remote client sees, but the remote client does not need to spawn a shell.
  • SSH shell: a production CLI terminal transport. It authenticates and launches capos-shell with a TerminalSession. It should not become the only way for external programs to call typed services.
  • WebShellGateway: browser terminal, webapp, and agent UI transport. Browser JavaScript must not receive raw capOS caps; the gateway can use the remote session CapSet model server-side and expose terminal frames, view models, command descriptors, or bounded tool requests to the browser. This is close to the same mental model as a “web shell”, except the shell is not the required protocol. The web UI can present service-specific controls over the same session CapSet, and capOS-side services can adjust the session workspace through UI composition caps. A remote CapSet web UI can be built before the full WebShellGateway by omitting terminal delegation, shell-runner policy, and agent execution; it is just another host client of the remote session bundle.
  • Tauri or desktop GUI: the Rust/native backend may hold the remote RemoteSession and typed capability clients, while the UI layer receives rendered state, command descriptors, and user-intent events. The UI layer should not receive replayable capOS authority as data. The backend may grant narrow UI-surface caps back to capOS services so they can propose adaptive layouts without gaining arbitrary desktop control.
  • Agent shell: the agent runner holds session caps server-side and presents tool descriptors to the model. A hosted agent can use the same remote session bundle shape as long as actual capOS invocations remain in the trusted worker.
  • Interactive command surfaces: command metadata can be one of the granted capabilities. A remote client can render command specs directly instead of scripting text through a shell.

Authority Rules

  • The gateway receives scoped listener/TLS/auth/session/broker/audit authority, not raw broad network or spawn authority.
  • Post-auth workers receive only the broker-issued remote-client bundle plus proxy lifecycle authority.
  • Default remote bundles should be narrower than operator shell bundles.
  • Raw ProcessSpawner, unrestricted NetworkManager, key-vault, credential store, broad account store, broad storage root, and host debug caps require explicit elevated policy.
  • Remote proxyable caps must declare transfer/lifetime policy. Local-only caps may appear in a local shell CapSet without being exportable through RemoteCapSet.
  • Capability names are lookup conveniences. Interface ID and broker policy define whether a returned object is usable for the requested type.
  • Replayable handles are forbidden. Session IDs, grant IDs, endpoint metadata, object epochs, and proxy table positions are not bearer tokens.

Design Grounding

  • Session-Bound Invocation Context defines the one-session-per-process invariant and privacy-preserving endpoint caller-session metadata.
  • User Identity and Policy defines principals, sessions, profiles, admission sources, renewal, and brokered CapSet minting.
  • Boot to Shell defines the existing CredentialStore/SessionManager/AuthorityBroker path and non-password login directions.
  • SSH Shell Gateway, Certificates and TLS, and OIDC and OAuth2 define public-key, TLS/mTLS, and federated admission inputs.
  • capos-service defines the service lifecycle shape needed for listener loops, per-session context, shutdown, drain, and metrics.
  • Capability-Based Service Architecture defines the broader service taxonomy, capability layering, and init/spawn boundary the gateway, per-session workers, and restricted service runner reuse. The default make run gateway, the Adventure service-runner path, and the Paperclips Path B worker plumbing inherit the process-startup, attenuation, and HTTP-capability rules described there; Path C will extend the broker allowlist surface in the same authority frame.
  • Remote Session UI Security defines the web-security posture for the loopback remote-session-ui bridge and its Tauri desktop wrapper – per-browser BrowserSession cookies, CSRF/CSP/cookie discipline, first-wins ownership, local HTTP parser bounds, and Tauri capability minimization – that the trusted Rust backend in this proposal exposes to the browser. Both proposals reference each other; this proposal owns the upstream remote-session CapSet wire and host-client shape, while that proposal owns the browser-facing authority boundary.
  • R17 – Remote-session UI bridge and Tauri wrapper are research-only routes long-horizon residual risk (distributable packaging, desktop automation, non-loopback exposure) back to this proposal and the remote-session UI security proposal. Non-loopback remote-session UI exposure must remain blocked until that production posture is accepted by the corresponding review-finding task.
  • Interactive Command Surfaces defines typed command sessions that can be rendered by remote clients.
  • Browser Capability and Agent Web Sessions defines browser-side authority boundaries and gateway mediation for web UI sessions.
  • Language Models and Agent Runtime defines agent runners, tool proxies, and browser-agent UI orchestration boundaries.
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web grounds production object-capability RPC, live object bindings, and remote resource-exhaustion discipline.
  • Spritely, OCapN, and CapTP grounds distributed object-capability lifetime, promise, reference, and handoff questions while staying non-binding for capOS wire compatibility.
  • Cap’n Proto Error Handling grounds the exception-versus-domain-result split that the host-backend facade and eventual gateway RPC transport must preserve.

Implementation Shape

The first implementation is deliberately small:

  1. Keep the existing capnp-chat-interop service and harness as the transport starting point, but rename the target outcome in planning docs to remote session CapSet interop. Done.
  2. Add generated Linux Rust bindings for the relevant schema subset. Done.
  3. Add a host client library that connects through QEMU user TCP. Done with a schema-framed DTO transport; replacing it with standard capnp-rpc framing and live proxy objects remains the next transport step.
  4. Add a capOS gateway that supports one policy-enabled auth method plus explicit guest/anonymous behavior. Done for password, anonymous, and guest, with disabled public-key, OIDC, and passkey/WebAuthn method entries advertised. Guest admission ships with a dedicated RemoteGatewayRequest.guestLogin arm, the validate_guest_admission broker-side enforcement helper that pins the requestedProfile == "guest" plus principal.kind == Guest invariants, and a RemoteErrorCode::DisabledAuthMethod path so the bridge can distinguish a manifest-disabled method from a credential failure.
  5. Return remote session summary, CapSet list, and typed get metadata. Done as DTOs.
  6. Call at least two capabilities from the bundle. Done for session, system_info, the worker-backed Chat.send path, and Adventure status/look/inventory/go(direction) after serviceLaunch. The focused chat proof also shows a service-domain denial remains a schema chatSent(false) result and that chat-server sees bounded session-bound caller metadata through disclosure policy. Broader Adventure methods, Paperclips methods, live proxy objects, and object-level release/drop lifecycle remain future work.
  7. Prove a missing cap, wrong interface ID, wrong profile, stale session, and logout path fail closed. Done for the focused proof, including a kernel-backed UserSession.logout call and owned-session disconnect propagation in the DTO gateway; full release, live proxy object-drop, renewal, and revocation propagation remains future work.
  8. Add a first host UI client over the current UI-neutral Rust client. Done for a trusted local web bridge with a loopback browser UI and Rust backend that holds the remote session state. It covers endpoint configuration, auth methods, login, session summary, CapSet inspection, sessionInfo, systemMotd, denial probes, logout, stale-call proof, redacted transcript export, and a focused browser automation proof. The repo-local Tauri wrapper now checks or launches the same loopback backend/webview boundary; distributable packaging remains later. The UI remains separate from WebShell and does not include a terminal emulator, shell-runner policy, or agent execution.
  9. Define the launch DTO/probe shape after the read-only remote service catalog. Done: this slice defines a remote-safe launch request, launch status, and side-effect-free probe so the CLI/web backend can render forms and denials for Adventure/Paperclips profiles. It deliberately does not start processes, create endpoint owners, attach caps, or expose raw ProcessSpawner, process handles, endpoint owner handles, local cap ids, result-cap slots, or browser-held capOS capabilities.
  10. Implement the actual restricted Adventure service-runner path. Done: the default-manifest Adventure profile starts adventure-server plus simple NPC companion processes and attaches or retains the resulting Adventure/chat descriptors/caps in the backend-held remote CapSet. Paperclips landed in two halves: Path A added the read-side RemotePaperclips* DTO schema (RemotePaperclipsCommandResult, RemotePaperclipsCommandList, RemotePaperclipsProjectList, RemotePaperclipsStatusSnapshot, RemotePaperclipsEvent, RemotePaperclipsProjectStatus, RemotePaperclipsEventKind, and the single-command RemotePaperclipsCommand input DTO) in schema/capos.capnp, with bounded wire-roundtrip coverage in capos-config/tests/remote_paperclips_dto_roundtrip.rs; Path B added the dedicated demos/remote-session-paperclips-worker/ crate mirroring the Adventure worker shape, the gateway SessionWorkerKind::Paperclips enum variant with matching SessionWorkerSet arms and spawn_paperclips_graph/build_paperclips_service_launch/ fill_paperclips_launcher/paperclips_catalog_status helpers, a manifest-static game endpoint slot on the gateway capset, bridge RequestKind::PaperclipsInitial/Command/Status/Projects synthesis from cached serviceLaunch state (the on-wire control plane lands in Path C), UI launch slot plus status chip with paired smoke automation (paperclipsLaunchVisible/paperclipsStatus/paperclipsStatusObserved), the system-remote-session-paperclips.cue focused manifest, and the make run-remote-session-paperclips-vm / make run-remote-session-paperclips-ui gates. Raw ProcessSpawner, process owner handles, endpoint owner caps, local cap ids, result-cap slots, and browser-held capOS capabilities stay out of the remote contract. Process handles stay backend-local. Adventure status/look/inventory controls and first mutable bounded go(direction) use the session-bound worker pattern; Paperclips Path B uses the same worker shape with bridge-side response synthesis until Path C lands the wire-level DTO arm and the broker allowlist grants for the default manifest. Broader Adventure controls, Path C wire/broker extension, and rich Paperclips client implementations remain later.
  11. Replace the bounded make remote-session-tauri preflight with the actual repo-local Tauri wrapper over the same Rust backend. Done for check/dev mode: CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri validates the wrapper scaffold and host prerequisites, and CAPOS_REMOTE_SESSION_TAURI_MODE=dev launches the wrapper through cargo tauri dev. Distributable packaging remains gated on reviewed sidecar/backend lifecycle handling.
  12. Add the first typed proxy layer as a host-backend-only temporary dual-stack. Done for Chat: tools/remote-session-client/ hosts a local capnp-rpc facade that translates backend-held proxy calls to the existing DTO gateway protocol while keeping schema/generated bindings, gateway wire shape, and browser authority unchanged. The later gateway rewrite must provide standard capnp-rpc framing, typed remote proxy objects, exception mapping, release/drop handling, and resource bounds before the bespoke DTO service path can be retired.
  13. Layer richer service clients on top of the same backend boundary. The first richer client is a session-summary diff: a pure Rust helper in tools/remote-session-client/src/session_diff.rs compares two snapshots of the session view (CapSet entries plus SessionInfoSummary) and returns typed CapSetDiff / SessionSummaryFieldDiff records keyed on (name, interface_id) and on visible session fields. Renewals or policy rebinding surface as policy_changed rather than removed + added. The trusted web bridge stores the raw snapshots backend-side and exposes /api/call/session-diff-refresh, which returns a redacted SessionSummaryDiffVm. Browser JavaScript receives only that view model: added/removed cap entries by (name, interfaceIdHex, transferPolicy, leaseExpiresAtMs), policy/lease changes, redacted session-id changes, and a summary string. The first call after login captures a baseline (hasBaseline=false); subsequent calls return the diff against the previous snapshot. The browser renders the diff in a dedicated “Last refresh diff” pane on the Session view; raw session_id_hex, replayable cap handles, and kernel session ids stay backend-side. The focused UI smoke clicks “Refresh & Show Diff” twice and asserts both the no-baseline and post-baseline shapes. Two backend host tests cover the baseline + no-change path and the added-cap + expiry-change path.
  14. Add a separate UI-composition proof only after the basic session proof: grant a narrow test RemoteUiSurface, accept one declarative patch, send one typed user event back, and prove the service cannot spoof trusted chrome or persist layout state without the relevant cap.

Later slices can add more auth adapters, TLS, renewal, browser-assisted auth, service credentials, UI composition surfaces, promise pipelining, and distributed GC.

Visual Design Handoff

The host UI visual language is anchored on two Claude Design handoffs:

  • The original capOS Login bundle (delivered 2026-05-02 13:26 UTC). Only the CSS tokens and design intent were ported into the production UI; the prototype is not kept in-tree.
  • The capOS Workspace bundle (delivered 2026-05-02, see tools/remote-session-client/ui/design-bundle/). Covers the post-login workspace shell, chat list, active group chat with embedded approval cards, active DM with E2E lock + fingerprint card, active call (collapsed banner + full-pane), stage room, and a “start sheet” with the four ocap-clean entry flows (open DM from contact card, redeem invite, browse directory, start ephemeral chat). This bundle IS kept in-tree as reference at tools/remote-session-client/ui/design-bundle/ and includes conversation transcripts, HTML prototypes, JSX components, and the unique theme assets. See its CAPOS-INTEGRATION.md for the bundle-to-live-UI mapping and the iteration-7 prerequisite (CSP refactor + per-browser BrowserSession cookie before any inline scripts/styles from the prototype reach production).

Both bundles ship four themes (Space, Mountain, Light, Operator) and a consistent token system (themes.jsx in either bundle is authoritative for palette / typography / radii / blur). The branding assets actually shipped under branding/ were copied into tools/remote-session-client/ui/assets/ for the bridge to serve; the prototype’s reference imagery is kept only in the in-tree design-bundle directory.

What landed in tools/remote-session-client/ui/:

  • Vanilla CSS rewrite of styles.css around the design’s theme tokens. No React, no Babel, no third-party CDN script tags. Trust boundary stays intact: the loopback bridge serves only static assets.
  • index.html restructured to the design’s hero + auth-card + footer layout with mobile responsiveness, an Operator dashed inner frame (capos://auth label), and the original data-test surface fully preserved so make run-remote-session-capset-ui still passes.
  • A trusted-static feature flag block (window.CAPOS_UI_FEATURES, overridable via ?features=) gates surfaces that are scaffolded but not yet backed by the Rust gateway. Default flags match what the current backend honours.

Surfaces scaffolded but flag-gated off by default (no functional support in capOS yet; future tracks will wire them):

  • Passkey sign-in (?features=passkey). Tracks docs/proposals/boot-to-shell-proposal.md (passkey/WebAuthn, credential setup) and docs/proposals/cryptography-and-key-management-proposal.md.
  • OIDC / SSO providers (?features=sso for Google/GitHub/Okta). Tracks docs/proposals/oidc-and-oauth2-proposal.md. The trusted Rust backend must own the provider integration; browser JavaScript must continue to receive only view models, results, and denials.
  • MFA second-factor step (?features=mfa). Tracks docs/proposals/boot-to-shell-proposal.md. The 6-digit input animates end-to-end as a UI demo today; production wiring is a future slice.
  • Success step (?features=successStep). The current Rust backend transitions straight to the workspace on session start; the success card is design-parity scaffolding for a future mid-step surface.
  • Capability-grant consent strip. Removed from the design itself during iteration (the user concluded it demonstrated the wrong thing for capOS); kept in the deferred list because a future consent-on-grant flow for OAuth-style external identities would re-use the same visual language.

Surfaces flag-gated on by default but UI-only today (decorative state without a backend round-trip):

  • System status pill, Region pill, Language pill, Footer, Hero panel, Remember-device checkbox, Forgot-password link, Password show/hide toggle.

Constraints the visual layer must keep across future slices:

  • Login is a dedicated OS-like screen with a visible username field and no full persistent technical header. Resource profile names such as operator are not user-typed system details.
  • Browser login sends username/password only. The username field is empty by default: the browser UI does not pre-fill from CAPOS_REMOTE_SESSION_USER, host USER, or any other host-local identity hint, because that would disclose account hints before authentication.
  • Authenticated users land in a compact Services-first workspace where Session, CapSet, Diagnostics, and Transcript are separate views. The UI smoke harness must continue to fail if any visible button is not exercised; new flag-gated buttons must stay hidden by default so the smoke surface does not grow without paired automation coverage.
  • No third-party CDN script tags or runtime frameworks are added to the trusted UI. Theme switching uses the existing data-theme attribute on <html>/<body>; CSS variables flip the design tokens.

Proposal: SSH Shell Gateway

Production remote shell access for capOS using SSH as a terminal transport while preserving the native shell’s capability boundaries.

Status Split

Implemented:

  • SSH-shaped authority prerequisites and fixture authentication proof: development-only sign-only host key, manifest-seeded authorized-key lookup, public-key session minting over fixture authentication bytes, unsupported feature policy/audit classification, restricted shell launcher, and a bounded host-local plain-TCP terminal-host proof.
  • UserSession.auditContext fails closed after logout (same ensure_session_live guard as info()); run-ssh-public-key-session proves pre-logout success, post-logout failure, idempotent second logout, and continued closed state.

Not implemented:

  • encrypted SSH packet transport;
  • OpenSSH-compatible key exchange and channel handling;
  • full SSH userauth transcript validation;
  • channel binding;
  • TerminalSessionFromByteStream terminal-factory wiring;
  • OpenSSH harness.

Do not infer OpenSSH-compatible remote login from the current “partially implemented” status.

Remote and non-loopback deployment is blocked. The current proof uses development/fixture key material and host-local plaintext wiring for bounded authority checks; it is not a production SSH service. Before exposure beyond loopback, the implementation must have encrypted SSH transport, production host-key storage, durable authorized-key/account storage, full userauth transcript validation, channel binding, audit records for auth and shell launch, and a reviewed pre-auth/post-auth isolation story.

Problem

The Telnet Shell Demo described in Networking proves that a remote TCP connection can become a TerminalSession without granting the shell raw network authority. That is the right capability boundary, but Telnet is intentionally not a production remote access path. It has no encryption, no host authentication, no replay protection, no key-based user authentication, and no deployable security story beyond “host loopback in QEMU.” This proposal is the production remote-shell successor to that loopback-only research demo; the demo’s TerminalSession boundary survives, but its plaintext transport does not.

capOS needs a production-oriented CLI remote shell that works with normal SSH clients while avoiding the Unix mistake of treating an SSH login as a raw remote root shell, ambient user id, inherited file descriptor set, or global filesystem entry point.

The SSH path should be a terminal host and session authenticator. It should not become a general-purpose privilege broker, TCP proxy, process supervisor, or substitute for the native shell’s capability model.

Relationship To Telnet

SSH reuses the Telnet Shell Demo’s core contract – the same TerminalSession boundary Shell requires for any terminal-backed capos-shell, and the same broker-issued shell bundle Boot to Shell mints for a fresh session:

  • A gateway accepts TCP connections.
  • The gateway owns transport framing and terminal-host behavior.
  • The spawned capos-shell receives a cap named terminal implementing TerminalSession.
  • The shell receives the normal broker-issued shell bundle for the authenticated session.
  • The shell does not receive raw TcpSocket, NetworkManager, listener, broad process-spawn, private-key, authorized-key-store, or host-key authority.

The transport changes. Telnet handles plaintext option negotiation over a host-loopback QEMU forwarding rule. SSH handles version exchange, key exchange, host-key proof, encrypted packet framing, user authentication, session channels, PTY requests, window changes, shell requests, and clean channel teardown.

The security boundary does not change. The shell still sees only a terminal session and a scoped capability bundle.

SSH is not the only remote client model. It is the production terminal/CLI transport for operators who want an interactive shell. Programmatic clients should use the remote session CapSet path instead: authenticate through a session/admission method, receive a broker-issued remote CapSet view, and call provided capabilities over Cap’n Proto RPC without creating a shell. Public-key account records may feed both paths, but the authentication transcript bytes must be domain-separated by protocol and channel binding. See Remote Session CapSet Clients.

The first SSH implementation milestone is still host-local development. It should not silently inherit the Telnet demo’s trusted gateway compromise. Before implementation, the SSH path must either close the gateway authority gap with scoped listener and shell-only launcher grants, or explicitly preserve that gap in a task record as a host-local-only compromise while still proving that the spawned shell has no raw network, spawn, key, or SSH transport authority.

Pre-auth and post-auth shell flows must not share broad process/address-space authority for production exposure. Either split the authentication gateway and post-auth shell launcher into separate processes with narrow handoff caps, or produce a reviewable proof that the shared process cannot use pre-auth network, key, listener, or parser state as post-auth shell authority.

Scope

Initial SSH support is deliberately narrow:

  • SSH-2 only, following the RFC 4251-4254 family at the protocol level.
  • One interactive session channel per connection for the first proof.
  • pty-req, window-change, shell, EOF, close, and disconnect handling.
  • Public-key user authentication first.
  • Fresh random material for key exchange, rekey, padding, session identifiers, and authentication challenges comes from EntropySource or a narrowed SSH transport-crypto service that owns EntropySource; it is never ambient process state.
  • Password authentication only if it is wired to the existing CredentialStore failure/backoff path and policy explicitly enables it.
  • No port forwarding, agent forwarding, X11 forwarding, SFTP, SCP, subsystem requests, exec requests, direct TCP forwarding, or arbitrary environment import in the first milestone.

Those excluded SSH features are not harmless defaults. In capOS they require their own capabilities, policy, accounting, and audit records before exposure.

Components

flowchart TD
    Client[SSH client] -->|TCP 22| Gateway[SshGateway]
    Gateway --> HostKey[SshHostKey cap]
    Gateway --> Keys[AuthorizedKeyStore]
    Gateway --> Sessions[SessionManager]
    Gateway --> Broker[AuthorityBroker]
    Gateway --> Launcher[RestrictedShellLauncher]
    Gateway --> Listen[TcpListenAuthority]
    Gateway --> Audit[AuditLog]

    Keys --> Sessions
    Sessions --> Broker
    Broker --> Bundle[Scoped shell bundle]
    Gateway --> Terminal[SSH-backed TerminalSession]
    Launcher --> Shell[capos-shell]
    Terminal --> Shell
    Bundle --> Shell

SshGateway is the only component exposed to the network. It is an ordinary userspace service once the socket capability path can support it. During an early implementation it may wrap the same in-kernel TCP capabilities used by Telnet; a later decomposed-network stack should not change the shell contract. The schema-level gateway contract is intentionally small: status and shutdown methods identify the service surface without granting child shell authority.

SshHostKey is a sign-only private-key capability. It should be backed by the PrivateKey/KeyVault model from Cryptography and Key Management: the gateway can sign the SSH exchange hash but cannot export private key material, enumerate unrelated keys, or administer the vault.

AuthorizedKeyStore maps an SSH public key to a principal and authentication policy. It stores public key material and policy metadata, not shell authority. OpenSSH-format public keys are bytes imported into a verifier path, matching the crypto proposal’s PublicKeyFormat.opensshWire escape hatch for public material. The initial schema returns an SshAuthorizedKeyDecision with principal/profile metadata and an audit reason; actual shell authority still comes from SessionManager and AuthorityBroker.

TerminalSession is backed by the SSH channel. The gateway translates channel data, EOF, close, PTY mode, and window-size events into the terminal host contract. The schema names this construction surface SshTerminalFactory; it returns a result-cap index for the SSH-backed TerminalSession. Password prompts, hidden echo, cancellation, and teardown stay at that boundary.

TcpListenAuthority is the scoped listener grant shape for this milestone. It can mint only the configured TcpListener rather than exposing raw NetworkManager.createTcpListener for arbitrary ports.

RestrictedShellLauncher is narrower than the transitional RestrictedLauncher: it launches only the native shell against a supplied terminal/session context instead of accepting an arbitrary binary name. The current kernel source is manifest-declared as restricted_shell_launcher; it adds the child terminal, session, and stdio grants itself and accepts only named capability-sourced pass-through grants for the reviewed shell startup bundle (creds, sessions, audit, broker, and optional system_info). Before spawn it verifies the supplied UserSession profile matches the requested profile, and the focused proof shows the spawned native shell running under that supplied session.

Authority Model

The gateway receives only the capabilities required for its job:

  • TCP listen authority for the configured SSH port, preferably as a manifest-declared TcpListener handoff or scoped listener factory rather than raw NetworkManager.
  • Sign-only SshHostKey authority for configured host-key algorithms.
  • Narrow EntropySource authority, or an SshTransportCrypto cap that owns entropy and exposes only SSH key-exchange, rekey, cipher/MAC, and random padding operations.
  • Read or verify authority over AuthorizedKeyStore.
  • SessionManager authority to mint a session after successful SSH authentication.
  • AuthorityBroker authority to request the normal remote shell profile.
  • Restricted shell launch authority scoped to capos-shell.
  • Pass-through grants required by the current shell startup path, such as creds, sessions, audit, and broker, where policy permits them.
  • AuditLog append authority for connection, authentication, launch, and teardown records.

In the production-shaped authority model, it does not receive:

  • Broad ProcessSpawner authority.
  • Raw NetworkManager, outbound connectTcp, or an arbitrary listener factory.
  • Key export or KeyVault administrative authority.
  • Storage namespace authority except the narrow public-key records required by AuthorizedKeyStore.
  • SSH agent, port-forward, or subsystem authority unless later proposals add explicit caps for those surfaces.

A host-local development checkpoint may temporarily preserve raw NetworkManager, arbitrary listener factory, or broad ProcessSpawner authority in the gateway only if a task record captures the compromise and the harness proves it does not cross the shell boundary. The spawned shell must never receive raw NetworkManager, TcpListener, TcpSocket, ProcessSpawner, SSH transport, host-key, authorized-key-store, key-vault, or general-purpose entropy authority.

Identity metadata is not authority. A login name, SSH username, key fingerprint, source IP, principal id, or profile label only becomes useful after a trusted service returns a capability bundle.

Authentication

Host authentication

The host key should be a narrow wrapper around a PrivateKey cap, constrained to SSH host-key signing. Host keys are generated or imported through KeyVault, opened through an explicit SealPolicy, and rotated through a versioned host identity record. The gateway can sign the key exchange hash but cannot export private material.

SSH transport keys are separate from the host key. Key exchange must use fresh entropy and the algorithm policy selected for the deployment. The baseline standards are RFC 4251-4254; extension negotiation and modern algorithm recommendations come from later SSH RFCs such as RFC 8308, RFC 8709, RFC 9142, and other updates recorded by the RFC Editor for the 4251-4254 family. The first implementation should pin a small reviewed algorithm set rather than accepting every algorithm a library exposes.

For development, a manifest-seeded host key may be acceptable only when the manifest field, docs, and harness mark it as non-production. The current development path uses kernelParams.sshDevelopmentHostKey with the required label capos-development-only-ssh-host-key and the kernel source ssh_development_host_key; the resulting cap exposes only public metadata and signs bounded ssh-ed25519 exchange hashes with the manifest seed for QEMU proof. make run-ssh-host-key verifies the signature against the configured public key, proves wrong-algorithm denial, and checks that the development seed and raw signature are not printed to proof logs. For deployment, host keys need persistent storage, rotation policy, key-management-backed signing, and audit.

User public keys

Public-key login maps an accepted SSH public key to a principal record and authentication strength. The key record should include:

AuthorizedSshKey {
  keyId
  principalId
  publicKey
  algorithm
  fingerprint
  allowedProfiles
  sourcePolicy
  createdAtMs
  disabledAtMs
  comment
}

The current manifest-seeded prerequisites implement public key record loading, generic authorization decisions, and a bounded session-mint bridge. The AuthorizedKeyStore accepts ssh-ed25519 records with 32-byte public keys and SHA-256 fingerprints, rejects duplicate ids and fingerprints, maps principals to existing seed accounts, and denies disabled records. SessionManager accepts bounded fixture authentication bytes/signatures for configured keys and mints UserSession metadata with publicKey authentication strength; the focused make run-ssh-public-key-auth proof also shows AuthorityBroker denying a mismatched shell profile.

SessionManager.sshPublicKey consults the bootstrap RamAccountStore after signature verification using lookup_by_principal. Non-Active account statuses (Disabled, Locked, RecoveryOnly) and missing principals fail closed before a session is minted, so a runtime account-store mutation cannot be ignored by the SSH path even though authorized-key records carry their own disabledAtMs flag. The bootstrap fallback (no account store wired) keeps the seed-account validation contract: manifest validation guarantees every authorized-key principal binds to an active seed account. The run-ssh-public-key-session smoke also proves UserSession.auditContext returns principal metadata before logout and fails closed with ensure_session_live after explicit logout(), matching the same fail-closed contract as info().

Each denial path emits a stable auth= audit code (no schema variant change). The codes form the SSH gateway’s operator-visible audit contract: ssh-public-key for success, ssh-key-unknown, ssh-key-disabled, ssh-key-profile-not-allowed, ssh-bad-signature, ssh-account-missing, ssh-account-disabled, ssh-account-locked, ssh-account-recovery-only, ssh-account-lookup-failed, ssh-profile-kind-invalid, ssh-profile-not-interactive, ssh-auth-bytes-invalid. Failed records keep principal and profile blank by policy: the auth= code is the only discriminator, so failed-auth lines cannot be used as a side channel to probe for valid principal IDs.

This is still not a complete SSH public-key authentication exchange: no SSH transport transcript, channel binding, or terminal factory is wired end-to-end. A bounded plain-TCP terminal-host proof now reuses the configured key fixture to mint a public-key session and launch capos-shell through RestrictedShellLauncher, but that proof is not an encrypted SSH transport or OpenSSH userauth exchange. End-to-end QEMU proof of the ssh-account-disabled/ssh-account-locked paths requires an AccountStoreManagerCap kernel cap source so a demo can mutate account state at runtime; that is tracked in the local-users management backlog and is not required by the bounded host-local SSH gateway proofs.

Cloud metadata may seed initial authorized keys through the cloud-bootstrap path, but those keys are input to AuthorizedKeyStore, not ambient login authority. A metadata-provided key still needs an account/profile mapping and should be auditable as cloud-seeded material.

Passwords and step-up

Password authentication over SSH is optional and should be disabled unless CredentialStore can enforce the same generic failure text, bounded backoff, rate limits, and audit behavior as the local shell. Keyboard-interactive can later drive step-up prompts, but it should not be the first implementation unless a concrete policy needs it.

SSH Channel Policy

The first gateway accepts only session channels that request an interactive shell. It rejects:

  • exec requests.
  • subsystem requests such as SFTP.
  • agent forwarding.
  • TCP forwarding and reverse forwarding.
  • X11 forwarding.
  • environment variables except a small reviewed allow-list, if any.
  • more than one active shell channel per connection.

Each rejected request should produce an SSH protocol failure plus an audit record with a reason code. The audit record should not include command lines, environment dumps, key material, or terminal content.

The current bounded policy surface is capos-config::ssh_policy. It allows public-key auth, one session channel, PTY, window-change, and a first shell request. It denies disabled password auth, exec, subsystem/SFTP, direct TCP/IP, TCP/IP forwarding and cancellation, agent forwarding, X11 forwarding, environment import, second session-channel opens, and second shell channels. Password auth has no policy allow path in this proof; it stays denied until a real CredentialStore verifier, backoff, and audit path is wired into the gateway. Denials return only a protocol failure class and a stable audit reason code; request payloads such as command text and environment values are not part of the decision data.

Implementation Slices

The final OpenSSH proof should not land as one opaque SSH server commit. Keep the implementation reviewable by landing these slices in order:

  1. Version exchange. A bootable ssh-gateway service accepts one host-local OpenSSH TCP connection, exchanges RFC 4253 identification strings, records only sanitized client software/version metadata, and disconnects before key exchange without launching a shell. The compatibility harness uses /usr/bin/ssh; malformed and overlong client identification strings are covered by a separate low-level hostile TCP/banner fixture.
  2. KEXINIT and algorithm selection. Parse KEXINIT, select exactly one reviewed development algorithm set, and disconnect on unsupported algorithms. Algorithm names are transport policy inputs, not authority.
  3. Development key exchange. Complete the host-local encrypted transport by deriving traffic keys from the negotiated KEX shared secret, exchange hash, and session id per RFC 4253. Entropy supplies ephemeral KEX material, padding, and challenges, not direct session-key bytes. Call SshHostKey.signExchangeHash and prove no private host-key or raw entropy material reaches logs or child shell grants.
  4. Public-key userauth. Bind the OpenSSH public-key userauth transcript to SessionManager.sshPublicKey, accept the configured key, deny unknown keys generically, and keep password auth disabled until a real verifier/backoff path is wired.
  5. Channel policy. Route session open, PTY, window-change, shell, exec, subsystem, forwarding, agent, X11, environment, and second-channel requests through capos-config::ssh_policy, producing protocol-visible failures and sanitized audit reason codes for denied features.
  6. SSH-backed terminal launch. Replace the plain-TCP terminal-host proof with an SSH channel-backed TerminalSession, launch capos-shell through RestrictedShellLauncher, run session, caps, and exit via OpenSSH, and prove cleanup for both client disconnect and shell exit.

Resource And Teardown Rules

SSH exposes several resource boundaries before the shell even starts: handshake CPU, pending connections, packet buffers, channels, PTY state, terminal buffers, authentication attempts, and live shell processes.

The gateway must have fixed per-connection bounds and fail closed when they are exceeded. Disconnect, TCP close, SSH channel close, failed authentication, session expiration, shell exit, and gateway teardown must all release the same resources:

  • accepted socket,
  • SSH connection state,
  • terminal session object,
  • spawned shell handle,
  • broker-issued grants,
  • authentication challenge state,
  • audit correlation record.

Shell exit should close the SSH channel. Client disconnect should close the terminal and let the shell observe the normal TerminalSession close path.

Exit Criteria

The first SSH milestone is complete when:

  • SshGateway, host-key, authorized-key, and SSH-backed terminal contracts are documented in schema/design form.
  • The development host-key path is available only through an explicitly non-production manifest field and a narrow SshHostKey cap; production signing remains blocked on key management and persistent storage.
  • A manifest can start an SSH gateway with only scoped TCP listen, host-key, authorized-key, session, broker, audit, and restricted shell-launch grants, or the remaining host-local demo compromise is explicitly preserved in a task record.
  • The gateway accepts a normal OpenSSH client on a host-local QEMU forwarded port, authenticates one public key, spawns capos-shell with a TerminalSession, runs one command, and disconnects cleanly.
  • The harness proves denied password login when disabled, denied port forwarding, denied subsystem requests, rejected unknown keys, and cleanup after client disconnect.
  • The harness proves unavailable entropy or disabled KEX algorithms fail closed before authentication or shell launch.
  • Documentation states which parts are development-only and which are acceptable for production deployment.

Dependencies

  • Telnet Shell Demo from Networking for the socket-backed TerminalSession proof this gateway succeeds.
  • TerminalSessionFromByteStream as a shared prerequisite for SSH channel and TLS/mTLS-backed remote terminals. SSH channel data is not a connected TcpSocket; it must enter the same terminal factory used by Telnet-over-TLS – whose certificate, trust store, ACME, and pinning model lives in Certificates and TLS – so line discipline, echo policy, IAC handling where relevant, close semantics, and hidden password behavior do not fork by transport.
  • Cryptography and key-management primitives for sign-only host keys.
  • EntropySource or a narrowed SSH transport-crypto service for key exchange, rekey, packet padding, and challenge freshness.
  • User identity, account, and session policy records for AuthorizedKeyStore principal/profile mapping.
  • System-monitoring audit records for remote authentication, denied SSH features, launch decisions, and teardown.
  • Resource accounting for connection, channel, and shell-process limits.
  • Persistent storage before production host keys and authorized keys can survive reboot safely.

Remote-shell ingress should land in this order:

  1. TerminalSessionFromByteStream and shared terminal line/echo/hidden-input discipline.
  2. A transport-neutral byte-stream terminal factory used by both SSH channel data and TLS/mTLS cleartext byte streams.
  3. Either Telnet-over-TLS or SSH may land first, but neither should fork terminal semantics.
  4. Production deployment profile chooses SSH for familiar operator CLI access and TLS/mTLS, configured through Certificates and TLS, for PKI-integrated service/operator environments.

No more SSH terminal transport work should land until the shared prerequisite exists and has proof coverage for byte-identical hidden password behavior, line/IAC factoring, and repeated close/reconnect behavior.

Grounding

This proposal relies on these in-tree design documents and research notes:

  • Networking for the Telnet Shell Demo this gateway succeeds and the TCP capability path the SSH listener reuses.
  • Shell for the native capos-shell and the TerminalSession boundary every remote-shell transport must preserve.
  • Boot to Shell for CredentialStore, SessionManager, AuthorityBroker, RestrictedShellLauncher, and EntropySource, including the bounded SSH terminal-host proof that already lands inside that flow.
  • Certificates, TLS, and Certificate Transparency for the TLS/mTLS counterpart transport profile and the shared certificate, trust-store, and pinning model the Telnet-over-TLS factory consumes.
  • Cryptography and Key Management for PrivateKey, PublicKeyFormat.opensshWire, KeyVault, and SealPolicy.
  • User Identity and Policy for principal/account/session/profile semantics.
  • Resource Accounting and Quotas for listener, socket, channel, packet-buffer, and shell-process bounds.
  • System Monitoring for audit record shape and retention boundaries.
  • Storage and Naming for the capability-native storage model needed before production host keys and authorized keys become durable.
  • Trust Boundaries for remote-shell ingress review criteria.
  • Local Users Management Backlog for account, role, and RAM-store sequencing that feeds authorized-key principal mapping.
  • Genode Research for the session-factory precedent: clients request narrowed sessions from authority-bearing components instead of receiving broad factories directly.
  • Pingora Research for the listener/service/runtime split that informs keeping TCP listener setup separate from application shell authority.

External standards grounding starts from RFC 4251, RFC 4252, RFC 4253, and RFC 4254. Later SSH algorithm and extension updates, including RFC 8308, RFC 8709, and RFC 9142, must be checked when choosing the implementation’s accepted algorithm set.

Non-Goals

  • Replacing the native shell with a POSIX shell.
  • Treating SSH username or Unix UID as authority.
  • Ambient home directories, inherited file descriptors, or global paths.
  • SSH agent forwarding as a shortcut to key authority.
  • SFTP/SCP as a storage API before scoped file/storage capabilities exist.
  • Port forwarding before explicit network-proxy capabilities and policy exist.

Proposal: Telnet over TLS Shell

An optional remote-shell path that wraps the existing Telnet TerminalSession handoff in TLS 1.3. It is not the default production access interface and should not be prioritized ahead of SSH Shell Gateway for operator CLI access or WebShellGateway for browser/agent workflows. Its value is narrower: service terminals, compatibility environments that already standardize on TLS client certificates, and deployments that want a small terminal protocol over the project’s certificate/TLS stack.

This proposal sits on three load-bearing siblings and should be read with them: Networking for the socket capability surface (now served by the userspace network stack; the kernel SocketTerminalSession shim and the kernel socket owner behind it are retired), the host-loopback exposure rule, and the trust-boundary-debt paragraph this proposal must not extend; Certificates and TLS for TlsServerConfig, CertVerifier, TrustStore, CertificateStore.watch, Issuer/ACME, and rotation-without-restart; and SSH Shell Gateway for the shared TerminalSession-factory contract, the RestrictedShellLauncher/SessionManager/AuthorityBroker boundary, and the staged “transport verifies, SessionManager authorizes” split this proposal’s mTLS path mirrors. None of those siblings depends on this proposal; this proposal depends on all three.

Why Both, And Why Not Just SSH

The networking proposal and docs/status.md correctly call out plaintext Telnet on 127.0.0.1:2323 as a loopback-only research demo and name the SSH Shell Gateway as the production remote-shell successor to that demo. That comparison is between SSH and plaintext Telnet, not between SSH and TLS-wrapped Telnet. Once TLS is in the picture the operational tradeoffs change, and capOS has good reasons to expose both paths in production.

AspectSSH Shell GatewayTelnet over TLS
TransportBespoke SSH-2 binary packet protocol; the gateway parses KEXINIT, channel-open, requests, etc.TLS 1.3 record layer plus a Telnet IAC stream consumed by the cleartext-pair TerminalSession factory through a shared state-machine module (the retired kernel SocketTerminalSession previously played this role for plaintext TCP).
Protocol surface inside capOSWhole SSH message set must be parsed and reviewed even when only one channel is allowed; algorithm-policy parser; channel/forwarding/agent/X11/subsystem reject paths.TLS handshake (rustls or equivalent) + the shared Telnet IAC state machine already implemented for the plaintext path. The gateway itself is mostly handshake plumbing and does not parse IAC.
User authPublic-key built into the protocol (SshHostKey, AuthorizedKeyStore, SessionManager.sshPublicKey). Password optional and gated.Two paths: passwords through the local CredentialStore flow, or mTLS client certificates verified against a TrustStore and mapped through SessionManager.tlsClientCert.
Identity modelSSH key fingerprints, principal records keyed off public-key bytes, custom rotation/audit story.X.509 with subjects/SANs and the project’s existing certs/TLS proposal: ACME issuance, CT, OCSP, CertificateStore.watch, mTLS, pinning, name constraints. Rotation and revocation share infrastructure with everything else TLS.
Ecosystem leverageNew SSH-only operational track: authorized_keys, host-key custody, fingerprint pinning, key rotation tooling.Reuses the PKI/ACME track that capOS already needs for cloud KMS HTTPS, the web-shell gateway, mTLS between services, monitoring egress, OIDC, and audit.
Client populationOpenSSH and friends; familiar to operators.Standard TLS-capable telnet clients (telnets:// on port 992) and openssl s_client piping into telnet’s IAC discipline; clients are scarcer than for SSH but tooling exists.
ComposabilityOne protocol; one auth model.Transport identity (server cert, optional client cert) is orthogonal to user auth; deployments can layer mTLS-issued client identity and a passkey/OIDC step-up if they want, without inventing protocol extensions.
Best fitOperator CLI access where SSH client tooling and public-key login are the right defaults.Workload-to-workload terminal access between services that already speak TLS, deployments that prefer corporate-CA client certs for identity, browser/web bridges that already terminate TLS, and any environment where minimising new protocol surface is a security goal.

The paths are complementary, but not equal priorities. SSH remains the main operator CLI target; WebShell remains the main browser/agent target. A production capOS deployment can expose Telnet over TLS when its operational stack wants certificate-based terminal access, but the roadmap should treat it as an optional follow-up after certificate/TLS, durable identity, session lifecycle, audit, and listener-authority work are already credible.

Scope

The first milestone is deliberately narrow but production-shaped from day one — there is no separate “demo-only” gate after which the proposal pivots to a different cert custody story:

  • TLS 1.3 only. Implicit-TLS variant: TLS handshake first, then a normal Telnet byte stream over the established TLS record layer. IANA-registered port 992 (“telnets”) is the default; deployments pick their own port via the manifest.
  • One interactive TerminalSession per connection.
  • Server-side TLS always; mTLS client certificates as the recommended user-auth path, with passwords through CredentialStore as the fallback for deployments that have not provisioned client certs yet.
  • Algorithm policy is a single reviewed set: one ECDHE group (x25519), one signature algorithm pair (ed25519 leaf, ed25519 or ecdsa_p256 root acceptable), one AEAD cipher suite (TLS_CHACHA20_POLY1305_SHA256 or TLS_AES_128_GCM_SHA256). No downgrade negotiation surface, no TLS 1.2 fallback.
  • Cert and key custody routes through the cert proposal’s KeyVault/CertificateStore/CertVerifier/TrustStore caps from the start. The QEMU smoke uses a manifest-seeded development leaf exactly the same way ACME issuance would: through KeyVault import and CertificateStore.put. There is no separate dev-only signing surface that production has to retire.
  • Telnet IAC handling is owned by the TerminalSession implementation that receives the byte stream — the cleartext- byte-pair TerminalSession factory. IAC handling is a userspace concern in every path: the kernel SocketTerminalSession that once terminated the plaintext-TCP path is retired. The TLS gateway terminates TLS, hands a cleartext byte pair to the factory, and itself parses no IAC. See The Userspace-Cleartext TerminalSession Factory for the ownership rule.

Out of scope (with reasons recorded in Considered Alternatives where the question is liable to come up again):

  • STARTTLS-via-Telnet-options upgrade.
  • TLS 1.2, RSA key exchange, non-AEAD ciphers, compression, externally provided session-ticket keys, multi-cipher policy.
  • SSH-style channel multiplexing, port forwarding, agent forwarding, X11, subsystem requests.
  • In-kernel TLS termination.

Components

flowchart TD
    Client[telnets:// client / openssl s_client] -->|TCP + TLS 1.3| Gateway[telnet-tls-gateway]
    Gateway --> Listen[TcpListenAuthority badge 992]
    Gateway --> TlsCfg[TlsServerConfig cap]
    TlsCfg --> Key[PrivateKey sign-only]
    TlsCfg --> ChainSrc[CertificateStore.watch]
    TlsCfg --> Verifier[Optional CertVerifier + TrustStore for mTLS]
    Verifier --> ClientTrust[TrustStore for client CA]
    Gateway --> Sessions[SessionManager]
    Gateway --> Broker[AuthorityBroker]
    Gateway --> Launcher[RestrictedShellLauncher]
    Gateway --> Audit[AuditLog]
    Gateway --> Factory[TerminalSessionFromByteStream]

    Factory --> Term[Cleartext-backed TerminalSession]
    Launcher --> Shell[capos-shell]
    Term --> Shell
    Sessions --> Broker
    Broker --> Bundle[Scoped shell bundle]
    Bundle --> Shell

The shape mirrors the SSH proposal one-for-one. Only the transport authority changes.

telnet-tls-gateway is the only network-facing component. It owns:

  • The TCP listener, acquired through a manifest-declared TcpListenAuthority whose badge is the configured TLS port (992 in the default manifest).
  • TLS 1.3 server-side handshake and record layer, composed from a TlsServerConfig cap. The gateway never sees the raw PrivateKey; the TLS config encapsulates sign-only key authority, the certificate-chain source, the optional client-cert verifier, and the algorithm policy.
  • The cleartext byte pair (read half + write half) produced by the TLS layer, immediately handed to TerminalSessionFromByteStream after the handshake completes. The gateway implements no Telnet, echo, line-discipline, or terminal logic.

TlsServerConfig, TrustStore, CertVerifier, CertificateStore.watch, KeyVault, SealPolicy, EntropySource are not defined here. They are the caps the Certificates and TLS and Cryptography and Key Management proposals already specify.

RestrictedShellLauncher and the broker/session/credential plumbing are unchanged from the plaintext demo and the SSH proposal. The spawned capos-shell receives only terminal, child-local stdio, and the broker-issued shell bundle (session, creds, sessions, audit, broker, optional shell_config). It does not see TLS, certificate, trust-store, key, listener, raw socket, or gateway-protocol authority.

The Userspace-Cleartext TerminalSession Factory

The retired plaintext demo terminated the byte stream inside the kernel: TcpSocket.intoTerminalSession consumed a connected accepted socket and returned a move-only TerminalSession backed by the kernel SocketTerminalSession cooked-mode shim (line discipline, password echo policy, CRLF normalization). That shim is removed: the kernel socket owner behind it was retired by the userspace network-stack migration, and TcpSocket.intoTerminalSession now fails closed in every dispatch path. There is no kernel-terminated terminal byte stream any more; a network-backed TerminalSession must be constructed in userspace.

The kernel model never extended to TLS anyway. TLS termination must be userspace — adding rustls to the kernel would be a substantial expansion of the in-kernel networking surface, exactly the expansion the networking proposal’s “trust-boundary debt” paragraph forbids. The kernel TCP path was acceptable because the bytes already crossed that boundary; TLS records do not.

This means TLS-backed remote shells need a different TerminalSession construction surface that consumes a userspace-owned bidirectional cleartext byte pair. Sketch:

interface ByteStreamPair {
    inbound  @0 () -> (rx :ByteStream);
    outbound @1 () -> (tx :ByteStream);
    closeHint @2 () -> (hint :CloseHint);
}

interface TerminalSessionFromByteStream {
    wrap @0 (pair :ByteStreamPair, options :TerminalLineOptions)
        -> (term :TerminalSession);
}

Line discipline (cooked vs raw, password echo policy, paste handling, CRLF state) belongs inside the implementation of wrap, not in the gateway. The implementation must:

  • Preserve the same LineEcho::Hidden semantics the retired kernel SocketTerminalSession enforced (the cooked-mode line discipline survives as host-tested capos_lib::line_discipline), including the fix history captured in the Telnet IAC handoff commits.
  • Keep the spawned shell’s view of TerminalSession byte-identical to the UART terminal path. The shell must not need to care about the transport.
  • Treat partial reads, partial writes, peer close, and TLS close_notify as ordinary TerminalSession close events, not transport-specific errors leaking to the shell.
  • Own Telnet IAC handling for the cleartext byte pair. IAC ownership is wholly a userspace concern: no kernel component terminates a network byte stream any more, and the cleartext bytes reach the terminal only through this factory. The IAC state machine (option negotiation, suppress-go-ahead, echo policy, the NUL-prefixed-password and staircase-output fix history) belongs in a shared userspace module the gateways and the cleartext-pair factory call into, so neither path forks the byte rules and no IAC parsing returns to the kernel.

This factory is also what the SSH Shell Gateway needs: SSH channel-backed terminals are not connected TcpSockets either. This proposal therefore defines the surface in a transport-neutral way. Whichever of SSH or Telnet-over-TLS lands first will deliver TerminalSessionFromByteStream, and the other reuses it.

Authority Model

The gateway receives only the capabilities required for its job:

  • TcpListenAuthority whose badge is the configured TLS port. Mints exactly one TcpListener for that port and nothing else; raw NetworkManager.createTcpListener is not granted.
  • TlsServerConfig for TLS server-side handshake. Not the underlying PrivateKey, KeyVault, CertificateStore administrative surface, or TrustStore mutation.
  • EntropySource, or a narrowed TlsTransportCrypto cap that owns entropy and exposes only TLS handshake, record-layer, rekey, and random-padding operations. Random material for handshake nonces, key derivation, and record nonces never comes from ambient process state.
  • TerminalSessionFromByteStream for the cleartext-backed terminal.
  • SessionManager to mint a session at handoff: anonymous in the password path, tlsClientCert in the mTLS path.
  • AuthorityBroker to request the normal shell bundle profile.
  • RestrictedShellLauncher to spawn capos-shell with the supplied session and the reviewed pass-through grants only.
  • AuditLog append authority for connection, handshake outcome, authentication outcome, shell launch, and teardown records. Audit records carry stable reason codes; they do not carry private key material, certificate private parts, raw entropy, decrypted password bytes, or terminal content.

It explicitly does not receive:

  • Raw NetworkManager, raw TcpListener factories beyond the configured port, outbound connectTcp, or any UDP/ICMP authority.
  • Raw PrivateKey access, KeyVault administration, key generation, key export, or certificate issuance.
  • CertificateStore mutation or trust-store administration. The gateway consumes a TrustStore for client-cert verification; it cannot add or remove anchors.
  • Broad ProcessSpawner authority. Shell launch goes through RestrictedShellLauncher only.
  • CredentialStore authority, and no parsing, logging, audit, or storage authority for credential bytes. The gateway necessarily has plaintext password bytes in its TLS-record and cleartext-pair buffers while a record is being consumed (see the password fallback section’s TCB note); it does not run CredentialStore verification, does not interpret those bytes as credentials, and does not retain them. capos-shell handles login exactly as on the local console.
  • Any kernel-internal or system-wide TerminalSession factory beyond the cleartext-byte-pair construction surface.

The spawned shell does not gain TLS, certificate, trust-store, key, network, listener, raw socket, or gateway-protocol authority. The boundary the plaintext demo proves with caps is preserved verbatim.

Authentication

Server identity

Server identity is asserted through the leaf certificate carried by TlsServerConfig. Custody routes through the cert/key proposals from the start:

  • The leaf private key is a KeyVault-backed sign-only PrivateKey cap under an explicit SealPolicy allowing only TLS server-side signing.
  • The leaf chain is produced through whichever issuance path the deployment uses: ACME for internet-facing endpoints, manifest-issued for development and air-gapped, internal-CA-issued for corporate fleets. The cert proposal’s Issuer/Acme interfaces are the source of truth.
  • Rotation lands through CertificateStore.watch. TlsServerConfig re-derives its leaf for the next handshake; existing TLS sessions finish on the old chain. No gateway restart, no SIGHUP, no filesystem signaling.

The QEMU development manifest seeds a leaf and key through the same cap surface — the cert is imported into KeyVault and CertificateStore, not exposed through a parallel “dev-only” signing cap. Smoke harnesses pin the development leaf by SHA-256 SPKI; deploys pin or trust through their normal trust path.

The recommended production user-auth path is mTLS:

  1. TlsServerConfig.clientVerifier returns a CertVerifier plus a TrustStore of acceptable client CAs, scoped to the deployment.
  2. The TLS handshake requires a client certificate. The gateway verifies it through CertVerifier.verifyChain against the client-CA TrustStore, with name constraints, EKU (clientAuth), and revocation status enforced by the verifier policy.
  3. On success, the gateway hands the verified leaf to SessionManager.tlsClientCert (a new mint path mirroring sshPublicKey). The session manager maps subject/SAN/fingerprint to a principal record and allowed shell profiles, and mints a UserSession with tlsClientCert authentication strength.
  4. AuthorityBroker issues the shell bundle for the matched profile; RestrictedShellLauncher spawns capos-shell with that bundle and the cleartext-backed terminal.

The session manager’s mapping is intentionally explicit. A verified client cert proves “this private key signed this handshake,” not “this is user X.” Mapping subject/SAN to a principal is a separate authorization step that lives in SessionManager, exactly as AuthorizedKeyStore does for SSH public keys. Anonymous holders of a trusted cert do not silently become privileged accounts.

mTLS user auth fails closed without ever reaching the shell. The failure path is staged so transport verification and authorization stay distinct, mirroring how SSH AuthorizedKeyStore and SessionManager.sshPublicKey separate “key signature is valid” from “key maps to a principal”:

  • A client cert that fails the TLS trust path — untrusted issuer, expired, revoked, signature invalid, name constraint violation, missing clientAuth EKU — ends with a TLS handshake alert. No authorization step runs and SessionManager.tlsClientCert is never called.
  • A client cert that successfully verifies through the configured CertVerifier but maps to no principal record causes a SessionManager.tlsClientCert deny with a sanitized audit reason code, before any shell launch. Verified-but-unmapped certs are an authorization failure, not a transport failure, and must not be collapsed into the TLS alert above.
  • A profile mismatch between the requested shell bundle and the mapped principal’s allowed profiles causes an AuthorityBroker deny, again before launch.

User authentication: password fallback

Deployments that have not yet provisioned client certs use the existing local-shell path:

  1. The TLS handshake completes with no client certificate (or with a client cert that the deployment has explicitly marked “transport-only”), and the gateway mints an anonymous session.

  2. RestrictedShellLauncher spawns capos-shell, which prints login: and runs login/setup against CredentialStore with the same generic-failure / bounded-backoff / audit policy used on the local UART console.

  3. Password bytes are LineEcho::Hidden input through the terminal session. The gateway implements no Telnet, line-discipline, or credential parsing of plaintext beyond moving bytes between the TLS record layer and the cleartext byte pair, and never logs password bytes or includes them in audit records or proof transcripts.

    Plaintext password bytes do exist in gateway-mapped TLS record-layer buffers and in the cleartext byte pair while the record is being consumed; that is unavoidable for any in-process TLS terminator and must be acknowledged honestly. The gateway is therefore part of the password-fallback TCB, comparable to the way the retired kernel SocketTerminalSession was part of the plaintext demo’s TCB. The mTLS path is preferred precisely because it does not put password bytes on the wire or through the gateway in the first place.

This is weaker than mTLS but the trust boundary is no larger than the local console: the kernel TCB plus one terminator-shaped component (the gateway here, the kernel UART TerminalSession for the local console). It exists so deployments can ship Telnet-over-TLS before completing client-cert provisioning, not as a recommended end state.

Step-up paths (future)

Deployments may want to combine transport-level identity (mTLS) with an additional human factor (passkey, OIDC, TOTP). Step-up is the shell’s responsibility, not the gateway’s: capos-shell gains a stepUp command in a separate proposal, the gateway does not short-circuit it. Treating mTLS plus passkey as orthogonal layers is one of the reasons this path exists alongside SSH at all.

Considered Alternatives

STARTTLS via Telnet options

Rejected. Three reasons, in decreasing order of weight:

  1. No mainline client support. Generic Telnet+STARTTLS has no IETF-standardised binding. RFC 2941 (Telnet Authentication Option) and RFC 2946 (Telnet Data Encryption Option) are generic frameworks; the only concrete TLS binding lives in TN3270E (mainframe terminal emulators such as x3270, IBM Personal Communications, and Vista TN3270). BSD/netkit telnet — the standard Linux client and the one capOS already harnesses — does not speak it. GNU inetutils telnet, the Microsoft Windows telnet client, and PuTTY do not speak it. Targeting STARTTLS would commit capOS to a TN3270E-shaped client population it has no reason to address, while excluding the implicit-TLS clients that do exist (telnets://, openssl s_client, modern TLS-capable telnet implementations).
  2. Pre-handshake plaintext window. STARTTLS requires plaintext IAC option exchange before TLS. That window leaks client identity, supports active downgrade attacks (server claims STARTTLS support is unavailable, expecting cleartext fallback), and complicates audit (where does “I refused to start TLS” log, and how does the server distinguish a legitimate non-TLS client from a downgrade attempt?).
  3. Forces pre-handshake protocol parsing into the gateway. STARTTLS requires the gateway to parse Telnet IAC before any TLS protection exists, complete with its own state machine to detect the STARTTLS option and decide whether to invoke TLS — protocol surface on unauthenticated cleartext bytes that the implicit-TLS-from-byte-zero design never exposes.

If a future deployment specifically needs TN3270E-style STARTTLS for mainframe interoperation, it is a separate proposal with its own authority model — not a generalisation of this one.

In-kernel TLS termination

Rejected. The networking proposal’s “trust-boundary debt” paragraph explicitly forbids expanding kernel-side networking surface for its own sake. TLS termination is large, well-served by rustls in userspace, and gains nothing by living in the kernel.

A single “remote shell” proposal covering both SSH and TLS

Rejected. The two paths share a TerminalSession factory and the broker/session/launcher plumbing, but their transport, key custody, client population, and user-auth ergonomics differ enough that collapsing them produces a worse design document. They are described separately, sized separately, and can be implemented and audited independently.

Implementation Slices

Slices land in this order. None is a single opaque commit. Slice 1 is shared with the SSH gateway and may be delivered by either project.

  1. Userspace cleartext-byte-pair TerminalSession factory. Define ByteStreamPair, TerminalSessionFromByteStream.wrap, and TerminalLineOptions. Implement against a plaintext userspace byte pair first, with no TLS in the loop. Build the line discipline on the shared host-tested capos_lib::line_discipline module (the cooked-mode core the retired kernel SocketTerminalSession used) plus a userspace IAC state-machine module, producing byte-identical output to the UART terminal for echo policy, hidden password, CRLF state, and peer close. Either project (this proposal or the SSH gateway) may deliver this slice; both projects depend on it.

    No SSH or TLS terminal transport slice should proceed past fixture work until this factory exists, IAC/line discipline is factored, hidden password behavior is byte-identical to the raw TCP terminal, and repeated close/reconnect proofs pass.

  2. TlsServerConfig consumption with development leaf. Wire a manifest-seeded leaf into KeyVault and CertificateStore, compose TlsServerConfig with the reviewed algorithm policy, and add make run-telnet-tls-config proving the cap signs handshake transcripts, refuses non-allow-listed algorithms, and never exposes private key bytes in proof logs. The dev path uses the same caps as production; only the issuance source differs (manifest import vs ACME / internal CA).

  3. telnet-tls-gateway service, password path. Boot the userspace gateway against a scoped TcpListenAuthority for port 992, terminate one host-loopback TLS 1.3 handshake with openssl s_client, write the cleartext byte pair into the factory, and run a logincapsexit transcript through the existing CredentialStore flow. Prove the “Service Liveness” rule with repeated connections.

  4. mTLS user auth. Add SessionManager.tlsClientCert, define AuthorizedTlsClient records (subject/SAN/fingerprint → principal/profile mapping), wire TlsServerConfig.clientVerifier, and prove the four staged states with the trust-path/authorize distinction the mTLS auth section already requires: trust-path failure such as untrusted issuer, expired, revoked, signature invalid, name-constraint violation, or missing clientAuth EKU (TLS handshake alert, no SessionManager call); verified-but- unmapped cert (SessionManager.tlsClientCert deny pre-launch with sanitized audit reason); verified+mapped cert with profile mismatch (AuthorityBroker deny); accepted cert (UserSession with tlsClientCert strength reaches the shell, caps confirms boundary).

  5. Production custody path. Replace the manifest-seeded leaf with an ACME-issued or internal-CA-issued chain through the cert proposal’s Issuer interface. Prove rotation through CertificateStore.watch lands without restart and without breaking in-flight sessions.

  6. system-telnet-tls.cue, make run-telnet-tls, and the host harness. Default the manifest to mTLS-required with a fallback passwordOnly knob, add cleanup proofs for client disconnect, server close_notify, and shell exit, and update the topic indexes, sidebar, and docs/tasks/README.md when the slice lands.

Each slice keeps the kernel networking surface untouched. New TLS state lives in the userspace gateway; new line-discipline state, if any, stays inside the TerminalSession factory’s implementation.

Resource And Teardown Rules

The gateway must enforce fixed per-connection bounds and fail closed when they are exceeded. Disconnect, TCP close, TLS close_notify, failed handshake, terminal-factory error, shell exit, and gateway shutdown must all release the same resources:

  • accepted socket,
  • TLS connection state (handshake buffers, key schedule, record-layer buffers),
  • cleartext byte pair,
  • TerminalSession object,
  • spawned shell handle,
  • broker-issued grants,
  • audit correlation record.

Shell exit closes the cleartext byte pair, which closes the TLS layer, which closes the TCP socket. Client disconnect or TLS close_notify closes the TLS layer, which closes the byte pair, which the shell observes as a normal TerminalSession close. There is no privileged “tear down everything” path that bypasses the byte-pair lifecycle.

The accept loop applies the same shape as the post-7a155f4 plaintext gateway: per-connection failures (handshake error, factory error, launch error, shell wait error) are log-and-continue events; setup-time failures (listener creation, broker bootstrap, TLS config acquisition) and accept itself remain fail-closed. The “Service Liveness” review rule applies verbatim.

Threat Model And Honest Limits

What Telnet-over-TLS gives, with TLS 1.3 + AEAD + ECDHE + deployment-issued or pinned-development leaf:

  • Confidentiality and integrity against passive and active network observers.
  • Forward secrecy of session bytes after the connection ends.
  • Per-session randomness (replay protection) from the TLS handshake.
  • Server identity assertion as good as the deployment’s trust path: ACME-issued public chain, corporate-CA chain, or SPKI pinning in the QEMU smoke.
  • With mTLS: cryptographic client identity tied to PKI, with rotation and revocation on the same operational track as the rest of the deployment’s TLS estate.

What it does not give:

  • SSH-style channel multiplexing, exec, port forwarding, agent forwarding, subsystem requests. These are explicit non-goals; if they are needed, the SSH gateway is the right path.
  • Resistance against an attacker who can replace the deployment’s trust path on the client side. SPKI pinning in the harness mitigates this for the QEMU smoke; deployments must use a real trust anchor.
  • Stronger user auth than the deployment provisioned. mTLS without principal mapping is just transport; password fallback without step-up is just CredentialStore. The gateway does not synthesise authority it was not given.

This proposal does not claim Telnet-over-TLS is “as secure as SSH” or “more secure than SSH.” It is a different protocol with a different operational profile and a smaller surface to review. Whether that profile suits a given deployment is an operational decision, not a default.

Dependencies

  • Networking for the socket capability surface served by the userspace network stack, the host-loopback exposure rule, and the trust-boundary-debt paragraph this proposal must not extend.
  • Certificates and TLS for TlsServerConfig, Certificate, CertificateChain, TrustStore, CertVerifier, CertificateStore.watch, Issuer/ACME, algorithm policy, and CT/OCSP plumbing.
  • Cryptography and Key Management for sign-only PrivateKey, KeyVault, SealPolicy, and EntropySource (or a narrowed TlsTransportCrypto cap).
  • Shell for the TerminalSession boundary and the rule that remote text transports do not turn the shell into a raw byte-stream consumer.
  • Boot to Shell for CredentialStore, SessionManager, AuthorityBroker, and the login/setup flow the password fallback path reuses.
  • SSH Shell Gateway for the parallel TerminalSession factory requirement and the TcpListenAuthority/RestrictedShellLauncher/SessionManager conventions to mirror.
  • User Identity and Policy for principal/account/session/profile semantics shared by password and mTLS paths.
  • Resource Accounting and Quotas for listener, socket, handshake-buffer, key-schedule, terminal, and shell-process bounds.
  • System Monitoring for audit record shape and retention boundaries.
  • Storage and Naming for the capability-native storage path that production leaf certs and client-cert principal records become durable through.

External standards grounding:

  • IANA Service Name and Transport Protocol Port Number Registry — telnets on TCP/992 (the implicit-TLS variant the default manifest binds).
  • RFC 8446 (TLS 1.3). Older TLS RFCs are listed only to document why they are explicitly out of scope.
  • RFC 854/855/856/857/858 (Telnet, option negotiation, binary, suppress-go-ahead, echo) for the upper protocol the kernel IAC filter already implements.
  • RFC 5280 (X.509 PKI) and RFC 8555 (ACME) for the certificate chain and issuance paths.
  • RFC 2941 / RFC 2946 cited only as the explicitly-rejected STARTTLS-style alternative (see Considered Alternatives).

Grounding

In-tree project docs read or re-read while shaping this proposal:

  • Networking for the Phase A/B/C boundaries, the TcpListenAuthority shape, the kernel-side IAC filter, the post-7a155f4 IAC handoff fix, and the trust-boundary-debt rule against expanding kernel networking surface.
  • SSH Shell Gateway for the RestrictedShellLauncher / SessionManager / AuthorityBroker / scoped-listener pattern and the staged transport-verify-then-authorize separation that the mTLS path now mirrors.
  • Certificates and TLS for TlsServerConfig, CertVerifier, TrustStore, CertificateStore.watch, Issuer/ACME, and the rotation-without- restart rule the production-custody slice depends on.
  • Cryptography and Key Management for KeyVault, SealPolicy, sign-only PrivateKey, and EntropySource shape.
  • Shell for the TerminalSession boundary and the rule that remote text transports do not become raw ByteStream/StdIO substitutes.
  • Boot to Shell for the login/setup flow the password fallback reuses and the CredentialStore failure/backoff/audit policy.
  • REVIEW.md for the Service Liveness rule applied to the gateway accept loop, the design-grounding requirement that produced this section, and the proposal-doc shape (status header, last-reviewed timestamp with timezone, relative links).

docs/research/ files read for prior-art grounding:

  • Genode for the session-factory precedent: clients receive narrowed sessions from authority-bearing components rather than holding a broad factory themselves. The TerminalSessionFromByteStream / gateway split follows that pattern, exactly as the SSH proposal does.
  • Pingora for the listener / TLS-termination / service split that informs keeping the TcpListener, the TLS terminator, and the application-shaped shell-launch authority on separate caps. The TCB-acknowledgement paragraph in the password-fallback section is grounded in this separation: TLS termination puts plaintext in the terminator’s memory by construction, and the right answer is to size and bound the terminator, not to claim it never sees the bytes.
  • Plan 9 and Inferno for the Plan 9 cpu remote-shell precedent: a CPU server is reached over a connection-oriented transport (originally TCP, with TLS/SSL added later), the client authenticates through 9P’s pluggable Tauth/Rauth auth-fid mechanism, and only after authentication does the client Tattach and run an interactive shell. Inferno’s certificate-based authentication model is the same shape with X.509 instead of Kerberos. The relevance here is structural: remote-CLI access can be built around connection- oriented authenticated transports with verification and authorization as separate stages, exactly the split this proposal uses for mTLS plus SessionManager.tlsClientCert. capOS does not adopt Plan 9’s namespace-as-authority model — that is the wrong primitive for a Cap’n Proto-typed system — but the staged authenticate-then-attach pattern validates the design.

No other docs/research/ file is directly applicable: the seL4, Zircon, EROS/CapROS/Coyotos, LLVM, capnp/OS error handling, IX-on-capOS hosting, and out-of-kernel scheduling reports do not address remote-shell transport choice or PKI integration in ways that change this proposal.

Non-Goals

  • Replacing or subordinating the SSH Shell Gateway. The two are peer production paths.
  • Telnet-over-TLS as a research-only or demo-only path. Production custody (KeyVault + CertificateStore.watch + ACME / internal CA) is the target shape from slice 5; the manifest-seeded development leaf is a stepping stone, not a parallel architecture.
  • STARTTLS via Telnet options.
  • TLS 1.2 or any cipher-policy negotiation surface that allows downgrade.
  • Adding rustls, in-kernel TLS, or any in-kernel networking parser. The kernel SocketTerminalSession is retired; no terminal or protocol byte handling returns to the kernel.
  • SSH-style channel multiplexing, exec, port forwarding, agent forwarding, X11, subsystems.
  • Treating a verified client cert as authority. Authority comes from the principal mapping in SessionManager and the bundle issued by AuthorityBroker, exactly as for SSH public keys.

Proposal: Boot to Shell

How capOS should move from “boot runs smokes and halts” to an authenticated, text-only interactive shell without weakening the capability model.

Problem

The old boot path was a systems bring-up path that started fixed services, proved kernel and userspace invariants, and exited cleanly. The completed local console milestone added interactive login/setup and shell behavior; the later init-owned default manifest moved that shell behind standalone init. The remaining problem space is remote/web login, stronger credential policy, and richer shell/session behavior without reintroducing ambient authority.

The first interactive milestone was deliberately modest:

  • Boot QEMU or a local machine to a text console login/setup prompt.
  • Start a native capability shell after local authentication or first-boot setup.
  • Keep browser-hosted text terminal, WebAuthn/passkeys, and remote enrollment as later work in the same proposal family after the local console path works.
  • Keep graphical shells, desktop UI, window systems, and app launchers as a later tier.

The risk is that “make it interactive” tends to smuggle ambient authority back into the system. A login prompt must not become a kernel uid, a web terminal must not become an unaudited remote root shell, and first-boot setup must not be a first-remote-client-wins race.

Scope

The completed local-console milestone covered:

  • Serial/local text console login and first-boot credential setup.
  • Native text shell as the post-login workload.
  • Minimal SessionManager, CredentialStore, AuthorityBroker, and AuditLog pieces needed to launch that shell with an explicit CapSet.
  • Password verifier records stored with a memory-hard password hash.
  • Local recovery/setup policy for machines with no credential records.

Later in the same proposal family:

  • Passkey registration and authentication for a web text shell.
  • A passkey-only account path that does not require creating a password first.
  • Federated login via OpenID Connect (OIDC) identity providers — device code on the local/serial console, authorization code + PKCE on the web text shell. See OIDC and OAuth2.

Out of scope:

  • Graphical shell, desktop session, compositor, GUI app launcher, clipboard, or remote desktop.
  • POSIX /bin/login, PAM, sudo, su, or Unix uid/gid semantics.
  • Password reset by policy fiat. Recovery is a separate authenticated setup or operator action.
  • Making authentication proofs visible to the shell, agent, logs, or ordinary application processes.

Design Principles

  • Authentication creates a UserSession; capabilities remain the authority.
  • The shell is an ordinary process launched with a broker-issued CapSet.
  • Console authentication, web authentication, and federated OIDC login feed the same session model.
  • Passwords are verified against versioned password-verifier records; raw passwords are never stored, logged, or passed to the shell.
  • Passkeys store public credential material only; private keys stay in the authenticator.
  • OIDC ID tokens are verified against a pinned OidcIdentityProvider; the raw token never reaches the shell or audit stream as bytes.
  • First-boot setup requires local setup authority or an explicitly configured bootstrap credential. Remote first-come setup is not acceptable.
  • A missing credential store does not imply an unlocked system.
  • Guest and anonymous sessions are explicit policy profiles, not fallbacks for missing credentials.
  • Development images may have an explicit insecure profile, but that must be visible in the manifest and serial output.

Architecture

The original local console boot-to-shell proof collapsed the authentication service and interactive shell into a single userspace process. Focused shell-led smokes still boot capos-shell directly as initConfig.init with a narrow bootstrap CapSet (see Service Architecture). The default system.cue path now runs capos-shell as an init-started service through standalone init (Service Architecture), but the shell-side authority model is the same: it mints its own anonymous UserSession and only upgrades after a password login:

flowchart TD
    Kernel[kernel starts one init]
    Init[standalone init or focused shell init]
    Shell[capos-shell]
    Cred[CredentialStore]
    Session[SessionManager]
    Broker[AuthorityBroker]
    Audit[AuditLog]
    Term[TerminalSession]
    Web[WebShellGateway]
    Launcher[RestrictedShellLauncher]

    Kernel --> Init
    Init --> Shell
    Shell --> Term
    Shell --> Cred
    Shell --> Session
    Shell --> Broker
    Shell --> Audit
    Session --> Broker
    Broker --> Launcher
    Cred --> Session
    Audit --> Session
    Audit --> Broker
    Web -. "future" .-> Session

The shell keeps the authority-holding caps needed for its session boundary (terminal, creds, sessions, audit, broker) because the current interactive substrate has not split login, shell, and approval into separate services. It does not hand those caps to any child it spawns; spawn grants go through the broker-issued RestrictedLauncher whose allowlist depends on the current session’s profile (empty for anonymous, full interactive shell set for operator, and empty or narrowly policy-selected for guest). The launcher itself is the Service Architecture ProcessSpawner cap wrapped behind broker-enforced policy, so a shell child cannot widen its CapSet at spawn time. The broker returns a narrow shell bundle such as:

terminal        TerminalSession
self            UserSession metadata
status          read-only SystemStatus
logs            scoped LogReader
home            scoped Namespace or temporary Namespace
launcher        RestrictedLauncher
approval        ApprovalClient

Early builds can omit storage-backed home and use a temporary namespace. They still should not hand the shell broad BootPackage, ProcessSpawner, FrameAllocator, raw device, or global service-supervisor authority by default.

First Terminal Boundary

The first interactive console boundary should be a session-scoped TerminalSession, not a widened boot Console cap and not a raw byte-stream cap handed directly to login or shell processes.

Console stays the early-boot and panic-path output surface. The component that owns the underlying local console transport, line discipline, edit buffer, and later web-terminal framing can be called ConsoleTerminal or TerminalMux; the external authority boundary is the same either way:

  • only the terminal service owns raw console transport state and line buffers,
  • the shell process receives the foreground TerminalSession cap and drives pre-auth password/setup input through it with per-call echo = hidden,
  • shell children do not inherit the terminal unless the shell names it in a spawn plan.

A later web-shell or federated-login service that needs a separate authentication front-end will still get its own TerminalSession and its own broker-issued bundle; it does not widen authority on the local console shell. The shell-side framing of this split — terminal-host process versus shell process, with the terminal owning raw console state and the shell owning the post-auth command loop — lives in Shell.

The first interface should stay line-oriented:

enum LineEcho {
  visible @0;
  hidden @1;
}

enum LineStatus {
  submitted @0;
  cancelled @1;
  closed @2;
}

struct LineRequest {
  prompt @0 :Text;
  maxBytes @1 :UInt32;
  echo @2 :LineEcho;
  allowEmpty @3 :Bool;
}

interface TerminalSession {
  write @0 (data :Data) -> ();
  writeLine @1 (text :Text) -> ();
  readLine @2 (request :LineRequest) -> (status :LineStatus, line :Data);
}

That shape fixes the first boot-to-shell boundary:

  • readLine returns one bounded line or a structured cancelled/closed result. The service owns the temporary edit buffer and scrubs it after completion or cancellation.
  • Echo policy is per call. Password entry uses echo = hidden; the shell never toggles a terminal-global echo bit that could leak into later prompts.
  • The terminal service enforces a hard implementation ceiling even if a caller asks for a larger maxBytes. ConsoleLogin and setup flows should request smaller bounds than the shell’s ordinary command reader.
  • Cancellation is line-scoped. Operator abort input returns cancelled and the caller receives no partial secret buffer.
  • The first milestone does not need raw byte reads, terminal history replay, multi-reader fan-out, or shell-visible secret-state. Paste framing, resize, and richer terminal controls can extend TerminalSession later.

This keeps password/setup entry inside ConsoleLogin and terminal services. The broker, audit log, shell, and shell children only see the outcomes they need: session metadata, policy results, and a terminal handle for post-auth interactive work.

Console Login

The local console path now runs entirely inside capos-shell, so “login” is a shell command rather than a separate pre-shell process. The shell always boots with an anonymous session; authentication is an explicit user action. The three states below describe what the login and setup commands see, not a boot-time mode selector. The in-shell command surface and the login / setup / caps / inspect command behavior live in Shell and Shell; this proposal describes only the session/credential/broker authority side of the same flow. The make run-login smoke covers the password path, and make run-shell covers the anonymous-only path.

Password Configured

If CredentialStore has an enabled console password verifier for the selected principal or profile, login prompts for the password, verifies it through CredentialStore, mints an operator UserSession via SessionManager.login, asks the broker for the operator shell bundle, and swaps the in-shell session and launcher in place.

The verifier record should be versioned:

PasswordVerifier {
  algorithm: "argon2id"
  params: { memoryKiB, iterations, parallelism, outputLen }
  salt: random bytes
  hash: verifier bytes
  createdAtMs
  credentialId
  principalId
}

Argon2id is the default target because it is memory-hard and widely reviewed. The record must include parameters so stronger settings can be introduced without invalidating older records. A deployment may add a TPM- or secret-store-backed pepper later, but the design must not depend on a pepper being present.

On failed attempts, the shell records an audit event and applies bounded backoff before re-prompting within the same login invocation. The backoff state is not a security boundary by itself, because local attackers may reboot; the password hash strength still matters.

No Console Password

If no console password verifier exists, login reports that setup is required. The user must run setup to create the first verifier. The make run-login-setup smoke drives the first-boot path: no verifier exists, login refuses, setup mints the first volatile verifier through the manifest operator seed principal, and the shell then upgrades to an operator session.

Setup mode can:

  • create the first console password verifier,
  • enroll a first passkey for the web text shell (future),
  • create both credentials (future).

Until a credential is created, the shell stays in the anonymous session: it can exercise caps, inspect, session, and help, but the broker-issued anonymous launcher has an empty allowlist, so the shell cannot spawn children or escalate authority. This matches the operator expectation: no configured password means “setup required”, not “open console”.

Passkey-Only Deployment

Passkey-only should be possible without creating a password. It still needs a bootstrap authority path.

Acceptable first-passkey bootstrap paths:

  • local console setup enrolls the first passkey and then never creates a password verifier,
  • the manifest or cloud metadata includes a predeclared passkey public credential for an operator principal,
  • the console prints a short-lived setup challenge that a web enrollment flow must redeem before registering the first passkey.

Unacceptable path:

  • the first remote browser to reach the web endpoint becomes administrator because no password exists.

If a machine is passkey-only, the local console can still expose setup, recovery, guest, or diagnostic profiles according to policy. It should not silently become an unauthenticated administrator shell.

Guest and Anonymous Profiles

The user-identity proposal distinguishes authenticated, guest, anonymous, and pseudonymous sessions (see User Identity and Policy for the full taxonomy and User Identity and Policy for the underlying session structure). Boot-to-shell should consume that model directly.

Authenticated password login creates a human or operator UserSession with auth strength password. Authenticated passkey login normally creates a human, operator, or pseudonymous UserSession with auth strength hardwareKey. Neither proof is authority by itself; both feed the broker. Default password-authenticated local operator sessions do not expire by fixed wall-clock timestamp; their normal lifecycle is explicit logout, terminal/connection/process-tree close, or administrator revocation. A manifest can still opt into a hard operator lifetime for focused proofs or deployment policy.

Guest is the only unauthenticated profile that belongs on the local interactive console by default. It is a deliberate SessionManager.guest() path with a local interactive affordance, weak or no authentication, short expiry, tight quotas, no durable home unless policy grants one, and a bundle such as:

terminal        TerminalSession
self            guest UserSession metadata
tmp             temporary Namespace
launcher        RestrictedLauncher(allowed = ["help", "settings"])
logs            scoped LogReader for this guest session

Guest should not receive ApprovalClient for administrative actions unless a named policy grants it. If no console password exists, setup may offer a guest session only when the manifest explicitly enables a guest profile. Otherwise the operator must create a credential or leave the ordinary shell unavailable.

Anonymous is different. It is usually remote or programmatic, has a random ephemeral principal ID, receives a smaller cap bundle than guest, and has no elevation path except “authenticate” or “create account”. It is not the console fallback for missing credentials, and it should not be counted as “booted to shell” unless the product goal is an explicitly anonymous demo.

If the web gateway later supports anonymous access, it should be a purpose-scoped workload or very restricted text terminal with no durable home, strict quotas, short expiry, and audit keyed by network context plus ephemeral session ID. It must not share the passkey setup path, because passkey-only bootstrap is a credential-enrollment flow, not anonymous access.

An empty CapSet remains the “Unprivileged Stranger” case. It is useful for attack-surface demonstration, but it is not a session profile and not a shell login mode.

Web Text Shell and Passkeys

This is later work in the same proposal family, not part of the current local-console acceptance gate. The web shell is a browser-hosted terminal transport, not a graphical shell. It should display the same native text shell protocol through a terminal UI and should launch the same kind of session bundle as the local console path.

Required pieces:

  • network stack and HTTP/WebSocket or equivalent streaming transport,
  • TLS or a deployment mode acceptable to browsers for WebAuthn,
  • stable relying-party ID and origin policy,
  • random challenge generation,
  • passkey credential storage,
  • user-verification policy,
  • audit and rate limiting.

Passkey credential records should store public material:

PasskeyCredential {
  credentialId
  principalId
  publicKey
  relyingPartyId
  userHandle
  signCount
  transports
  userVerificationRequired
  createdAtMs
}

The authentication flow is:

  1. Browser requests a login challenge.
  2. WebShellGateway asks SessionManager or CredentialStore for a bounded, random challenge tied to the relying-party ID and intended principal.
  3. Browser calls the platform authenticator.
  4. Gateway verifies the WebAuthn assertion, origin, challenge, credential ID, public-key signature, user-presence/user-verification flags, and sign-count behavior.
  5. SessionManager mints a UserSession with auth strength hardwareKey.
  6. AuthorityBroker returns the shell bundle for that session/profile.
  7. RestrictedShellLauncher starts the native text shell connected to the web terminal stream.

Registration requires an existing authenticated session, local setup authority, or an explicit bootstrap path. Passwordless registration is allowed; unauthenticated remote registration is not.

Remote Session Clients

The same authentication and broker model also serves non-shell remote clients. A host app – CLI, native GUI, Tauri backend, webapp gateway, or service client – should not have to start a terminal shell just to call typed services. After password, public-key, OIDC, passkey, mTLS, guest, anonymous, or service/workload admission succeeds under policy, SessionManager mints a UserSession and AuthorityBroker returns a remote-client bundle. The client then sees a remote CapSet view whose entries are Cap’n Proto RPC object references, not local capOS cap slots.

This keeps boot/login policy unified:

  • authentication proofs are consumed by trusted session/admission services;
  • the broker chooses the CapSet for the selected profile;
  • shells, web terminals, agents, and non-shell remote clients are different consumers of session bundles;
  • password auth is one adapter, not the remote protocol shape.

The detailed remote-client design lives in Remote Session CapSet Clients.

Federated Login (OIDC)

OIDC is the third authentication path alongside password and passkey. It lets capOS accept identity from a corporate IdP (Azure AD, Google Workspace, Okta, Keycloak, Dex, GitHub) without capOS storing or managing primary user credentials. The schemas, grant types, JWKS handling, and token lifecycle live in OIDC and OAuth2; this section describes only the integration surface.

Console (device code)

Serial consoles have no browser. The login path is RFC 8628 device authorization:

  1. ConsoleLogin calls OAuthClient.startDeviceCode on an IdP that the manifest has configured as acceptable for console login.
  2. TerminalSession.writeLine prints the verification URL and user code; the user completes the flow on a separate device.
  3. ConsoleLogin polls pollDeviceCode at the advertised interval, honoring slow_down. Expiry is a hard fail.
  4. On granted, ConsoleLogin passes the resulting IdToken cap to SessionManager.login(method = "oidc", proof = idTokenRef).
  5. SessionManager calls OidcIdentityProvider.verifyIdToken with the client’s IdTokenPolicy, receives IdTokenClaims, derives PrincipalInfo.id = hash(iss, sub), derives authStrength from acr/amr, and mints a UserSession.
  6. The broker returns the same shell bundle as for other login methods; no OIDC-specific authority flows into the shell.

Failed verification uses the same generic failure text and bounded backoff as password login. The manifest controls which IdPs the console accepts and which subject patterns are allowed to log in; unlike password/passkey paths, OIDC login does not implicitly treat “any valid token from any configured IdP” as authority — a permitted- subject allow-list is required.

Web text shell (authorization code + PKCE)

WebShellGateway offers OIDC alongside WebAuthn. The gateway drives OAuthClient.startAuthCode, redirects the browser to the IdP, and consumes the returned code through completeAuthCode. PKCE is mandatory; state and nonce are generated from EntropySource. The gateway validates redirect URI exactly, requires TLS, and enforces IdTokenPolicy.nonceMustMatch.

Identity provider trust

CredentialStore gains IdP trust records alongside password verifiers and passkey public credentials:

IdpTrustRecord {
  recordId
  issuer                 # canonical URL
  clientRegistrations    # allowed OAuthClient records for this IdP
  jwks                   # snapshot or discovery URL + pinned TLS roots
  allowedAlgorithms
  allowedAcr / allowedAmr
  subjectAllowList       # e.g. principals matching sub/email/groups
  clockSkewSeconds
  authStrengthMap        # acr/amr -> AuthStrength (X.1254 LoA)
  createdAtMs
}

Records are public material (IdP URLs, JWKS, policy). Like passkey records, they can be bootstrapped from the manifest or cloud metadata, with a bounded RAM overlay for admin-managed records until durable storage exists. CredentialStore.verify stays a secret- preserving boundary; OIDC verification that rejects a token returns only denied with a generic failure class.

Federated principal bootstrap

For a fresh image with no local password, OIDC login can create the first UserSession when the manifest explicitly predeclares:

  • one or more trusted issuers,
  • a subject allow-list or group/claim predicate,
  • the principal identities those subjects map to.

This is the OIDC analog of the manifest-declared passkey bootstrap path: the authority comes from the manifest trust root, not from “the first caller who presents a token wins.” Without predeclared trust, OIDC login cannot be the only path to an administrative session on a fresh image — setup mode applies.

Scope of tokens

Access tokens issued alongside the ID token belong to the OAuth service. Neither the shell nor the broker ever receives raw token bytes. If the broker needs to delegate outbound authority to the session (e.g. “read from our corporate storage API”), it returns a wrapper cap holding an AccessToken cap, not a bearer string.

Refresh and session duration

SessionManager holds the RefreshToken cap associated with a federated session when the IdP issues one and the scope includes offline_access (or the IdP’s equivalent). Token refresh is a privileged operation scoped to SessionManager and audited; the shell cannot refresh its own session token. On logout or session expiry, SessionManager releases the refresh token and optionally calls the IdP’s revocation endpoint.

Required Interfaces

These are ordinary capabilities, not kernel modes.

EntropySource

Owns the only approved path for fresh auth/session secrets in the first implementation.

Responsibilities:

  • provide unpredictable bytes for password salts, session IDs, setup tokens, and later WebAuthn challenges,
  • fail closed when secure randomness is unavailable instead of returning predictable bytes,
  • keep raw entropy authority out of shells and ordinary workloads.

Only CredentialStore, SessionManager, later WebShellGateway, and a future SshGateway or narrower SSH transport-crypto service should hold it. ConsoleLogin, the shell, and spawned workloads should never mint their own session IDs, salts, setup tokens, SSH key-exchange material, or challenges.

CredentialStore

Owns credential verifier records and challenge state.

Responsibilities:

  • list whether setup is required without exposing hashes,
  • create password verifier records from setup authority,
  • verify password attempts without returning the password or verifier bytes,
  • register passkey public credentials,
  • store trusted OIDC identity-provider records (issuer, JWKS or pinned discovery URL, allowed audiences, subject allow-list, acr/amrAuthStrength mapping) so SessionManager can consume OidcIdentityProvider caps bound to deployment policy,
  • issue and consume bounded WebAuthn challenges,
  • rotate or disable credentials through an authenticated admin path.
  • load bootstrap verifier/public-credential and IdP-trust records from manifest or cloud bootstrap config and maintain a bounded RAM overlay until durable storage exists.

SessionManager

Creates UserSession metadata after successful authentication, explicit local guest policy, purpose-scoped anonymous policy, or setup policy. It should record auth method, auth strength, freshness, expiry, profile, and audit context. It should not hand out broad system caps directly. Boot-to-shell uses authenticated sessions and optional local guest sessions for ordinary interactive shells; anonymous sessions are narrower remote/programmatic contexts unless a manifest explicitly defines an anonymous demo terminal. Session IDs come from EntropySource; if fresh randomness is unavailable, authenticated login and token-bearing setup flows fail closed instead of reusing predictable IDs. The end-to-end mint/promote sequence and the account-store boundary it consumes are User Identity and Policy; the shell-side immutable-per-process invocation context that consumes the minted session lives in Shell and is proven by make run-session-context. The make run-local-users smoke covers the manifest-seeded local operator path that backs the password-login flow.

AuthorityBroker

Maps a session/profile to a narrow CapSet. Early policy can be static and manifest-backed. The important constraint is that the broker returns capabilities, not roles or strings that downstream services treat as authority.

ConsoleLogin

Consumes TerminalSession, CredentialStore, SessionManager, broker access, and a restricted shell launcher. It never receives broad boot-package or device authority unless a recovery profile explicitly grants it. It owns pre-auth password/setup entry and must not forward raw password bytes, setup tokens, or partial secret input into the shell, broker, or audit service.

On the current local-console substrate ConsoleLogin is not a separate process. Its responsibilities are folded into capos-shell, which owns the pre-auth TerminalSession, drives password/setup prompts, invokes CredentialStore/SessionManager/broker, and promotes its own session in place. The authority rules above still apply: the same process must not leak password bytes, setup tokens, or broker secrets into spawned children. A future web-shell or federated-login front-end can re-introduce a separate ConsoleLogin-shaped service that mints sessions for a distinct shell process.

WebShellGateway

Terminates the browser terminal session, handles passkey challenge/response, drives the OAuth authorization code + PKCE flow for federated login, and connects the authenticated session to the shell process. It should not own general administrative caps. It should ask the broker for the same narrow shell bundle as any other session.

SshGateway

Terminates SSH transport for CLI remote shell access, verifies host/user key protocol state, maps accepted SSH public keys to sessions, and connects the authenticated session to the shell process through an SSH-backed TerminalSession. It should not own general administrative caps, raw KeyVault administration, port-forward authority, or broad process-spawn authority. It should ask the broker for the same narrow shell bundle as any other session. The detailed transport and key-custody model is in SSH Shell Gateway. The initial schema names the supporting authority surfaces TcpListenAuthority, SshHostKey, AuthorizedKeyStore, SshTerminalFactory, and RestrictedShellLauncher; the development host-key path now exists only as an explicitly labeled non-production QEMU proof. Bounded QEMU proofs now cover configured authorized-key lookup, fixture public-key session minting, restricted shell launch, and a plain-TCP terminal-host handoff, while real SSH signing, encrypted transport, packet/channel handling, and the final OpenSSH harness remain later gates.

OAuthClient and OidcIdentityProvider

Supplied by the OAuth service (OIDC and OAuth2). ConsoleLogin holds an OAuthClient cap configured for device-code grants against the manifest-declared IdPs, and an OidcIdentityProvider cap for ID-token verification. WebShellGateway holds analogous caps configured for authorization code + PKCE. Neither service retains access tokens in long-lived session state — refresh tokens live inside SessionManager, bound to the UserSession lifecycle.

AuditLog

Records setup entry, credential creation, failed attempts, successful session creation, broker decisions, shell launch, credential disablement, and logout. Audit entries must not include passwords, password hashes, passkey private material, bearer tokens, complete environment dumps, or full terminal lines. Correlate auth/session events with opaque record IDs and policy/result codes, not with secret-bearing payloads.

First Security Substrate

Before local setup/login code lands, the first implementation should fix these rules:

  • Entropy source: CredentialStore and SessionManager receive an EntropySource cap. Password salts, session IDs, setup tokens, and later passkey challenges come only from it. If secure randomness is unavailable, credential creation, authenticated session creation, setup-token issuance, and passkey enrollment fail closed. The only remaining boot path is an explicit manifest-gated guest or development profile.
  • Credential backing: CredentialStore is initialized from manifest or cloud-bootstrap verifier/public-credential records plus a bounded RAM overlay for setup-created credentials and disable/rotate state. Until a real storage service exists, any setup-created credential and any disable/rotate action recorded only in that overlay is volatile and both the console UX and audit records must say so. The manifest may carry verifier or public-credential material, not raw passwords or reusable setup tokens.
  • Bounded setup-token/challenge state: CredentialStore owns one bounded table for setup tokens and later WebAuthn challenges. Each record is bound to a purpose, principal/profile, opaque record ID, secret bytes, created/expiry times, and consumed bit. The first redemption attempt consumes the record whether the attempt succeeds or fails, so replay always fails closed and retry requires a newly minted token or challenge. Records are scrubbed on consume or expiry.
  • Auth failure policy: CredentialStore.verify returns only success, denied, or unavailable. ConsoleLogin prints generic failure text and enforces bounded backoff without revealing whether a principal exists, which field mismatched, or whether a verifier came from bootstrap config or the RAM overlay. Permanent lockout is out of scope for the first milestone; bounded delay plus audit is required.
  • Audit and redaction: AuditLog records structured auth/session events with result codes, profile, auth method, reason classes, and opaque credential/token record IDs. Principal/session IDs appear only after successful authentication or when referring to an already minted session; a failed pre-auth attempt logs only a terminal-local event ID plus generic failure class. It must never log raw passwords, verifier bytes, salts, setup-token/challenge secrets, passkey private material, or full terminal lines. When setup creates a volatile credential or RAM-only disable state, the audit event records volatile = true rather than any secret-bearing payload.

Prerequisites

Boot-to-shell should not be selected before these pieces are credible:

  • Default boot uses init-owned manifest execution; the kernel starts only init with fixed bootstrap authority.
  • init can start long-lived services and not just short smoke binaries.
  • ProcessSpawner can launch the shell and login services with exact grants.
  • A TerminalSession path exists. Current Console stays output-oriented; login and shell work should use bounded line input with per-call echo mode and structured cancellation instead of raw console reads.
  • The native text shell exists as a capos-rt binary with caps, inspect, call, spawn, wait, release, and basic error display.
  • EntropySource exists for salts, session IDs, setup tokens, and later WebAuthn challenges, and auth/setup flows fail closed if it is unavailable.
  • There is at least bootstrap verifier/public-credential backing plus a bounded RAM overlay. Durable credential storage can come later, but the first implementation must be honest about whether created credentials survive reboot.
  • Minimal SessionManager, AuthorityBroker, and AuditLog services exist.
  • A restricted launcher or broker wrapper prevents the shell from receiving broad init authority.
  • Web text shell requires networking, HTTP/WebSocket or equivalent, TLS/origin handling, and WebAuthn verification. It can lag local console boot-to-shell. TLS configuration, server certificates, ACME issuance, OCSP stapling, and CT policy are defined in Certificates and TLS; WebAuthn attestation certificate verification uses the CertVerifier from that proposal against a FIDO MDS trust store.
  • Federated OIDC login requires outbound TLS to the IdP discovery and JWKS endpoints, an OAuth client service, and manifest-declared IdP trust records. It depends on networking and the interfaces in OIDC and OAuth2. Device code can land with the local console path once networking exists; authorization code + PKCE lands with the web text shell.

Completed Local Milestone Definition

The local-console boot-to-shell milestone completed when:

  • make run-shell or the default boot path reaches a text login/setup prompt. The focused proofs are make run-terminal for the bounded line-discipline TerminalSession surface, make run-credential for the password-verifier store, make run-login for the password-login path, make run-login-setup for the first-boot setup path, make run-local-users for the manifest-seeded local-operator path, and make run-shell for the anonymous-only path.
  • With a configured password verifier, the console refuses the shell on a bad password and launches it on the correct password.
  • With no console password verifier, the console enters setup mode and requires creating a credential or selecting an explicitly configured local guest or development policy before launching a normal shell.
  • If secure randomness is unavailable, setup and authenticated login fail closed; only explicitly enabled guest or development profiles may continue.
  • Guest console sessions, when enabled, are created through SessionManager.guest() and receive only terminal/tmp/restricted-launcher style caps with no administrative approval path by default.
  • Anonymous sessions are not used as the missing-password console fallback and are not accepted as proof that the ordinary boot-to-shell milestone works.
  • The shell starts with a broker-issued CapSet and can prove at least one typed capability call plus one exact-grant child spawn through a granted launcher or other explicitly scoped spawn authority.
  • ConsoleLogin drops its TerminalSession once the shell starts, and a shell-spawned child without an explicit terminal grant cannot use the terminal.
  • Audit output records setup/auth/session/broker/shell-launch events without leaking secrets.
  • Web text shell, passkey-only enrollment, and remote setup remain later work in this proposal family after the local console path exists.
  • Graphical shell work is not part of the acceptance criteria.

Implementation Plan

  1. Text console substrate. TerminalSession is the first interactive console boundary. Keep Console output-only; terminal services own bounded line buffers, per-call echo mode, and cancellation behavior.

  2. Native shell binary. The shell proposal’s minimal REPL over capos-rt lists CapSet entries, inspect metadata, call granted capabilities including TerminalSession, use a granted restricted launcher or other scoped spawn authority for exact-grant child launch, wait, release, and print typed errors. The ordinary shell profile must not depend on BootPackage or broad ProcessSpawner authority.

  3. Credential store prototype. Manifest/cloud-bootstrap-backed verifier and public-credential records, a bounded RAM overlay for setup-created credentials, EntropySource integration for salts/session IDs/tokens, and Argon2id verification anchor the local path. Host-generated verifier inputs are bootstrap configuration, not acceptance evidence for future credential work.

  4. Console setup/login. The configured-password path and no-password setup path are implemented. Setup creates verifier state through CredentialStore, not ad hoc shell process config. The local password path now prompts for username> before hidden password>, routes SessionManager.login through an account/principal selector plus proof/source metadata, verifies only selected accounts that own console-password, and migrates the existing seeded console password to an explicit default operator account without creating username-enumeration terminal differences. Durable account-local verifier records remain future storage-backed work.

  5. Minimal session and broker. UserSession metadata and the policy broker return a narrow shell bundle. Anonymous bundles stay separate from ordinary shell login, and QEMU proofs show the shell cannot obtain broad boot authority by default.

  6. Audit and failure policy. Generic auth failure handling, bounded attempt backoff, hidden password entry, and redacted audit records are part of the completed local path. Future passkey/setup-token challenge state must preserve the same no-secret logging rule.

  7. Web text shell gateway. After networking and a terminal transport exist, add WebAuthn registration and authentication for the browser-hosted terminal. Support passkey-only enrollment through local setup or explicit bootstrap authority.

  8. Federated OIDC login. Add OAuthClient/OidcIdentityProvider integration to ConsoleLogin (device code) and WebShellGateway (auth code + PKCE). Extend CredentialStore with IdP trust records. Map acr/amr claims to AuthStrength. Require a manifest-declared subject allow-list for administrative sessions.

  9. Durability and recovery. Move credential and IdP-trust records from boot config or RAM into a storage-backed service once storage exists. Define recovery as a credential-admin operation, not an implicit bypass.

Security Notes

  • Password hashing belongs in userspace auth services, not the kernel fast path.
  • WebAuthn challenge state must be single-use and bounded by expiry.
  • The web gateway must validate origin and relying-party ID; otherwise passkey authentication is meaningless.
  • Setup tokens are credentials. They must be short-lived, single-use, audited, and hidden from ordinary process output.
  • Credential records are sensitive even though they are not raw secrets; avoid printing them in debug logs.
  • The shell and any agent running inside it must treat logs, terminal input, files, web pages, and service output as untrusted data.

Non-Goals

  • No graphical shell in this milestone.
  • No passwordless remote first-use takeover.
  • No kernel uid, gid, root, or login mode.
  • No default shell access to broad BootPackage, raw ProcessSpawner, DeviceManager, raw storage, or global supervisor caps.
  • No authentication proof passed through command-line arguments, environment variables, shell variables, audit records, or agent prompts.

Open Questions

  • Which Argon2id parameters fit the early userspace memory budget while still resisting offline guessing?
  • How should durable storage merge bootstrap verifier records with the first RAM overlay once a storage-backed credential service exists?
  • How should local console setup prove physical presence on cloud VMs where serial console access may itself be remote?
  • What is the first acceptable TLS/origin story for QEMU and local development WebAuthn testing?
  • Should passkey-only machines keep a disabled console password slot for later recovery, or should recovery be entirely credential-admin/passkey based?

Proposal: SystemInfo Capability

System-wide informational data (banner/MOTD today, hostname, help topics, and on-ISO documentation later) exposed as a single typed capability instead of ad-hoc per-feature kernel parameters.

Status: Phase 1 + Phase 2 implemented. Phase 1 introduced the SystemInfo capability (renamed from ShellConfig, schema field motd) and unified the print site so console and Telnet shells both call SystemInfo.motd() themselves. Phase 2 then moved post-login authority into AuthorityBroker.shellBundle: the broker mints SystemInfo plus a profile-scoped serviceEndpoints list (adventure + chat for operator shells, empty for guest and anonymous shells), so Telnet/SSH-launched operator shells can run chat-client/adventure-client without per-transport manifest forwarding. chat is the kernel-singleton chat endpoint (KernelCapSource::ChatEndpoint) so all operator shells share one chat-server queue; adventure is a fresh per-session endpoint. Last reviewed: 2026-04-29 05:59 UTC.

Problem

The pre-existing ShellConfig capability had a single method (motd) and was distributed via manifest cap grants. That was already a capability shape, but two things made it brittle:

  • The name claimed too little. “Shell config” suggests configuration of the shell binary, but the data is system-wide and transport-agnostic (banner text doesn’t belong to any one shell). Anything similar we wanted to expose later — hostname, help topics, manpages — would either squat on ShellConfig (wrong scope) or get its own one-method cap (proliferation).
  • The print site was asymmetric. init printed the banner over COM1 before launching the console foreground shell; the Telnet-spawned shell printed it itself after the gateway forwarded shell_config as a manifest grant. Two code paths, two places to keep consistent. The SSH Shell Gateway successor, and any future transport, would add a third.

The capability model already supports a clean fix: one cap, one print site, room to grow.

Design

Interface

interface SystemInfo {
    motd @0 () -> (text :Text);
    hostname @1 () -> (name :Text);
    # Future:
    # helpTopics @2 () -> (topics :List(HelpTopic));
    # manPage @3 (name :Text) -> (page :ManPage);
}

Adding methods later is a Cap’n Proto-compatible change. Each future addition gets its own kernel data source (or a userspace SystemInfo service backed by storage, when persistence exists). Callers that only need MOTD do not pay for the others.

Data Source

SystemInfo is currently kernel-backed and reads from manifest kernelParams.motd (renamed from shellMotd) and kernelParams.hostname (landed; defaults to capos). A CloudMetadata-derived or storage-backed mutable hostname remains future work; help topics and manpages will eventually be served by a userspace documentation service that holds a SystemInfo cap as one of its exports. The kernel implementation is intentionally minimal — it owns text the boot manifest already provided, and nothing else.

Distribution

A process gains SystemInfo by listing it as a manifest cap source:

caps: [{
    name: "system_info"
    source: {kernel: "system_info"}
}]

Phase 1 granted the cap to:

  • init — kept. init no longer reads SystemInfo itself, but the manifest spawn loop forwards init-held kernel-source caps to each service. The console foreground shell and any gateway service that receives system_info is reached through this forwarding, so init must hold the cap.
  • The default console foreground shell (new — needed so the console shell can print MOTD itself).
  • telnet-gateway, restricted-shell-launcher, and the SSH gateway terminal-host (each forwarded system_info to the child shell via RestrictedShellLauncher, the same mechanism that forwards creds/sessions/audit/broker; the telnet-gateway and SSH terminal-host demos are since removed with the kernel socket owner).

Phase 2 moved normal shell distribution into AuthorityBroker.shellBundle: the broker mints a fresh SystemInfo cap per session and returns it alongside the launcher, copied session, and any profile-scoped service endpoint caps allowed for that profile. RestrictedShellLauncher no longer requires a system_info pass-through grant.

The banner is printed by the shell on startup after it obtains its initial shell bundle, across all transports:

#![allow(unused)]
fn main() {
// shell/src/main.rs
fn write_motd_from_bundle(...) -> Result<(), i64> {
    let mut system_info = SystemInfoClient::new(bundle.system_info.capability());
    let motd = system_info.motd_wait(ring, WAIT_FOREVER)...;
    for line in motd.lines() {
        terminal.write_line_wait(ring, line, ...)?;
    }
    Ok(())
}
}

init is no longer responsible for printing MOTD — its write_motd_to_terminal helper is removed.

Why Phase 1 Stayed Manifest-Driven

Moving SystemInfo distribution into AuthorityBroker.shellBundle made architectural sense, but it required the broker to hold or be able to mint informational caps and changed the shell bundle shape. Phase 1 therefore isolated the rename, the unification of the print site, and the schema interface as separately reviewable prerequisites.

Phase 2: Broker-Minted SystemInfo and Service Endpoints

AuthorityBroker.shellBundle returns a RestrictedLauncher, a copied UserSession, SystemInfo, and any allowed profile-scoped service endpoint caps per call:

interface AuthorityBroker {
    shellBundle @0 (sessionCapId :UInt32, profile :Text)
        -> (launcherIndex      :UInt16,
            sessionIndex       :UInt16,
            systemInfoIndex    :UInt16,
            serviceEndpoints   :List(BundleEndpoint));
}

struct BundleEndpoint {
    name        @0 :Text;        # e.g. "chat", "adventure"
    capIndex    @1 :UInt16;
}

The broker mints:

  1. SystemInfo (always) — replaces the manifest grant.
  2. Service-endpoint caps the requested profile is allowed to reach (chat and adventure for operator profiles, none for guest or anonymous).

RestrictedShellLauncher’s required shell grants collapsed to creds, sessions, audit, and broker; system_info and service endpoint authority now arrive through the broker bundle, keeping the kernel launcher minimal.

Phase 2 implementation notes

  • Phase 2 landed in three sub-tiers (A: SystemInfo; B: adventure; C: chat). The broker holds a kernel-side Arc<Endpoint> for chat — the KernelCapSource::ChatEndpoint lazy singleton constructed by BootCapFactory — and Arc::clones it into every operator bundle. adventure is fresh per operator bundle.
  • The shell prefers manifest-granted (CapSet) caps over bundle service endpoints when both have the same name. The focused chat manifest now gives init the kernel singleton chat_endpoint to forward to chat-server and relies on the broker-issued chat endpoint for the normal shell path instead of a shell-local chat-server export, matching the Telnet and default shell bundle model. Normal shell @chat badge 200 syntax is now rejected by the parser before it can reach the delegated-client relabel check; lower-level smoke paths retain relabel fixtures for kernel/process-spawn enforcement.
  • RestrictedShellLauncher::REQUIRED_SHELL_GRANTS no longer requires system_info; the broker is now the single source for that cap.

Cross-References

  • Shell — banner ownership and help-topic discovery were implicit open questions; this proposal resolves “where does the banner live” (Phase 1) and “where does the post-login authority live” (Phase 2).
  • Networking — Telnet gateway/shell interaction; SystemInfo is now part of the broker bundle consumed by the shell after launch.
  • Boot to Shell — login flow runs after the shell has acquired its initial anonymous bundle and printed MOTD from broker-minted SystemInfo.
  • Userspace Authority Broker — Phase 2 makes the broker the single source of post-login authority, including informational caps.

Non-Goals

  • This proposal does not introduce persistent storage for system information. MOTD comes from the boot manifest; future fields will come from manifest, CloudMetadata, or a userspace documentation service when those exist.
  • This proposal does not add a separate pre-authentication issue/banner channel. MOTD is printed after initial shell-bundle acquisition; a true pre-auth warning banner would need its own reviewed distribution path.
  • Hostname is now served by SystemInfo.hostname @1, sourced from kernelParams.hostname (default capos) and printed by the shell hostname command. A mutable, CloudMetadata-derived, or storage-backed hostname is still out of scope until a consumer needs it.

Open Questions

  • Pre-authentication warning banner: MOTD now comes from the initial broker-issued shell bundle. If capOS later needs a banner before SessionManager/AuthorityBroker interaction, it should be a distinct issue-style surface rather than a regression to ad-hoc manifest grants.
  • Hostname source: the manifest-field path landed (kernelParams.hostname, read-only). A CloudMetadata-derived or storage-backed mutable hostname remains parked until a consumer needs to change it at runtime.
  • Help-topic discovery: tied to schema reflection and the SchemaRegistry open question in shell-proposal.md. Likely lives in a userspace documentation service, not in the kernel cap.

Proposal: System Manual Capability

A built-in, system-served reference manual: capOS should be able to explain itself from inside itself. The Manual capability serves Unix-style man pages, schema-derived interface manuals, and a man-shaped reference corpus through three surfaces – the shell, the self-served web UI, and a typed capnp API – without any ambient file access.

Status: Phases 1-4 settled. Phase 1 landed the Manual capnp interface, the boot-packaged ManualCorpus blob compiled by tools/manualc, the kernel Manual cap (kernel/src/cap/manual.rs), and the shell man/apropos builtins, proven by make run-system-manual-smoke. Phase 2 landed the self-served web-UI doc viewer. Phase 3 (schema-derived section-2 DESCRIPTION) was satisfied at Phase 1 and its already-landed contract is now locked by proofs (see Phasing). Phase 4 (programmatic API + agent export) is settled with deviation: the describe @4/buildInfo @5/topics @2 runtime support already shipped, so Phase 4 reduces to documenting that contract as stable and locking the consistency invariants that genuinely hold – byte identity between the in-system manual navigation and the published-site llms.txt is infeasible and undesirable (see Phasing). This proposal promotes the documentation surface that the SystemInfo proposal sketched as the # Future: helpTopics @2 / manPage @3 stubs into a dedicated capability, and gives the self-served web UI an in-system documentation source instead of relying on the externally hosted mdBook site. Last reviewed: 2026-05-26 23:17 UTC.

Phase 1 as-built notes

Two Phase-1 choices refine the original plan and are recorded here so the design matches the code:

  • Section-2 DESCRIPTION is sourced from .capnp doc comments at build time. The schema already carries per-interface doc comments, and tools/manualc parses the .capnp text directly (not the generated bindings), so section-2 pages take their NAME-line title and DESCRIPTION from the interface doc comment and their SYNOPSIS from the parsed methods. This is a pragmatic improvement over the originally-planned curated-prose-keyed-by-id: it cannot drift from the schema and needs no per-interface prose file. The build check therefore requires every interface to carry a non-empty doc comment. Phase 3 still adds doc-comment preservation in the generated bindings for live reflection (describe), which is a distinct mechanism from build-time text parsing.
  • topics @2 returns a section-based index. Pages are indexed by manual section (Commands, Capabilities, …). The Phase 2 web-UI viewer renders this section-based index in its topic sidebar; replacing it with the curated front-matter topics taxonomy remains future corpus/build work.
  • describe @4 is backed by a build-time interfaceId index. manualc computes each interface’s capnp type id (validated against the generated *_INTERFACE_ID constants) and emits an id->page-name index in the blob, so describe resolves an in-tree interface id to its section-2 page today.

Problem

Today capOS has rich documentation, but none of it is reachable from a running capOS instance. The corpus lives in docs/ and is rendered by the host-side mdBook pipeline (tools/mdbook-doc-metadata/); a booted system, a shell user, or the in-guest web UI cannot answer “what does this capability do?” or “how do I use this command?” without leaving the system. For a research OS whose whole thesis is that the typed interface is the contract, the inability to read those contracts in-system is a conspicuous gap.

Three concrete pressures motivate a dedicated capability:

  • A public explorer demo needs in-system docs. The cloud-deployment and self-served web-UI work point toward a publicly explorable instance. An explorer who has never seen capOS needs to discover capabilities, commands, and concepts from inside the UI – not by alt-tabbing to an external site that may drift from the running build.
  • SystemInfo is the wrong home. The SystemInfo proposal already foresaw this and stubbed helpTopics/manPage methods, while noting the tension: documentation would “either squat on ShellConfig (wrong scope) or get its own one-method cap (proliferation).” SystemInfo is for small system-wide scalars (MOTD, hostname). Documentation is a queryable corpus with search, sectioning, and cross-links. Bolting a content/query service onto a scalar-info cap is the wrong shape.
  • Schema-as-manual is a capability-native idea worth capturing. Because every capability is a typed Cap’n Proto interface, a capability’s reference page can be generated from its own schema. “The interface IS the permission” extends naturally to “the interface IS its own reference page.” No other OS doc system gets this for free; capOS should not throw it away.

Design Principle: Ground On man, Modernize Navigation

The design is deliberately conservative at its core and ambitious at its edges.

  • The core is man. The proven Unix model – ordered sections (NAME, SYNOPSIS, DESCRIPTION, OPTIONS, ERRORS, SEE ALSO, …), numbered manual sections by kind, and apropos/man -k keyword search – is the contract every page honors. The mechanics (man <name>, man <section> <name>, man -k) are immediately familiar to anyone who knows man.
  • The navigation is modern. On top of the man core we layer the discovery affordances that make documentation pleasant: a topic index (reusing the existing front-matter topics taxonomy), tldr-style example-first quick views, hyperlinked SEE ALSO cross-references, and an agent-readable export. Plan 9’s “follow the documentation pointers on demand” philosophy – navigate by need, not by linear reading – is the model for the cross-link graph.

The two are layered, not in tension: the modern surface is a renderer and index over man-shaped content, never a replacement for it.

Manual Sections (the capOS analog of man 1-8)

Classic man numbers sections by kind of thing documented. Plan 9 keeps the same idea but splits “devices / file servers / protocol / formats” – a split that maps cleanly onto a capability OS, where devices and services are capabilities. The proposed capOS sections:

SectionNameContents
1CommandsShell builtins and userspace command binaries (spawn, caps, login, …).
2CapabilitiesOne page per typed capability interface (Console, Timer, ProcessSpawner, DMAPool, …); SYNOPSIS schema-generated.
3Runtime & SDKcapos-rt / libcapos / capos-service APIs available to userspace programs.
5Manifests & SchemasBoot-manifest fields, CUE config, schema/capos.capnp structures and their wire contracts.
7ConceptsProse: the capability model, the ring protocol, threading contract, session-bound invocation context.
8OperationsOperator/admin surfaces: boot, run targets, remote-session gateway, cloud deployment.

The section numbers diverge from Unix deliberately: Unix 2 is syscalls, but capOS’s whole point is that it has essentially no syscall surface, so 2 is Capabilities instead. The numbering is capOS-specific and documented in intro(7); the mechanics are unchanged, so the muscle memory transfers. Section 2 is the capability-native centerpiece – the part no conventional OS can auto-generate, because conventional OSes have no machine-readable interface contract for every resource.

Interface

struct ManPage {
    name       @0 :Text;            # "console", "spawn", "capability-model"
    section    @1 :UInt8;           # 1,2,3,5,7,8 (see section table)
    title      @2 :Text;            # short NAME-line abstract
    body       @3 :Text;            # rendered page text (man-shaped sections)
    seeAlso    @4 :List(PageRef);   # cross-links -> SEE ALSO graph
    examples   @5 :List(Text);      # tldr-style example-first snippets
    source     @6 :Source;          # schemaReflection | prose | runtime
    lastReviewed @7 :Text;          # provenance from doc front matter
    buildId    @8 :Text;            # build/commit id this page was rendered from
}

struct PageRef   { name @0 :Text; section @1 :UInt8; siteOnly @2 :Bool; }
struct Topic     { key @0 :Text; title @1 :Text; pages @2 :List(PageRef); }
struct Apropos   { query @0 :Text; matches @1 :List(PageRef); }

enum Source { schemaReflection @0; prose @1; runtime @2; }

interface Manual {
    # man <name> [section]: fetch a single page.
    page      @0 (name :Text, section :UInt8) -> (page :ManPage);
    # man -k / apropos: keyword search over the prebuilt index.
    apropos   @1 (query :Text) -> (result :Apropos);
    # the modern topic index, reusing the docs front-matter taxonomy.
    topics    @2 () -> (topics :List(Topic));
    # enumerate a section (man -s 2: list all capabilities).
    section   @3 (section :UInt8) -> (pages :List(PageRef));
    # interfaceId -> section-2 page lookup (see note below).
    describe  @4 (interfaceId :UInt64) -> (page :ManPage);
    # the build/commit this manual blob was produced from.
    buildInfo @5 () -> (commit :Text, builtAt :Text);
}

Manual is read-only and holds no authority beyond serving text. Adding methods later is a Cap’n Proto-compatible change, matching the additive discipline the SystemInfo and DDF work already follow.

describe @4 is an interfaceId -> section-2 page lookup. It does not take or verify a live capability: the caller passes an interface id it already knows (capabilities expose interface_id() today, see capos-lib/src/cap_table.rs), and the Manual returns that interface’s manual page, or not-found for an id it does not document (it covers only in-tree interfaces). It is the programmatic complement to section-2 browsing – convenient for an SDK or agent that holds a cap and wants its reference page – not reflection on the live object.

Content Model: Man-Shaped Pages, Not Raw Markdown

A ManPage is structured text with the conventional ordered sections (NAME/SYNOPSIS/DESCRIPTION/SEE ALSO/…), not free-form Markdown. This is a deliberate, load-bearing choice. The published docs/ tree is long-form Markdown – proposals, architecture prose, mermaid diagrams, mdBook preprocessor directives – and is not a man-shaped corpus. The Manual therefore serves a purpose-authored, man-shaped corpus built at make time; it is a distinct artifact, not a verbatim mirror of docs/*.md. The guest renders the fixed section set (terminal pager / web pane) and needs no full Markdown engine.

What this corpus does and does not share with the published mdBook site:

  • Shared: the taxonomy provenance and the schema-derived section-2 interface membership. The man corpus and the published site are tagged from the same front-matter/topics vocabulary, so neither invents a category the other does not recognize. They do not share top-level navigation keys: topics @2 navigates by manual section (Commands, Capabilities, …), while the site’s llms.txt navigates the full docs/ tree by docs/SUMMARY.md section. The in-system index is a curated subset and deliberately diverges from the site’s navigation (see the Phase 4 settlement under Phasing).
  • Not shared: the long-form prose body. The manual is concise reference; the site is the depth. Concept pages (section 7) are short man-shaped summaries whose SEE ALSO points at the fuller site/docs/ page. The manual does not claim to reproduce that long-form content.

This removes the conflation between “reuse the corpus” and “serve man-shaped sections”: the taxonomy is reused; the man pages are curated.

Page provenance (the three sources)

ManPage.source records how each page’s body was produced:

  1. schemaReflection – interface manuals (section 2/5). The page structure is derived from the schema at build time: capnp reflection makes method/field names and signatures recoverable, so the SYNOPSIS (method list with ordinals and parameter/return shapes) is generated directly from schema/capos.capnp from day one and cannot drift from the live interface. The DESCRIPTION prose comes from .capnp doc comments. Two prerequisites are not done today and are tracked as their own tasks (see Sequencing): (a) the schema carries almost no doc comments, and (b) the no_std generated bindings under tools/generated/ do not preserve schema doc text. Until (a)/(b) land, the generated SYNOPSIS is real but the DESCRIPTION falls back to curated prose keyed by interface id. So the Phase-1 drift surface is the prose body only – not the method list – and a build check (below) forbids a missing page.
  2. prose – authored reference (section 1/3/7/8). Man pages authored in the manual corpus for commands, the runtime/SDK, concepts, and operations. Curated, man-shaped, taxonomy-tagged.
  3. runtime – live facts (section 8). A small set of pages interpolate live state (current run target, the caller’s granted capabilities for caps/inspect), sourced from existing caps such as SystemInfo, and are marked Source.runtime so they are never cached as static.

Subset rule (what is and isn’t in the manual)

The in-system manual ships sections 1/2/3/5 in full plus curated section-7/8 summaries. It deliberately does not carry the long-form research, proposal, and design corpus – that stays on the published site. To keep the boundary legible to an explorer rather than surprising:

  • PageRef.siteOnly marks topics that exist on the published site but have no in-manual page, so apropos results distinguish them visibly.
  • A concept page whose depth lives on the site ends with an explicit SEE ALSO link to the fuller site page.
  • The subset rule is itself documented in intro(7), and a build check classifies every docs/ page as in-manual, summarized, or site-only so the boundary stays explicit as the docs grow.

Delivery and freshness

The man corpus and generated interface manuals are compiled to a compact, read-only blob and delivered like the boot manifest – a boot-packaged payload read through an offset/length method, mirroring BootPackageCap (kernel/src/cap/boot_package.rs). The blob is built at make time, so a given ISO’s manuals match that build. Every page carries the build/commit id that produced it (ManPage.buildId plus Manual.buildInfo @5), so an explorer of a possibly-drifting public build can always tell which capOS version they are reading about. A build check fails the build if any in-tree capability interface lacks a section-2 page. A future storage-backed Manual service (once persistence exists) can serve a mutable corpus without changing the capability shape.

Three Surfaces

All three surfaces are clients of the same Manual capability. None of them re-implement documentation; they render ManPage values.

1. Shell man / apropos

A man builtin (and apropos alias for man -k) joins the shell’s existing command dispatch (shell/src/main.rs, the match command { ... } block that already handles help, caps, inspect, motd). It calls Manual.page / Manual.apropos on a Manual cap held in shell state and paginates the body to the terminal.

help, man, and inspect stay distinct and complementary:

  • help remains the terse, cap-free built-in command list for first orientation; help <command> becomes a shortcut to man 1 <command>.
  • man <name> is the full reference served by the Manual cap.
  • man <capability> complements the existing inspect <cap> command: inspect shows this instance, man shows the interface.

2. Self-served web-UI doc viewer (Implemented)

The self-served web UI service (demos/remote-session-web-ui/) already holds capabilities in session state and exposes them over HTTP routes that return JSON view-models – raw caps never reach the browser. A Manual cap added to that service backs the routes /api/man?name=&section=, /api/apropos?q=, and /api/topics, and a viewer page renders pages with a topic sidebar, a search box (apropos), and clickable SEE ALSO links. This is the surface that makes a public explorer demo self-explanatory: a viewer can browse the capability catalog and concept pages with no shell and no external site.

As-built notes:

  • The Manual cap is granted to the web-UI service via the manifest ({kernel: "manual"}), looked up from the service CapSet alongside console/sessions/broker, and never crosses to the browser; only the rendered JSON view-models do. The doc routes are read-only and require no login, since Manual confers no authority beyond documentation access.
  • The viewer is served as a separate /manual page, distinct from the login-proof page on /, so the session-proof leak assertions on / stay whole-body while the manual page may legitimately display capability interface names as documentation text.
  • SEE ALSO and topic refs are rendered from the structured seeAlso/pages PageRef lists (not by re-parsing body text). siteOnly refs link out to the published site in a new tab; in-manual refs navigate in-system. The Phase 1 corpus builder currently emits only in-manual refs (siteOnly is always false); the viewer’s siteOnly path is proven against the shipped ref-classifier, and wiring real siteOnly refs is follow-on corpus authoring.
  • The topic sidebar lists the section-based topics @2 index (Commands, Capabilities, …); the front-matter taxonomy-backed index remains future work.
  • Proof: make run-remote-session-self-served-web-ui drives a headless browser that logs in, renders console(2) with its source/buildId provenance, searches apropos timer, and asserts the siteOnly-vs-in-manual link classifier – without leaking any raw cap or session internals to the browser.

3. Programmatic capnp API

Because Manual is an ordinary capability, any process or agent granted it can fetch documentation programmatically – an SDK can surface inline help, and an in-system agent can ground itself on the real interface contracts. describe @4 resolves a held interface id to its section-2 page, topics @2 returns the navigation taxonomy, and buildInfo @5 (plus each page’s buildId) tells an agent exactly which capOS build it is reading about. Agents get one machine-readable index of everything the running system documents: the man-shaped subset, navigated by manual section. That subset is a curated projection of the published site rather than a byte-identical copy of its llms.txt navigation – the in-system manual and the published full-tree index deliberately differ in scope (see the Phase 4 settlement under Phasing), and the invariant that holds is shared taxonomy provenance, not identical keys.

The features that lift this above a flat page dump, all riding on the man core:

  • Topic index over apropos. topics @2 exposes the curated front-matter taxonomy; apropos @1 does free-text keyword fallback against a keyword index built into the blob at make time (NAME lines + tagged keywords), not a linear scan of page bodies. A reader navigates by topic when they know the area and by keyword when they do not.
  • SEE ALSO as a real graph. ManPage.seeAlso is structured PageRefs, not prose, so every surface renders them as links and an explorer can walk the concept graph – the Plan 9 “pointers on demand” model. siteOnly refs link out to the published site.
  • Example-first quick views. ManPage.examples carries tldr-style snippets so the common case (“how do I actually use spawn?”) is answered in five lines before the full DESCRIPTION.
  • Provenance is visible. source, lastReviewed, and buildId travel with every page, so a reader can tell an auto-generated interface manual from curated prose and see which build it describes.

Authority and Security Model

  • Read-only, no ambient authority. Manual only returns text. Holding it grants nothing but the ability to read documentation; it cannot mutate state or widen any other authority. Documentation access is itself a capability, consistent with Principle 1 (no ambient authority).
  • Scoped distribution. The web-UI service and operator shells receive Manual via manifest cap grants or the AuthorityBroker bundle, exactly as SystemInfo is distributed today. A public/anonymous web session can be granted a Manual with no risk, because it confers no authority.
  • Browser boundary unchanged. As with all web-UI routes, the browser receives only rendered ManPage JSON view-models; the Manual cap never crosses into browser JavaScript.
  • No code execution. Pages are structured text rendered by the viewer. The Manual never serves executable content, so an explorable public instance does not gain a new code path from documentation serving.

Phasing

  1. Phase 1 – capability + shell man. (Implemented.) Defined the Manual interface, authored the man-shaped corpus, delivered the boot-packaged blob, implemented page/apropos/topics/section/describe/buildInfo, stamped every page with the build/commit id, added the build check that forbids a missing section-2 page, and added the shell man/apropos builtins. Section-2 SYNOPSIS is schema-generated; section-2 DESCRIPTION is sourced from .capnp doc comments at build time (see Phase 1 as-built notes); sections 1/7 ship authored prose. Proof: make run-system-manual-smoke.

  2. Phase 2 – web-UI viewer. (Implemented.) Added Manual to the self-served web-UI service and shipped the viewer page (topic sidebar, apropos search, page provenance, and clickable SEE ALSO links with siteOnly external linking). This is the public-explorer-facing milestone. Proof: make run-remote-session-self-served-web-ui.

  3. Phase 3 – schema-derived section-2 DESCRIPTION. (Satisfied at Phase 1; proofs hardened.) The Phase-1 corpus builder already auto-generates the section-2 DESCRIPTION (and the per-method SYNOPSIS docs) from .capnp doc comments and backs describe @4 with a build-time interfaceId->page index, so the prose-drift window the original Phase-3 plan targeted is already closed. Phase 3 therefore reduces to locking and proving that already-landed contract, with two deliberate deviations from the original plan:

    • Fail-closed over warn-and-fallback. The build check fails when an interface lacks a doc comment (manualc enforce_coverage), rather than falling back to curated prose with a warning. Fail-closed keeps the served pages provably schema-sourced and is the reviewed Phase-1 choice; it is kept, not regressed.
    • No live runtime reflection. The prerequisite binding-preservation task emits doc text as Rust /// attributes, which are not runtime-accessible data, and the kernel cannot introspect interface docs at call time. Live reflection in describe is neither feasible at the current capnp version nor necessary: manualc reparses the live schema each build and make generated-code-check enforces schema<->binding parity, so the served signatures and DESCRIPTION cannot drift from schema/capos.capnp.

    Proofs: a manualc host test (describe_index_resolves_to_schema_derived_section2_pages) asserts that for every shipped interface the describe @4 descriptor resolves to the same schema-derived section-2 page page @0 serves, with the doc comment in the DESCRIPTION body; make run-system-manual-smoke asserts the served Console DESCRIPTION body and a method-doc line both originate from doc-comment text. Open follow-up: describe @4 has no userspace client today (the shell man builtin and ManualClient use page @0), so a runtime describe exerciser would need a ManualClient::describe_wait plus a caller (out of this slice’s tools/ + kernel/src/cap/manual.rs scope).

  4. Phase 4 – programmatic API + agent export. (Settled with deviation.) The programmatic surface this phase set out to stabilize – describe @4 (interfaceId -> section-2 page), buildInfo @5 (build/commit provenance), and topics @2 (navigation taxonomy) – already shipped at Phase 1 with kernel dispatch (kernel/src/cap/manual.rs) and ManualClient methods (capos-rt/src/client.rs). Phase 4 therefore reduces to declaring that contract stable for SDK/agent consumers and locking the consistency invariants that genuinely hold, with one deliberate deviation from the original “unify … so they share one source” framing:

    • describe @4/buildInfo @5 are stable programmatic API. describe @4 is the id-keyed complement to section-2 browsing (it takes an interface id the caller already holds; it does not reflect on a live object) and resolves through the build-time descriptor index manualc emits. buildInfo @5 returns the corpus commit, and every page carries that same commit as its buildId, so an agent grounding itself on a possibly-drifting public build can always resolve which capOS build any page describes. These are additive, read-only methods; widening them later stays Cap’n Proto-compatible.
    • Deviation: byte-identical navigation keys / a single source pass are infeasible and undesirable. The original plan called for the in-system topics @2 list to be identical to the published llms.txt navigation keys, emitted from one source pass. That contradicts the reviewed subset rule: the in-system manual is the man-shaped subset navigated by manual section (topics @2 keys commands/capabilities/runtime-sdk/ manifests-schemas/concepts/operations), while the published llms.txt indexes the full docs/ tree navigated by docs/SUMMARY.md site sections (Start Here/Runnable Demos/System Architecture/…). The two key sets are deliberately disjoint and cover different content scopes; forcing identity would either shrink the published agent index to the man subset or replace the manual’s section navigation, regressing a shipped, asserted artifact. The two artifacts are also produced by separate compilers in separate languages with separate inputs (tools/manualc, a Rust capnp-blob compiler reading schema/capos.capnp + docs/manual/; and tools/mdbook-doc-metadata/generate-llms-txt.js, a Node site generator reading docs/SUMMARY.md + page front matter), so a literal single pass is a re-architecture, not a documentation slice.
    • What is delivered instead: the shared invariants that cannot diverge are locked. The genuinely shared dimension is the front-matter/topics taxonomy provenance and the schema-derived section-2 interface membership, not the top-level navigation keys. A manualc host test (topics_taxonomy_is_canonical_and_stable) pins the served topics @2 taxonomy to its canonical key/title/section/order projection so the navigation cannot silently drift, and build_info_commit_grounds_every_page proves the buildInfo @5 round-trip stamps every page with the corpus commit.

    Proofs: the two manualc host tests above; make run-system-manual-smoke asserts buildInfo @5 grounds pages of both provenance kinds (the schema-derived console(2) and the authored spawn(1)) with one consistent, non-placeholder build id. Open follow-up: a true single-source agent export shared with the published llms.txt would require reconciling the subset rule (a documented man-subset projection of the site taxonomy) plus schema, Makefile, corpus, and kernel scope beyond this slice’s tools/ + proposal surface; and describe @4/topics @2 still have no shell exerciser (the shell man/apropos builtins use page @0/apropos @1), carried from the Phase 3 follow-up.

Sequencing and Priority

Recording this design now is cheap and the public-explorer angle is real, but Phase 1 implementation competes with foundational work. capOS does not yet have persistence, a userspace network stack, or the DDF Task-5 userspace device authority gate closed. Phase 1 (capability + boot-packaged corpus + shell man) is buildable on current infrastructure and does not depend on those gates; Phase 2 depends only on the already-implemented self-served web UI. Unless a public demo is imminent, sequence Phase 1 behind the foundational milestones on the priority ladder rather than ahead of them. The schema doc-comment authoring and binding-preservation prerequisites (Phase 3) can proceed independently and are worth doing regardless, because they also improve review and the published interface docs.

Relationship to Existing Proposals

  • SystemInfo proposal: retire the # Future: helpTopics @2 / manPage @3 stubs there in favor of this capability; SystemInfo stays scalar (MOTD/hostname/host metadata). That proposal’s front matter and interface comment should be updated to point here when this lands.
  • mdBook documentation-site proposal: complementary, not competing. mdBook remains the host-rendered public site (the long-form depth); Manual is the in-system concise reference. They share the front-matter/topics taxonomy provenance, so navigation cannot drift on the taxonomy it is drawn from. They do not share top-level navigation keys: the site’s llms.txt indexes the full docs/ tree by docs/SUMMARY.md section, while topics @2 indexes the man-shaped subset by manual section. The Phase-4 agent export is the in-system manual’s own machine-readable index of that subset, not a copy of the site’s llms.txt (see the Phase 4 settlement under Phasing).
  • Remote-session UI security proposal: the web-UI viewer inherits its view-model and browser-boundary rules; this proposal adds no new authority-bearing route.
  • Interactive command surfaces proposal: a future typed CommandSession could host a richer in-shell pager for man, but Phase 1 uses the existing line-based terminal write path.
  • Cloud deployment / public-release boundaries: in-system docs are a prerequisite for a self-explanatory public explorer demo; this proposal is the documentation half of that story.

Open Questions

  • Subset boundary precision. The subset rule (sections 1/2/3/5 full, 7/8 summaries, long-form on the site) needs a concrete inclusion list, and the build check that classifies each docs/ page as in-manual / summarized / site-only must be authored so the boundary stays legible as docs grow.
  • Schema doc-comment authoring. Section-2 DESCRIPTION quality (Phase 3) depends on writing real doc comments across schema/capos.capnp; that authoring is its own tracked work and gates the auto-generated path. A REVIEW.md rule now requires doc comments on new/changed interfaces so the gap does not regrow.
  • Structured page schema. Whether man sections are a fixed set of typed fields or a single tagged-text body; leaning toward a small fixed set so both renderers stay trivial.

Design Grounding

  • Capability dispatch and interface_id(): capos-lib/src/cap_table.rs; boot-packaged read-only blob pattern: kernel/src/cap/boot_package.rs.
  • Shell command dispatch for the man builtin: shell/src/main.rs.
  • Web-UI cap-holding + view-model boundary: demos/remote-session-web-ui/.
  • Front-matter / topics taxonomy reused for topics: tools/mdbook-doc-metadata/ and mdBook proposal.
  • Prior interface sketch and the scope tension this resolves: SystemInfo proposal.

Relevant Research

  • Unix manual conventions and section ordering – man-pages(7).
  • Plan 9’s section split (commands / devices / file servers / protocol / formats) and “follow the documentation pointers on demand” navigation model, which motivate the capability-section mapping and the SEE ALSO graph (docs/research/ Plan 9 report).
  • Cap’n Proto runtime reflection (RawStructSchema / dynamic schema), the basis for schema-derived section-2 SYNOPSIS and DESCRIPTION.
  • Modern discovery affordances – tldr/apropos/topic indexes – adopted as a navigation layer over, not a replacement for, the man core.

Proposal: Interactive Command Surfaces

Typed command surfaces for native interactive applications without moving application parsing into StdIO text streams.

Current Target Versus Future Design

The immediate target is deliberately narrower than this proposal:

  • capos-shell exposes generic process control commands, including spawn for asynchronous launch and run for launch-and-wait.
  • Chat and adventure clients are ordinary spawned commands, not shell builtins.
  • Interactive child I/O uses an explicit StdIO endpoint client with stdin/stdout/stderr-shaped semantics while the shell keeps ownership of its TerminalSession.
  • Focused QEMU smokes prove the resident-service plus shell-spawned-client path before the native command protocol hardens.

The future native design is the CommandSession/CommandSurface protocol below. It should replace semantic command parsing inside chat/adventure clients once the prototype has proved the process, grant, wait, and terminal bridging mechanics.

The native shell substrate this proposal extends is described in Shell; the agent-mode tool-use loop that will consume the same command surfaces as typed tool descriptors lives in Language Models and Agent Runtime.

Problem

The current chat/adventure worktree moved application commands out of capos-shell builtins and into ordinary shell-spawned clients. That fixes one bad boundary, but it leaves another one: the clients read lines from StdIO and parse command text such as go north, take key, /join #lobby, and say hello themselves.

That is still too stringly for capOS. The kernel and services already expose typed capabilities. Native interactive applications should not receive their primary operation as an unstructured terminal line and then rebuild an ad hoc parser. StdIO is useful for textual programs, logs, compatibility layers, and simple smoke harnesses. It is not the right semantic boundary for a native application command language.

The other design pressure is terminal reuse. The same native shell should work from a local UART, GUI pane, web terminal, or test harness. That argues for a terminal host process that owns terminal transport and rendering separately from the shell process that owns command routing and capability context.

Goals

  • Keep application-specific verbs out of capos-shell.
  • Keep application command semantics out of unstructured StdIO text parsing.
  • Let a user type familiar command forms such as go north or chat join #lobby while the executable representation is a typed invocation.
  • Support nested subcommands without hardcoding app grammar into the shell.
  • Let terminal hosts provide line editing, completion, history, resize, and GUI/web rendering from the same command metadata.
  • Preserve typed service authority: parsing a command never grants access, and every effect still requires the right capability.

Non-Goals

  • POSIX shell compatibility.
  • A global command namespace.
  • Making terminal text a security boundary.
  • Removing StdIO; it remains the byte/text stream adapter for programs whose interface really is textual.

Layering

flowchart TD
    Uart[UART TerminalHost] --> Terminal[Terminal entity]
    Web[Web TerminalHost] --> Terminal
    Gui[GUI TerminalHost] --> Terminal
    Terminal --> Shell[Native shell session]
    Shell --> Cmd[Interactive CommandSession]
    Cmd --> Adventure[Adventure service cap]
    Cmd --> Chat[Chat service cap]
    Shell --> Launcher[Restricted launcher]
    Shell --> Broker[AuthorityBroker]

The terminal host owns raw input/output, line discipline, presentation state, history, paste handling, resize events, and later GUI/web affordances. The terminal entity is the session object the host exposes to a foreground shell or application view. TerminalSession remains the capability boundary for a foreground text session, but it does not have to be implemented inside the shell.

The native shell owns command namespace, current capability context, spawn/wait state, and policy-mediated bundle changes. It can run from any terminal host because it talks to the terminal entity, not to a particular UART.

An interactive application owns a CommandSession. It exposes a command surface and receives structured invocations. The application may be a thin adapter over service capabilities, as the adventure client should be, or a resident service may expose the command session directly.

Command Pattern

command <args> is acceptable as user-facing syntax, but it must not become the application ABI. It is a parseable notation for a declared command surface. The shell or terminal host parses text into a CommandInvocation; the application receives typed fields.

Conceptual schema:

struct CommandSurface {
  revision @0 :UInt64;
  prompt @1 :Text;
  commands @2 :List(CommandSpec);
}

struct CommandSpec {
  path @0 :List(Text);
  summary @1 :Text;
  args @2 :List(CommandArg);
  flags @3 :List(CommandFlag);
  redaction @4 :List(RedactionClass);
}

struct CommandArg {
  name @0 :Text;
  kind @1 :CommandValueKind;
  required @2 :Bool;
  variadic @3 :Bool;
  restOfLine @4 :Bool;
  completions @5 :CompletionSource;
}

struct CommandInvocation {
  surfaceRevision @0 :UInt64;
  path @1 :List(Text);
  args @2 :List(CommandValue);
  flags @3 :List(CommandFlagValue);
}

interface CommandSession {
  describe @0 () -> (surface :CommandSurface);
  invoke @1 (command :CommandInvocation) -> (result :CommandResult);
  poll @2 (maxEvents :UInt16) -> (events :List(CommandEvent));
  close @3 () -> ();
}

The parser is generic:

  • Match the longest declared command path.
  • Parse arguments according to the declared shapes.
  • Treat ambiguous prefixes as errors with alternatives.
  • Treat restOfLine as one text argument; do not split it again in the app.
  • Attach redaction metadata before audit or transcript recording.
  • Re-read CommandSurface when a command returns a new revision.

The application can still reject a typed invocation if the command is no longer valid. That is ordinary semantic validation, not text parsing.

Subcommand Nesting

Nested subcommands work if the command path is represented as a token list rather than a single string. Examples:

go north
take brass-key
say hello there
chat join #lobby
chat who
inventory equip lantern
admin npc spawn wanderer room=atrium

Those become:

path=["go"], args={direction:"north"}
path=["take"], args={item:"brass-key"}
path=["say"], args={text:"hello there"}
path=["chat","join"], args={channel:"#lobby"}
path=["chat","who"], args={}
path=["inventory","equip"], args={item:"lantern"}
path=["admin","npc","spawn"], args={kind:"wanderer", room:"atrium"}

The shell does not need adventure-specific code for any of these. It needs a generic command tree, longest-prefix matching, value parsers, and completion hooks. The same mechanism can describe shell commands such as spawn, wait, login, and caps, even if the implementations remain inside the shell for now.

Subcommand nesting is also a better fit for GUI/web sessions than raw StdIO. A terminal host can render chat join as a command palette entry, offer room completions for go, or show buttons for zero-argument commands such as look, all from the same metadata.

Adventure Shape

The adventure command session should own only the caps it needs:

adventure       Adventure or Endpoint client cap
chat            Chat or Endpoint client cap
session         optional UserSession metadata cap

It should expose a dynamic surface derived from current player state:

  • look
  • go <direction> with room-specific direction completions
  • take <item> with visible item completions
  • drop <item> with inventory completions
  • inventory
  • say <text...> with restOfLine=true
  • chat join <channel>
  • chat who
  • quit

The shell or terminal host parses those forms. The adventure command session turns the resulting invocation into typed Adventure and Chat calls. The adventure service still validates the session-bound caller identity, room, exits, items, and chat channel authority. Dynamic completions are convenience, not authority.

This is the balance capOS wants: generic shell integration, app-owned command metadata, typed service calls, and no application-specific shell builtins.

The same describe-returned CommandSurface is the metadata source the agent runner in Language Models and Agent Runtime projects to typed tool descriptors with per-tool permission modes (auto / consent / stepUp / forbidden). A command surface is therefore not only a shell parsing input – it is the contract surfaced to interactive operators, scripted harnesses, and model-driven tool-use loops alike.

Role of StdIO

StdIO remains useful, but it should be demoted to a transport and compatibility interface:

  • output streams for simple textual programs,
  • test harnesses that script input and check transcript output,
  • POSIX personality descriptor emulation,
  • applications whose real protocol is text.

For capOS-native interactive applications, StdIO.read() should not be the primary command interface. A command session can still emit render events that the shell forwards to a terminal host, and a compatibility adapter can expose the same session as text when necessary.

Terminal Host Separation

The shell should not permanently own the terminal implementation. A separate terminal host process gives the system one shell that can be reused across different front ends:

  • local UART host for QEMU and early hardware,
  • web host for browser terminal sessions,
  • GUI host for a desktop pane or command palette,
  • test host for smoke scripts.

Each host owns a terminal entity and grants a foreground TerminalSession or equivalent view to the shell. The shell runs command sessions and returns render/update events. The host decides how to display them.

This also avoids a future false choice between “shell owns the terminal” and “child process receives the terminal.” The terminal entity can support a foreground lease, shell-mediated command sessions, and later split panes or GUI widgets without making every child process a terminal driver.

Migration Plan

  1. Land the current shell-spawned StdIO clients as an explicit prototype: no app-specific shell builtins, no terminal-cap delegation to children, and run available for blocking command execution.
  2. Add focused QEMU smokes for chat and adventure against that prototype so the resident service, exact grants, wait path, and terminal bridge have a stable regression target.
  3. Add a userspace CommandSession DTO/protocol in the shared demo/runtime layer, carried over ordinary Endpoint until a manifest-visible interface is worth committing.
  4. Teach capos-shell a generic command-surface parser and command-provider registry. Do not add chat, play adventure, go, take, or similar application verbs as hardcoded shell matches.
  5. Move adventure command parsing out of demos/adventure-client/ and into command descriptors plus typed Adventure/Chat invocations.
  6. Split terminal hosting from the shell when the local UART path needs to support a second front end or when the web terminal work starts. Until then, keep the current terminal implementation constrained to the TerminalSession boundary so the split is mechanical.

See Shell for the broader native-shell authority model the command-surface protocol plugs into, and Language Models and Agent Runtime for the agent-mode consumer that turns the same CommandSurface metadata into typed tool descriptors with explicit permission modes.

Proposal: Userspace Authority Broker and Init-Owned Shutdown

Problem

The current shell authentication path uses a kernel AuthorityBroker capability. The shell starts with anonymous authority, calls the broker for an anonymous bundle, then calls it again after password login for an operator bundle. That works, but it places session policy, launcher policy, and shell bundle construction inside the kernel.

That is the wrong long-term boundary. The kernel should provide primitive mechanisms: process creation, capability transfer, endpoint rendezvous, memory, terminal I/O, and process lifecycle. Login policy, operator profiles, service allowlists, and shell bundles are userspace policy and should be owned by init or an init-managed service.

Shutdown exposes the same issue. A shutdown command should not be a raw kernel poweroff capability passed to the shell. The natural capOS behavior is that the kernel halts when init and all remaining processes are gone. Shutdown policy should therefore be implemented as init-owned lifecycle orchestration: stop services, wait for them, release authorities, and then let init exit.

Current State

Implemented pieces today:

  • The kernel starts one init process from the boot manifest.
  • Init reads BootPackage, validates the init-owned service graph, spawns services, records exported capabilities, and waits for children.
  • The shell receives a terminal and anonymous authority, then upgrades after password login.
  • AuthorityBroker is a kernel capability implemented in kernel/src/cap/authority_broker.rs.
  • Demo launcher policy that used to live as kernel-side binary and worker allowlist constants is now carried by kernelParams.authorityBrokerPolicy in the boot manifest. capos-config validates that referenced binaries exist, duplicate entries are rejected, worker service grant names are explicit, and unknown worker service origins fail closed.
  • ProcessHandle supports wait, but not termination.
  • There is no init-owned lifecycle control capability yet.

The consequence is a mixed trust boundary: init owns service graph execution, but the kernel still owns shell session bundle policy.

Goals

  • Move authority-broker policy out of the kernel.
  • Make init, or an init-managed broker service, responsible for authenticated shell bundles.
  • Keep shell unauthenticated authority minimal.
  • Make shutdown an init-owned control operation, not a direct kernel shutdown cap.
  • Preserve the kernel rule that the system halts naturally when the last process exits.
  • Keep all authority transfer explicit and inspectable through capabilities.

Non-Goals

  • Do not add ambient service names or a global service registry.
  • Do not give shell raw ProcessSpawner before authentication.
  • Do not add a kernel “kill everything” syscall.
  • Do not introduce restart policy, persistence, or crash recovery in this proposal.
  • Do not solve multi-user policy; this proposal only moves the current local operator/anonymous policy out of the kernel.

Proposed Architecture

Init starts two policy-facing services:

  • authority-broker: userspace service that owns shell bundle policy.
  • shell: interactive shell, initially anonymous.

Init also keeps a private lifecycle table for services it spawned. That table contains process handles, service names, restart policy state, and shutdown ordering metadata. Init does not expose the raw table. It exposes attenuated control capabilities.

Capability Graph

flowchart TD
    Kernel[Kernel primitives] --> Init[init]
    Kernel --> Terminal[TerminalSession]
    Kernel --> Spawner[ProcessSpawner]
    Kernel --> Sessions[SessionManager]
    Kernel --> Audit[AuditLog]
    Kernel --> Creds[CredentialStore]

    Init --> Broker[authority-broker service]
    Init --> Shell[shell]
    Init --> Services[managed services]

    Broker --> ShellAnon[anonymous shell bundle]
    Broker --> ShellOp[operator shell bundle after login]
    Init --> Shutdown[init-owned ShutdownControl]
    Broker --> ShellOp
    Shutdown --> ShellOp

The shell talks to the broker over an endpoint. Before login, the broker returns an anonymous bundle with no service-management authority. After login, the broker returns an operator bundle that includes a restricted launcher and, if policy allows, an init-owned shutdown control capability.

Interfaces

The exact schema can evolve, but the minimum shape should separate broker policy from init lifecycle control.

interface AuthorityBroker {
    shellBundle @0 (sessionCapId :UInt32, profile :Text)
        -> (launcherIndex :UInt16,
            sessionIndex :UInt16,
            hasShutdownControl :Bool,
            shutdownControlIndex :UInt16);
}

interface ShutdownControl {
    shutdown @0 () -> ();
}

interface ProcessHandle {
    wait @0 () -> (exitCode :Int64);
    terminate @1 (reason :Text) -> ();
}

AuthorityBroker can be implemented as a userspace service using endpoint IPC instead of a kernel cap. ShutdownControl is produced by init, not by the kernel. ProcessHandle.terminate is a primitive lifecycle operation, but the kernel only targets one process handle; init owns the policy that decides which handles to terminate and in what order.

Shutdown Flow

  1. Shell starts anonymous and does not hold ShutdownControl.
  2. User runs login.
  3. Shell obtains an operator bundle from the userspace broker.
  4. If policy allows, the bundle includes ShutdownControl.
  5. User runs shutdown.
  6. Shell invokes ShutdownControl.shutdown.
  7. Init stops accepting new service operations.
  8. Init asks managed services to terminate in dependency order.
  9. Init waits for all service handles to exit.
  10. Init releases remaining capabilities and exits.
  11. The kernel observes no remaining runnable user processes and halts through the existing last-process-exited path.

This keeps final machine shutdown in the kernel, but keeps shutdown authority and orchestration in userspace.

Broker Migration Plan

Phase 1: Define Userspace Interfaces

  • Add schema for endpoint-served AuthorityBroker and ShutdownControl.
  • Keep the kernel broker temporarily for compatibility.
  • Keep the manifest-owned authorityBrokerPolicy shim as the compatibility source for admitted demo binaries until the userspace broker owns equivalent policy directly.
  • Add runtime clients for both interfaces.
  • Add QEMU proof that an anonymous shell cannot call shutdown.

Phase 2: Init-Owned Shutdown

  • Extend init with a lifecycle table for spawned services.
  • Add a private init service endpoint for shutdown requests.
  • Add ProcessHandle.terminate or equivalent single-process lifecycle primitive.
  • Make init terminate and wait for managed services before exiting.
  • Add QEMU proof that shutdown after login exits QEMU cleanly.

Phase 3: Userspace Authority Broker

  • Implement authority-broker as a userspace service.
  • Grant it only the policy inputs and capabilities needed to mint shell bundles.
  • Have shell obtain anonymous and operator bundles from that service.
  • Keep shell without raw ProcessSpawner; it should receive only restricted launch authority.
  • Add QEMU proof that pre-login shell cannot spawn privileged services and post-login shell can run the expected demo commands.

Phase 4: Retire Kernel Broker

  • Remove kernel/src/cap/authority_broker.rs.
  • Remove KernelCapSource::AuthorityBroker.
  • Remove kernel-side broker bundle construction and tests.
  • Update docs so the kernel boundary is again primitive-only.

Security Properties

  • Shell starts without shutdown authority.
  • Shutdown authority is granted only after an authenticated session is proven.
  • The broker cannot invent kernel powers; it can only delegate capabilities it received from init.
  • Init remains the root of service lifecycle policy.
  • Kernel process termination remains per-handle, not global.
  • Service shutdown is auditable because it flows through init and named process handles.

Open Questions

  • Should ShutdownControl.shutdown be one-way, or should it return staged progress events before init exits?
  • Should services receive a graceful StdIO close, a typed lifecycle signal, or only ProcessHandle.terminate?
  • Should the broker be a separate process, or should init directly expose the broker endpoint until service supervision is stronger?
  • How should restart policies interact with shutdown mode?
  • Should shutdown require a fresh authentication event, or is the current operator session sufficient?

Verification

Required QEMU proofs:

  • Anonymous shell: shutdown is denied or unavailable.
  • Operator shell: login returns shutdown authority.
  • Shutdown command causes init to terminate managed services and exit.
  • QEMU exits through the existing last-process halt path.
  • Existing adventure/chat demo still works before shutdown.

Host tests should cover:

  • Broker policy decisions for anonymous vs operator profiles.
  • Init shutdown ordering over a synthetic lifecycle table.
  • Manifest validation rejecting direct shell access to privileged lifecycle primitives before login.

Proposal: Go Language Support via Custom GOOS

Running Go programs natively on capOS by implementing a GOOS=capos target in the Go runtime.

Current Manual Pages

  • Go VirtualMemory Contract freezes the current allocator-facing memory contract for this proposal.
  • Programming Languages summarizes the current language support matrix and the distinction between native runtime adapters, POSIX compatibility adapters, and WASI host adapters. The Go row points back here for the native GOOS=capos track and to the WASI host adapter’s Phase W.8 TinyGo / upstream GOOS=wasip1 CUE evaluator path.
  • Userspace Binaries holds the overall language-runtime track. Its “Future: Go (GOOS=capos)” section delegates the native plan to this proposal, and its “Phase W.8 (TinyGo / Go-on-WASI CUE evaluator, blocked)” entry tracks the WASI-side interim path.
  • WASI Host Adapter documents the in-tree wasmi-backed host. Phase W.8 there is the TinyGo / upstream Go (GOOS=wasip1) CUE evaluator slice that runs inside the host adapter and bridges to the native Go track described in this proposal. The detailed plan lives in WASI Host Adapter Task 9.
  • In-Process Threading freezes the thread/process ownership contract that Phase 2 of this proposal builds on.
  • Park Authority freezes the compact CAP_OP_PARK / CAP_OP_UNPARK ABI that the Go runtime’s futex glue must target instead of a Linux-style futex syscall namespace.
  • Memory Management documents the implemented kernel memory and baseline VirtualMemory behavior.
  • Userspace Runtime documents the capos-rt client surface that a future Go runtime port will call.
  • LLVM Target is the main research grounding for Go runtime and target-triple work.

Motivation

Go is the implementation language of CUE, the configuration language planned for system manifests. Beyond CUE, Go has a large ecosystem of systems software (container runtimes, network tools, observability agents) that would be valuable to run on capOS without rewriting.

The userspace-binaries proposal keeps Go as a dedicated future runtime track. This proposal explores the native path: a custom GOOS=capos that lets Go programs run directly on capOS hardware, without a WASM interpreter in between. Go through WASI remains a narrower option for CPU-bound tools such as CUE evaluation before the native runtime port exists.

Why Go is Hard

Go’s runtime is a userspace operating system. It manages its own:

  • Goroutine scheduler — M:N threading (M OS threads, N goroutines), work-stealing, preemption via signals or cooperative yield points
  • Garbage collector — concurrent, tri-color mark-sweep, requires write barriers, stop-the-world pauses, and memory management syscalls
  • Stack management — segmented/copying stacks with guard pages, grow/shrink on demand
  • Network poller — epoll/kqueue-based async I/O for net.Conn
  • Memory allocator — mmap-based, spans, mcache/mcentral/mheap hierarchy
  • Signal handling — goroutine preemption, crash reporting, profiling

Each of these assumes a specific OS interface. The Go runtime calls ~40 distinct syscalls on Linux. capOS currently has 2.

Syscall Surface Required

The Go runtime’s Linux syscall usage, grouped by subsystem:

Memory Management (critical, blocks everything)

Go runtime needsLinux syscallcapOS equivalent
Heap allocationmmap(MAP_ANON)VirtualMemory.reserve + commit, or compatibility map
Heap deallocationmunmapVirtualMemory.unmap releases reservations and committed frames
Stack guard pagesmmap(PROT_NONE) + mprotectReserve uncommitted guard pages; use committed VM_PROT_NONE only when contents must be retained
GC needs contiguous arenasmmap with hintsContiguous virtual reservations; physical frames are committed sparsely
Commit/decommit pagesmadvise(DONTNEED)VirtualMemory.commit / decommit within reserved ranges

capOS needs: A sys_mmap-like capability or syscall that can:

  • Map anonymous pages at arbitrary user addresses
  • Set per-page permissions (R, W, X, none)
  • Allocate contiguous virtual ranges without requiring contiguous physical frames
  • Decommit without unmapping (for GC arena management)

This could be a VirtualMemory capability:

interface VirtualMemory {
    # Map anonymous pages at hint address (0 = kernel chooses)
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    # Unmap pages
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    # Change permissions on mapped range
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
    # Reserve virtual address space without physical frames
    reserve @3 (hint :UInt64, size :UInt64) -> (addr :UInt64);
    # Commit physical frames inside a reserved range
    commit @4 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
    # Decommit physical frames while keeping the range reserved
    decommit @5 (addr :UInt64, size :UInt64) -> ();
}

The exact Go allocator contract is frozen in Go VirtualMemory Contract: map stays a compatibility operation, while reserve, commit, and decommit separate virtual address reservation from physical frame commitment and make guard-page behavior explicit.

Threading (critical for goroutines)

Go runtime needsLinux syscallcapOS equivalent
Create OS threadclone(CLONE_THREAD)Thread capability / in-process thread lifecycle
Thread-local storagearch_prctl(SET_FS)ThreadControl.setFsBase; per-ThreadRef TLS ownership for Go integration
Block threadfutex(WAIT)ParkSpace compact CAP_OP_PARK
Wake threadfutex(WAKE)ParkSpace compact CAP_OP_UNPARK
Thread exitexit(thread)ThreadControl.exitThread capability operation

capOS baseline: process-local thread lifecycle and private ParkSpace wait/wake exist as the kernel substrate. The remaining Go work is runtime integration: capos-rt clients, newosproc glue, per-ThreadRef TLS ownership, and GC/runtime coordination across those kernel threads.

ThreadControl.setFsBase is a current-ThreadRef operation, not a process-global mutation. Go integration must allocate a distinct TLS block and FS base for each runtime M/OS thread, and context switch must preserve FS base as per-thread state before true multi-threaded Go is treated as supported.

Design alternatives considered:

Option A: Kernel threads. The kernel manages threads (multiple execution contexts sharing one address space). Each thread has its own stack, register state, and FS base, but shares page tables and cap table with the process. This is what Linux does and what Go expects.

Option B: User-level threading. The process manages its own threads (like green threads). The kernel only sees one execution context per process. Go’s scheduler already does M:N threading, so it could work with a single OS thread per process — but the GC’s stop-the-world relies on being able to stop other OS threads, and the network poller blocks an OS thread.

Option A is the selected substrate for Go compatibility. Option B is more capability-aligned (threads are a process-internal concern), but it requires larger Go runtime modifications and does not fit the current kernel-thread checkpoint.

Synchronization

Go runtime needsLinux syscallcapOS equivalent
Park waitfutex(FUTEX_WAIT)ParkSpace compact CAP_OP_PARK
Park wakefutex(FUTEX_WAKE)ParkSpace compact CAP_OP_UNPARK
Atomic compare-and-swapCPU instructionsAlready available (no kernel support needed)

Linux futexes are a kernel primitive (block/wake on a userspace address). capOS exposes park authority through a ParkSpace capability from the start. Go futex glue should target the compact capability-authorized park operations defined in the ParkSpace architecture rather than introducing a Linux-style futex syscall namespace or routing failed wait / empty wake through generic Cap’n Proto method dispatch. Blocked/resume performance still needs measurement under Go’s runtime workload, but that does not change the authority or key model.

Time

Go runtime needsLinux syscallcapOS equivalent
Monotonic clockclock_gettime(MONOTONIC)Timer cap .now()
Wall clockclock_gettime(REALTIME)Timer cap or RTC driver
Sleepnanosleep or futex with timeoutTimer cap .sleep() or park timeout
Timer eventstimer_create / timerfdTimer cap with callback or poll

Timer cap now and sleep are implemented for monotonic time and bounded sleep. Wall-clock time and timerfd-style event sources remain future work. ThreadControl getFsBase and setFsBase are implemented for current-process runtime FS-base ownership; making FS base per-thread remains part of kernel threading.

I/O

Go runtime needsLinux syscallcapOS equivalent
Network I/Oepoll_create, epoll_ctl, epoll_waitAsync cap invocation or poll cap
File I/Oread, write, open, closeDirectory/File or Namespace/Store caps through Go’s OS adapter
Stdout/stderrwrite(1, ...), write(2, ...)Console cap
Pipe (runtime internal)pipe2IPC caps or in-process channel

Go’s network poller (netpoll) is pluggable per-OS — each GOOS provides its own implementation. For capOS, it would use async capability invocations or a polling interface over socket caps.

Signals (for preemption)

Go runtime needsLinux syscallcapOS equivalent
Goroutine preemptiontgkill + SIGURGThread preemption mechanism
Crash handlingsigaction(SIGSEGV)Page fault notification
Profilingsigaction(SIGPROF) + setitimerProfiling cap (optional)

Go 1.14+ uses asynchronous preemption: the runtime sends SIGURG to a thread to interrupt a long-running goroutine. On capOS, alternatives:

  • Cooperative preemption only. Go inserts yield points at function prologues and loop back-edges. This works but means tight loops without function calls won’t yield. Acceptable for initial support.
  • Timer interrupt notification. The kernel notifies the process (via a cap invocation or a signal-like mechanism) when a time quantum expires. The notification handler in the Go runtime triggers goroutine preemption.

Implementation Strategy

Phase 1: Minimal GOOS (single-threaded, cooperative)

Fork the Go toolchain, add GOOS=capos GOARCH=amd64. Implement the minimum runtime changes:

What to implement:

  • osinit() — read Timer cap from CapSet for monotonic clock
  • sysAlloc/sysFree/sysReserve/sysMap — translate to VirtualMemory cap
  • settls() — translate Go’s FS-base install to ThreadControl
  • newosproc() — stub (single OS thread, M:N scheduler still works with M=1)
  • futexsleep/futexwake — spin-based fallback (no real futex yet)
  • nanotime/walltime — Timer cap
  • write() (for runtime debug output) — Console cap
  • exit — sys_exit for current-thread termination; the process exits when its last live thread exits
  • exitThread — terminal ThreadControl.exitThread capability operation
  • netpoll — stub returning “nothing ready” (no async I/O)

What to stub/disable:

  • Signals (no SIGURG preemption, cooperative only)
  • Multi-threaded GC (single-thread STW is fine initially)
  • CGo (no C interop)
  • Profiling
  • Core dumps

Deliverable: GOOS=capos go build ./cmd/hello produces an ELF that runs on capOS, prints “Hello, World!”, and exits.

Current capOS status: the single-thread-runtime QEMU demo proves the capability-side checkpoint for this phase without a Go fork yet. It maps, protects, and frees heap pages through VirtualMemoryClient, uses TimerClient for monotonic now and sleep, keeps newosproc unsupported, and exercises the temporary park fallback path locally.

Estimated effort: ~2000-3000 lines of Go runtime code (mostly in runtime/os_capos.go, runtime/sys_capos_amd64.s, runtime/mem_capos.go). Reference: runtime/os_js.go (WASM target) is ~400 lines; runtime/os_linux.go is ~700 lines. capOS sits between these.

Phase 2: In-Process Threading + Park

Build on implemented kernel support for:

  • multiple threads per process on the single-CPU scheduler first;
  • private ParkSpace compact wait/wake;
  • current-thread FS-base updates through ThreadControl.

Update Go runtime:

  • newosproc() creates a real kernel thread
  • futexsleep/futexwake use the ParkSpace compact park ABI
  • thread creation allocates and owns distinct TLS state per ThreadRef
  • GC can coordinate across multiple kernel threads in one process
  • Enable real blocking instead of the temporary single-thread park fallback

Deliverable: Go programs can create multiple in-process kernel threads and block/wake through futexes on one CPU. Multiple CPU-core execution remains a later SMP milestone after the threading/park contract is settled.

The 7.1.0 thread/process ownership contract is now frozen in In-Process Threading. It keeps address space, cap table, CapSet, and the capability ring process-owned; makes saved context, kernel stack, block state, and FS base thread-owned; charges thread records and kernel stacks to process-owned ledgers; and preserves a single process ring waiter until a later ring-sharding design exists. The 7.1.1 park authority contract is frozen in Park Authority. It defines process-local ParkSpace authority for private park keys, a future MemoryObject-derived SharedParkSpace model for shared park-words, and compact CAP_OP_PARK / CAP_OP_UNPARK operations as the starting ABI for the Go runtime synchronization path.

Phase 3: Network Poller

Implement runtime/netpoll_capos.go:

  • Register socket caps with the poller
  • Use an async notification mechanism (capability-based poll() or notification cap)
  • net.Dial(), net.Listen(), http.Get() work

This depends on the networking stack being available as capabilities.

Deliverable: Go HTTP client/server runs on capOS.

Phase 4: CUE on capOS

With Go working, CUE runs natively. This enables:

  • Runtime manifest evaluation (not just build-time)
  • Dynamic service reconfiguration via CUE expressions
  • CUE-based policy enforcement in the capability layer

Kernel Prerequisites

PrerequisiteRoadmap StageWhy
Capability syscallsStage 4 (sync path done)Go runtime invokes caps (VirtualMemory, Timer, Console)
SchedulingStage 5 (core done)Go needs timer interrupts for goroutine preemption fallback
IPC + cap transferStage 6Go programs are service processes that export/import caps
VirtualMemory capabilityStage 5mmap equivalent for Go’s memory allocator and GC
ThreadControl capabilityExtends Stage 5settls equivalent before full in-process threads
Thread lifecycleExtends Stage 5Implemented substrate for multiple execution contexts per process; Go integration remains
ParkSpace capabilityExtends Stage 5Go runtime synchronization through compact park/unpark

VirtualMemory Capability

This is the biggest new kernel primitive. Go’s allocator requires:

  1. Reserve large virtual ranges without committing physical memory (Go reserves 256 TB of virtual space on 64-bit systems)
  2. Commit pages within reserved ranges (back with physical frames)
  3. Decommit pages (release frames, keep virtual range reserved)
  4. Set permissions (RW for data, none for committed inaccessible pages; pure guard pages should stay reserved but uncommitted)

The existing page table code (kernel/src/mem/paging.rs) supports mapping and unmapping individual pages. It needs to be extended with:

  • Virtual range reservation (mark ranges as reserved in some bitmap/tree)
  • Lazy commit (map as PROT_NONE initially, page fault handler commits on demand — or explicit commit via cap call)
  • Permission changes on existing mappings

The concrete ABI for the first explicit-commit path is in Go VirtualMemory Contract. It chooses explicit commit/decommit before demand paging, permits VM_PROT_NONE through reservation metadata plus non-present user PTEs, and requires separate virtual-reservation and physical-commit quota ledgers. Committed VM_PROT_NONE intentionally retains allocated frames and page contents for later protection restore. Pure guard pages should use reserved uncommitted pages so they consume virtual quota but no physical commit budget.

Thread Support

Extending the process model (kernel/src/process.rs) now follows the contract in In-Process Threading. See the SMP proposal for the PerCpu struct layout (per-CPU kernel stack, saved registers, FS base); Thread extends this for multi-thread-per-process. See also the In-Process Threading section in Roadmap for the roadmap-level view.

#![allow(unused)]
fn main() {
struct Process {
    pid: u64,
    address_space: AddressSpace,  // shared by all threads
    caps: CapTable,               // shared by all threads
    threads: Vec<Thread>,
}

struct Thread {
    tid: u64,
    state: ThreadState,
    kernel_stack: VirtAddr,
    saved_regs: RegisterState,    // rsp, rip, etc.
    fs_base: u64,                 // for thread-local storage
}
}

The scheduler (Stage 5) schedules threads, not processes. Each thread gets its own kernel stack and register save area. Context switch saves/restores thread state. Page table switch only happens when switching between threads of different processes.

Alternative: Go via WASI

For comparison, the WASI path from the userspace-binaries proposal:

Native GOOSWASI
PerformanceNative speed~2-5x overhead (wasm interpreter/JIT)
Go compatibilityFull (after Phase 3)Limited (WASI Go support is experimental)
GoroutinesReal M:N schedulingSingle-threaded (WASI has no threads yet)
Net I/ONative async via pollerBlocking only (WASI sockets are sync)
Kernel workVirtualMemory, threads, parkNone (wasm runtime handles it)
Go runtime forkYes (maintain a fork)No (upstream GOOS=wasip1)
GCFull concurrent GCConservative GC (wasm has no stack scanning)
Maintenance burdenHigh (track Go releases)Low (upstream supported)

WASI is easier but limited. Go on WASI (GOOS=wasip1) is officially supported but experimental — no goroutine parallelism, no async I/O, limited stdlib. For running CUE (which is CPU-bound evaluation, no I/O, single goroutine), WASI might be sufficient.

Native GOOS is harder but complete. Full Go with goroutines, concurrent GC, network I/O, and the entire stdlib. Required for Go network services or anything using net/http.

Recommendation: Start with WASI for CUE evaluation. The in-tree path is WASI Host Adapter Phase W.8 (and Task 9 of WASI Host Adapter): a CUE evaluator binary built against TinyGo or upstream Go’s GOOS=wasip1, loaded through the host adapter against a future ScriptPackage cap. Phase W.8 is blocked on the same std-userspace decision as W.7 today, but it is the smaller-step bridge to running Go logic on capOS before the native runtime port exists. If Go network services or full goroutine/GC semantics become a goal, invest in the native GOOS=capos track described here; the Userspace Binaries “Phase W.8” entry keeps both paths sequenced from the language-track view.

Relationship to Other Proposals

  • Userspace Binaries — owns the overall language-runtime track. This proposal adds concrete Go implementation details to the future “Future: Go (GOOS=capos)” branch there. The POSIX compatibility adapter is not sufficient for native Go because Go does not use libc on Linux; it makes raw syscalls. The GOOS approach bypasses POSIX entirely. The same userspace-binaries doc tracks Phase W.8 as the Go-on-WASI interim path.
  • Programming Languages — the matrix entry for Go points here for the native track and to the WASI host adapter’s Phase W.8 for the TinyGo / GOOS=wasip1 interim. Any change to the sequencing between native Go and Go-on-WASI must keep that row in sync.
  • WASI Host Adapter — Phase W.8 of the WASI host adapter ships a TinyGo or upstream Go GOOS=wasip1 CUE evaluator binary that runs inside the in-tree wasmi-backed host. That slice is blocked on the same std-userspace decision as W.7 today and bridges to the native Go track described here once it lands. The detailed plan lives in WASI Host Adapter Task 9.
  • Service Architecture — Go services participate in the capability graph like any other process. The Go net poller (Phase 3) uses TcpSocket/UdpSocket caps from the network stack.
  • Storage and Naming — Go’s os.Open()/os.Read() map to Namespace + Store caps via the GOOS file I/O implementation. Go doesn’t use POSIX for this — it has its own runtime/os_capos.go with direct cap invocations.
  • SMP — later multi-core scaling for Go after Phase 2. The first Phase 2 target is single-CPU in-process threads plus parking; per-CPU scheduling belongs to the later SMP milestone.

Open Questions

  1. Fork maintenance. A GOOS=capos fork must track upstream Go releases. How much drift is acceptable? Could the capOS-specific code eventually be upstreamed (like Fuchsia’s was)?

  2. CGo support. Go’s FFI to C (cgo) requires a C toolchain and dynamic linking. Should capOS support cgo, or is pure Go sufficient? CUE doesn’t use cgo, but some Go libraries do.

  3. GOROOT on capOS. Go programs expect $GOROOT/lib at runtime for some stdlib features. Where does this live on capOS? In the Store? Baked into the binary via static compilation?

  4. Go module proxy. go get needs HTTP access. On capOS, this would use a Fetch cap. But cross-compilation on the host is more practical than building Go on capOS itself.

  5. Debugging. Go’s runtime/debug and pprof expect signals and /proc access. What debugging capabilities should capOS expose?

  6. GC tuning. Go’s GC is tuned for Linux’s mmap semantics (decommit is cheap, virtual space is nearly free). capOS’s VirtualMemory cap needs to match these assumptions or the GC will need retuning. The first matching point is the reserve/commit/decommit contract in Go VirtualMemory Contract.

Estimated Scope

PhaseNew kernel codeGo runtime changesDependencies
Phase 1: Minimal GOOS~200 (VirtualMemory cap)~2000-3000Stages 4-5
Phase 2: Threading~500 (threads, park)~500In-process threading/park (7.1/7.2)
Phase 3: Net poller~100 (async notification)~300Networking, Stage 6
Phase 4: CUE on capOS00Phase 1 (or WASI)
Total~800~2800-3800

Plus ongoing maintenance to track Go upstream releases.

Proposal: Lua Scripting

How capOS should add Lua as a small capability-aware scripting environment without turning scripts into ambiently privileged shell fragments.

Problem

capOS needs a lightweight scripting path for operator workflows, demos, service glue, and eventually interactive shell automation. The native shell already exposes typed capabilities and explicit child grants, but a shell REPL is not a full programming language. Lua is attractive because it is small, embeddable, and designed to let a host provide the domain API.

The risk is predictable: “system scripting” often becomes an escape hatch around the operating system model. A script runner that receives broad ProcessSpawner, BootPackage, filesystem, network, or terminal authority and then exposes io, os, package.loadlib, or raw handle integers would recreate the ambient authority capOS is trying to avoid.

The target is not “make Lua root.” The target is:

  • Lua as ordinary userspace code.
  • Capabilities as the only authority.
  • Host-provided Lua libraries that map to typed capOS interfaces.
  • Exact grants for script processes, with no default filesystem, network, process, terminal, or debug authority.

Scope

In scope:

  • A capos-lua userspace runner for trusted operator and service scripts.
  • A small Lua host API over capos-rt typed clients.
  • A policy for standard Lua libraries on capOS.
  • Script packaging and shell launch shape.
  • Validation through QEMU scripts that prove granted and ungranted paths.

Out of scope for the first implementation:

  • LuaJIT.
  • Dynamic native Lua C modules.
  • A POSIX-compatible Lua environment.
  • Treating in-process Lua sandboxing as the isolation boundary for hostile scripts.
  • Kernel awareness of Lua.

Current Manual Pages

  • Programming Languages is the language-status index. The Lua row tracks the in-tree demos/lua-smoke/ runner against the Rust, Python, Go, C/C++, WASI, and POSIX adapter rows and is the page to update whenever the runtime label or phase status changes.
  • Userspace Runtime documents the implemented capos-rt surface (entry, allocator, syscall, CapSet lookup, typed ConsoleClient / TimerClient / VirtualMemoryClient) that the Lua runner consumes today through host::Host::register_console, register_timer, and register_memory. Any new Lua binding starts by identifying the matching typed client on this page, not by reaching into raw ring SQEs or method IDs.
  • Shell proposal defines the spawn-plan shape that the shell uses to launch ordinary userspace processes with exact grants. The Lua runner is a launched workload in that model, not a shell-embedded interpreter; future lua scripts/admin/inspect.lua with { ... } sugar must desugar to the same explicit spawn plan rather than inheriting the shell’s current CapSet.
  • Userspace Binaries proposal owns the userspace runtime, language-support, and compatibility-adapter plan that the Lua runner sits inside. Its “Future: Lua” section names this proposal as the authoritative design for capos-lua, and the Lua runner must keep matching its rules for unforgeable capability userdata, exact grants, curated standard libraries, no raw CapIds, and the C/libcapos dependency for the upstream PUC Lua port.

Research Grounding

Relevant research:

  • Capability research survey: keep typed Cap’n Proto interfaces as the permission boundary and avoid parallel rights flags.
  • Genode: route service access structurally; sessions are typed and resource-accounted.
  • Plan 9 and Inferno: per-process namespaces are useful precedent, but capOS should not turn scripts into path-global clients.
  • EROS, CapROS, and Coyotos: confinement depends on constructing the subject with only the capabilities it may use.
  • seL4: keep the privileged kernel surface small and let userspace policy build higher-level systems.

External Lua references:

  • The official Lua 5.5 manual describes Lua as an embeddable C library with a host program that registers C functions callable from Lua.
  • The official Lua version history says Lua 5.5.0 was released on 2025-12-22, while Lua 5.4.8 is the current 5.4 bug-fix release from 2025-06-04. It also says different x.y versions have different APIs and virtual machines, and precompiled chunks are not portable between versions.
  • The official Lua 5.5 readme says Lua is distributed as pure ISO C and normally builds into lua, luac, and liblua.a. That makes Lua a plausible native port once capOS has the C userspace and libcapos substrate; it does not make Lua runnable on today’s no-std Rust-only userspace by itself.

Rust implementation candidates checked:

  • mlua is a mature Rust binding layer for PUC Lua, LuaJIT, and Luau. It is not a pure-Rust VM. Its vendored path still builds C/C++ Lua-family sources through mlua-sys, cc, and lua-src/luajit-src, and the public crate uses std, libc, parking_lot, panic catching, and host linker/module assumptions. It is a useful API reference, but it does not avoid the native C/libcapos port.
  • piccolo is the only inspected pure-Rust implementation that looks like a credible capOS bootstrap candidate. It has a stackless VM, fuel-based stepping, memory tracking through gc-arena, safe userdata downcasting, and most core language behavior. The current crate is still std-based, depends on anyhow, thiserror, rand, ahash, and a git-pinned gc-arena, and its built-in I/O path writes to host stdout. Porting it to capOS would require a no_std + alloc fork plus host-library replacement, but that is likely less work than bringing up C Lua before libcapos.
  • silt-lua, hematita, and luar were also inspected. They are pure Rust in varying degrees, but their own READMEs/code show early, incomplete, or CLI-oriented implementations. They are not good foundations for capOS runtime work today.

Design Principles

  1. Lua is not a kernel feature. The kernel sees a normal process with a CapSet and a capability ring.

  2. The runner’s CapSet is the authority. Script text, module names, global variables, and Lua tables are data. They cannot create authority.

  3. In-process sandboxing is defense in depth, not confinement. A trusted service may embed Lua for local configuration or small trusted extensions. Untrusted user scripts must run in a separate process with a narrow CapSet, quotas, and no access to the host service’s private caps.

  4. The standard libraries are curated. Base, coroutine, table, string, math, and utf8 are reasonable starting points. io, os, package, debug, dynamic loading, and process execution are absent by default or replaced by capOS-specific libraries backed by explicit caps.

  5. No raw CapIds in Lua. A Lua capability value is host-owned userdata with a hidden metatable. Scripts can call methods exposed by the wrapper, but they cannot forge a handle by guessing an integer.

  6. Lua version is part of the runtime contract. Precompiled chunks, language behavior, and C API details are series-specific. capOS should pin the runner to a declared Lua series and expose that in manifests and smoke output.

  7. C module loading waits. Dynamic native modules need loader, linker, symbol, and authority policy. The first runner should statically link the selected Lua implementation and capOS host libraries.

Architecture

flowchart TD
    Shell[capos-shell] --> Launcher[RestrictedLauncher]
    Launcher --> Runner[capos-lua process]
    Runner --> Lua[PUC Lua VM]
    Runner --> Rt[capos-rt / libcapos host API]
    Rt --> Ring[capability ring]
    Ring --> Kernel[kernel CapObject dispatch]
    Ring --> Services[userspace services]

    ScriptPkg[ScriptPackage or Namespace cap] --> Runner
    Terminal[TerminalSession cap] --> Runner
    OtherCaps[Exact service caps] --> Runner

capos-lua is just another binary launched by the shell or init-owned service graph, matching the “language runtime as ordinary process” rule from Userspace Binaries. The parent chooses the script source and the exact caps. The runner creates one Lua state, installs selected libraries, wraps granted caps as userdata, loads the script with a controlled environment, executes it in protected mode, flushes queued releases, and exits with a normal process status.

The initial implementation should be a standalone runner, not Lua embedded in capos-shell. Keeping the runner as a child process prevents script bugs, Lua VM bugs, and accidental infinite loops from corrupting the interactive shell state. It also gives QEMU smokes a clear process boundary to inspect.

Version Choice

Use PUC Lua, not LuaJIT, for the first runner.

As of 2026-05-13, Lua 5.5.0 (released 2025-12-22) is still the current upstream series and Lua 5.4.8 (released 2025-06-04) is still the latest 5.4 bug-fix release. Lua 5.5 has features that fit capOS scripting: explicit global declarations, compact arrays, and static fixed binaries. It is the right default target for new capOS-native scripts.

Keep a narrow compatibility option open for Lua 5.4.8 if imported scripts or libraries require it. Do not mix bytecode or native modules between Lua series. A script package should declare:

language = "lua"
series = "5.5"
entry = "main.lua"

Source scripts are preferable to precompiled chunks for reviewability. If precompiled chunks are allowed later, they must be tied to the exact runtime series and treated as trusted build inputs.

There is one practical sequencing exception: a piccolo-based capos-lua-smoke may be the fastest way to prove the capOS host API before C userspace support exists. That should be treated as an implementation bootstrap, not as a promise of exact PUC Lua compatibility. If capOS takes that route, the smoke should declare the runtime as piccolo rather than lua-5.5.

Host API

The first host API should be explicit and boring:

local capos = require("capos")

local terminal = capos.require_cap("terminal", "TerminalSession")
terminal:write_line("hello from Lua")

local now = capos.require_cap("timer", "Timer"):now()
terminal:write_line("now_ns=" .. tostring(now))

capos.require_cap(name, interface) looks up a bootstrap cap by manifest name and checks the expected interface metadata before returning userdata. It fails closed if the cap is absent or has the wrong interface.

Generated or handwritten bindings should expose method names, not method numbers. The binding owns Cap’n Proto serialization through capos-rt or libcapos; scripts should not construct raw SQEs, raw method IDs, transfer descriptors, or cap_enter calls.

Transferred result caps become owned Lua userdata. Release is deterministic when possible:

do
  local h <close> = launcher:spawn({
    name = "child",
    binary = "timer-smoke",
    grants = { terminal = terminal },
  })
  local code = h:wait()
end

Finalizers may queue cleanup, but they are not the primary lifetime contract. The runner must flush owned-handle releases at script return and process exit.

Standard Library Policy

Initial allowed libraries:

LibraryPolicy
baseLoad selected safe functions. load is allowed only with text mode and a supplied environment.
coroutineAllowed for cooperative script structure. It does not map to OS threads.
table, string, math, utf8Allowed.
debugDenied by default. It pierces ordinary Lua abstraction and should require an explicit developer-profile cap.
ioDenied by default. Replace with capos wrappers over TerminalSession, future File, ByteStream, or Namespace caps.
osDenied by default. Replace time, exit, and process operations with cap-backed methods.
packageRestricted. require searches a script package or namespace cap, not host paths or environment variables.
dynamic C modulesDenied until native module loading has a reviewed authority model.

Lua _ENV is useful for presenting a small global namespace, but it is not a security boundary by itself. The security boundary is the process plus its CapSet.

Script Sources

The current ProcessSpawner.spawn shape names a binary and grants caps; it does not yet pass arbitrary argument vectors or script blobs. That creates an implementation dependency for useful Lua scripting.

Near-term options, in order:

  1. Smoke-only compiled script: capos-lua-smoke statically embeds one script string in .rodata and proves the host API. This is not the general product, but it verifies the Lua VM, allocator, CapSet lookup, and terminal output without new startup ABI.

  2. Runner config cap: init or the shell grants a read-only ScriptPackage or ConfigBlob cap to capos-lua. The runner asks that cap for main.lua and module bytes. This keeps script data out of the kernel and fits the existing capability model.

  3. Storage-backed scripts: after Store/Namespace exists, scripts live under a granted namespace. require searches only that namespace and only through a read-only script-package view unless the script also receives a writable namespace cap.

Do not add a Lua-specific boot manifest field or kernel cap. Script packaging belongs to init, shell, storage, or a userspace package service.

Shell Integration

The launch shape comes from the Shell proposal; Lua adds no new spawn primitive. The shell should treat Lua as a launched workload:

run "capos-lua" with {
  terminal: @terminal
  timer:    @timer
  scripts:  @home.sub("scripts/admin")
}

Later, the shell can add sugar such as:

lua scripts/admin/inspect.lua with { terminal: @terminal, timer: @timer }

That sugar must compile to the same explicit spawn plan. There is no implicit inheritance of the shell’s full current CapSet.

Agent mode can also use Lua, but Lua should be a tool target rather than the model itself. The agent runner may advertise “run this approved Lua script” as a consent-gated tool. The model still does not receive session caps.

Adventure Game Use

The adventure game is a good later demonstration target because it needs both strict authority and authorable behavior. The kernel and service capabilities still enforce authority; Lua should only express deterministic scenario logic over the caps granted to the script runner.

Suitable Lua-owned behavior:

  • mission beat selection,
  • deterministic NPC dialogue state machines,
  • quest-board text,
  • hint selection,
  • debrief variants,
  • scripted reactions that call typed game APIs through granted object caps.

Unsuitable Lua-owned behavior:

  • deciding whether a player has authority,
  • mutating relic custody without a typed service call,
  • applying combat damage outside the game service,
  • minting or transferring caps,
  • holding broad spawn, debug, filesystem, or network authority by default.

The useful proof is language independence: a Rust adventure service and a Lua scenario script should both demonstrate proper capability use, including bounded failures when a script lacks a required cap.

Blocking, Async, and Coroutines

The first runner can use synchronous typed client calls over the existing single-owner ring client. A blocking Lua method blocks the runner process, which is acceptable for the first operator-script use case.

Coroutines provide script-local cooperative structure, not OS scheduling. A future runtime reactor can resume Lua coroutines when capability completions arrive, but that should wait until the capOS runtime has a general demux path for threaded and async clients. Do not design Lua-specific CQ demultiplexing.

Security Model

Threat boundaries:

  • Script source is untrusted input until parsed and loaded in protected mode.
  • Script packages are trusted build or storage inputs only when their source, digest, author, and runtime series are review-visible.
  • The Lua VM is not trusted to confine hostile code inside a privileged host process.
  • Capability wrappers must validate method parameters, buffer sizes, transfer counts, and result-cap interface IDs before translating Lua values into ring calls.
  • Terminal and audit output must not print secrets. Lua error rendering should use bounded messages and avoid dumping arbitrary cap userdata internals.

Default deny list for untrusted scripts:

  • no debug,
  • no dynamic module loading,
  • no raw os/io,
  • no broad ProcessSpawner,
  • no broad network manager,
  • no boot package,
  • no mutable namespace unless that is the explicit script purpose,
  • no host environment variables.

Quotas matter. The first useful quota is process memory. CPU budgets, timer budgets, and capability-call quotas should follow the normal capOS scheduling and resource-accounting path rather than special Lua hooks.

Implementation Phases

Phase 0: Contract and Host Surface (in tree)

  • Proposal landed and docs/programming-languages.md records the Phase 0 status.
  • Initial runtime label is capos-lua-subset, not lua-5.x. Bytecode portability is explicitly out of scope.
  • Phase 0 ships a tiny hand-written tree-walking interpreter under demos/lua-smoke/ that exists to validate the long-term capability-aware host API design without committing capOS to a particular Lua dialect. Piccolo was investigated and not adopted: upstream does not compile no_std and the swap surface (anyhow, thiserror, std::io, std::sync, ahash::RandomState entropy) is large enough that the maintenance cost of a fork was judged to outweigh the benefit at this stage. The hand-written interpreter is replaced or kept as a research-grade sandbox once the C/libcapos PUC port lands.
  • Host surface in tree:
    • typed userdata over capos-rt::ConsoleClient and capos-rt::TimerClient,
    • obj:method(args) dispatch through host::Host::call_method,
    • errors flow back as Lua runtime errors via EvalError::Lua, never Rust panics on script-controlled inputs,
    • bounded execution via a per-run step counter (MAX_STEPS).
  • Future Phase 0 items (still open):
    • generalised capos.require_cap lookup,
    • capos.interfaces reflection for typed errors,
    • owned-cap release semantics for granted result handles.

Phase 1: Native Runner Smoke (in tree)

  • demos/lua-smoke/ builds as capos-demo-lua-smoke, gets embedded in system-lua-smoke.cue, and runs under make run-lua-smoke with QEMU’s isa-debug-exit to gate cleanly on script success or failure.
  • The smoke loads no Lua standard library at all (no io, os, package, debug, string, table, math); the only callable surface is the typed cap bindings registered in host::Host::register_*.
  • Iteration L.1 (2026-05-04 18:42 EEST, merge 050ac735) shipped the initial console:write_line and timer:now bindings.
  • Iteration L.2 (2026-05-05 19:30 UTC) added the third host binding, memory, wrapping capos-rt::VirtualMemoryClient. The Lua surface is memory:alloc(size) -> userdata, memory:write(buf, off, byte), memory:read(buf, off) -> int, memory:size(buf) -> int. The host binding owns the kernel-mapped address and the page-aligned size; the Lua side only ever sees an opaque userdata id and the byte values that came back through the typed binding. Each read/write is bounds- checked host-side before the single-byte volatile_* access. Per-call (MAX_MEMORY_ALLOC_BYTES = 64 KiB), aggregate (MAX_MEMORY_TOTAL_BYTES = 256 KiB), and buffer-count (MAX_MEMORY_BUFFERS = 64) ceilings rejected as typed Lua errors keep hostile scripts from exhausting the per-process virtual-memory quota before the kernel does. The smoke proof lines ([lua-smoke] memory:alloc size=4096, [lua-smoke] memory roundtrip 65,66,67, [lua-smoke] memory sum=198) are gated by tools/qemu-lua-smoke.sh.
  • Iteration L.3 (2026-05-13 09:28 EEST, commit 430ccd0e) added deterministic memory:release(buf) for the same smoke-only host binding. The host calls VirtualMemory.unmap with the exact mapped (addr, size) pair stored for the opaque buffer userdata, marks that buffer dead after the unmap succeeds, credits the live byte budget, and rejects later read, write, size, or release calls on that stale userdata as Lua runtime errors. The proof line [lua-smoke] memory:release size=4096 is gated by tools/qemu-lua-smoke.sh. This remains language-support behavior only: Lua receives no broader memory authority, raw address, raw cap id, or new kernel behavior.
  • Expected QEMU output is asserted by tools/qemu-lua-smoke.sh: smoke produces [lua-smoke] hello from lua-smoke v0, an elapsed_ns= measurement through timer:now, the L.2 memory round-trip lines, the L.3 release line, and a [lua-smoke] script ok proof line; init exits via exitWhenServiceExits.
  • Future Phase 1 items (still open):
    • typed wrong-interface and missing-cap failure modes returned as Lua runtime errors,
    • explicit denied-API proof (currently denied by construction because no Lua stdlib is loaded at all),
    • TerminalSession.writeLine parity in addition to the current Console.writeLine binding,
    • the next typed cap binding (process spawning or endpoint IPC).

Phase 2: Script Package Input

  • Add a userspace-owned script source cap or startup-config path.
  • Let shell/init launch capos-lua with a selected package and exact grants.
  • Implement restricted require over the package.
  • Add QEMU proof for a granted TerminalSession call and a denied ungranted cap lookup.

Phase 3: Generated Capability Bindings

  • Generate Lua binding metadata from schema/capos.capnp or from the same interface registry used by the native shell.
  • Expose method names and structured params/results.
  • Add transfer-result cap adoption and deterministic release tests.
  • Keep raw Cap’n Proto builders out of script code unless a separate developer diagnostic cap grants that power.

Phase 4: Shell and Service Use

  • Add shell sugar for script execution after the exact spawn plan exists.
  • Permit trusted services to embed Lua only when they can prove the embedded state holds no extra authority beyond what the script should use.
  • Add audit records for script launch, script package digest, grants, exit status, and authority-touching cap calls when audit caps are available.

Validation

The first implementation is not complete until it has QEMU evidence:

  • A Lua script prints through a granted TerminalSession.
  • The same script cannot use io, os.execute, debug, or an ungranted cap.
  • A missing or wrong-interface cap lookup returns a bounded Lua error.
  • An owned result cap is released deterministically.
  • The runner exits cleanly and does not wedge the shell.

Host tests should cover Lua value conversion and binding generation once those pieces are pure enough to test outside QEMU. Do not claim “Lua scripting works” from host tests alone; the useful behavior is authority-shaped process execution in capOS.

Open Questions

  • Whether the initial implementation should wait for libcapos C support or use a temporary Rust Lua VM to prove the host API earlier.
  • The exact startup-config mechanism for selecting main.lua before storage and general process arguments exist.
  • Whether Lua 5.5 should be the only supported series or whether a 5.4 runner is worth carrying for ecosystem compatibility.
  • How much schema reflection the Lua binding should expose before the native shell’s generic call surface lands.
  • Which audit fields belong in AuditLog once script launch becomes an operator workflow rather than a smoke.

Proposal: WASI Host Adapter

How capOS should host WebAssembly modules through the WebAssembly System Interface, without recreating ambient authority and without committing to a runtime that the userspace baseline cannot support today.

Problem

WASI is the natural sandboxed-execution path for capOS:

  • It is already designed to remove ambient authority. Preview 1 requires preopens — every file descriptor a module sees was granted by the host at startup. Preview 2 makes typed handles first-class through the Component Model.
  • A single host adapter unlocks every language with a useful WASI target: Rust, C/C++, Go (GOOS=wasip1), TinyGo, Python, Zig, AssemblyScript, any interpreter compiled to wasm.
  • Wasm linear-memory bounds checks plus capability scoping give defence in depth for untrusted plugins and third-party code without weakening the capOS isolation model.

The risk pattern is the same as POSIX: a host adapter that grants ambient authority would erase the property that makes WASI worth doing. Every WASI import must be backed by a typed capability the host process already holds. If the host does not hold the cap, the module cannot reach it.

WASI is not a substitute for native ports of languages that need real OS threads, full asynchronous I/O, signals, or large POSIX surfaces. Those remain the native runtime tracks. WASI is the right tool for sandboxing untrusted plugins, third-party scripts, isolated workloads, CPU-bound portable tools, and language ecosystems whose native capOS port has not yet been built.

Scope

In scope:

  • A capos-wasm userspace host adapter built on capos-rt.
  • A WASI Preview 1 surface whose imports map 1:1 to typed capOS capabilities.
  • Per-instance CapSet projection: each module sees only the caps the host grants for that instance.
  • Phase decomposition that picks one runtime for v0, lets later phases migrate to the Component Model and richer runtimes, and stays explicitly outside ambient authority.
  • Validation through QEMU smokes that prove granted and ungranted paths.

Out of scope for the first implementation:

  • wasi-threads (requires shared-memory + atomics + bulk-memory).
  • fork()-shaped semantics. Cannot clone wasm linear memory; same constraint as the browser-wasm proposal.
  • Synchronous signal delivery inside a wasm module. Fuel exhaustion plus host-driven termination are the only deterministic interruptions.
  • File-backed MAP_SHARED mmap.
  • Treating the wasm sandbox as the only isolation boundary for hostile modules — the capOS process boundary remains the primary boundary.
  • A custom non-portable WIT dialect with externref-typed cap handles. This proposal explicitly defers richer cap handles to Component Model resources (Phase W.7).

Current Manual Pages

  • Programming Languages summarizes WASI’s current status relative to Rust, Python, Go, C/C++, Lua, and POSIX adapter tracks.
  • Userspace Binaries Part 5 sketches the WASI host adapter at a higher level. This proposal supersedes that sketch with a full design surface; the userspace-binaries proposal continues to own the broader native-binary, language, and POSIX-adapter roadmap.
  • Userspace Runtime documents the implemented capos-rt surface that the host adapter consumes.
  • Browser/WASM covers the separate browser-hosted wasm experiment. The two proposals share wasm-runtime insight but target different substrates: WASI host adapter runs on capOS hardware; the browser proposal runs capOS concepts in a browser tab.
  • Lua Scripting covers a similar capability-scoped script runner shape; the WASI track is the untrusted / portable counterpart to that proposal’s trusted native runner.
  • Go Runtime covers the native GOOS=capos alternative to Go-on-WASI.

Research Grounding

Relevant research and external references:

In-tree references: this proposal lifts the capability-mapping table from docs/proposals/userspace-binaries-proposal.md Part 5 and the runtime survey/phase decomposition shape from comparable language-runtime planning work; concrete repo evidence appears inline below.

Design Principles

  1. WASI is not a kernel feature. The kernel sees a normal userspace process with a CapSet and a capability ring. The host adapter is one of many capos-rt-based binaries.
  2. The host adapter’s CapSet is the authority. WASI module bytes are data. They cannot create authority. Every import is satisfied by a cap the host already holds; absent caps are refused, not synthesised.
  3. Per-instance CapSets are subsets, not supersets. Each loaded module gets only the caps the manifest grants for that instance. The host’s own CapSet may be larger; the module never sees the parent.
  4. The wasm sandbox is defence in depth, not the isolation boundary. The capOS process boundary remains primary. Wasm bounds checking and immutable Module validation add a second software-enforced boundary inside the host process so an entire untrusted module image can be confined.
  5. Schema-first capability mapping. Each WASI function is backed by a typed capability, not by emulated POSIX semantics. POSIX-shaped integer fds in Preview 1 are a Preview 1 ABI requirement, not a capability model concession.
  6. Pick portable WASI, skip non-portable extensions. Custom imports with externref-typed cap handles would lock capOS into a non-portable WIT dialect that no other host implements. The Component Model’s typed resources are the right answer for first-class typed cap handles in wasm; defer to that path rather than inventing a one-vendor dialect.
  7. Fail closed. Any unimplemented WASI call returns ERRNO_NOSYS. Any cap lookup that fails returns the appropriate Preview 1 errno (ERRNO_BADF, ERRNO_ACCES, ERRNO_NOSYS). Modules cannot probe absent caps for ambient behavior.

Architecture

flowchart TD
    Manifest[boot manifest:<br/>system-wasm-host.cue] --> Host[capos-wasm process]
    Host --> Runtime[wasm runtime<br/>wasmi v0]
    Host --> Rt[capos-rt typed clients]
    Rt --> Ring[capability ring]
    Ring --> Kernel[kernel CapObject dispatch]
    Ring --> Services[userspace services]

    Runtime --> Module[wasm module instance]
    Module --> Imports{WASI imports}
    Imports --> FdTable[per-instance fd table /<br/>Preview 2 resource handles]
    FdTable --> Caps[granted typed caps]
    Caps --> Rt

capos-wasm is one userspace process. It hosts one or more wasm module instances. The runtime engine (wasmi for v0; see Runtime Selection below) is linked into that process. WASI imports are resolved by the host adapter’s import-resolver module against typed capOS clients. Each instance has its own per-instance fd table (Preview 1) or resource bundle (Preview 2) populated from the manifest grants for that instance.

The runtime exposes only what the host process can fulfil. If the host does not hold an EntropySource cap, random_get returns ERRNO_NOSYS. If the manifest did not grant a home namespace, the module’s preopen table does not contain it and path_open("/home/...") resolves to nothing.

Runtime Selection

For v0 (Phases W.1 through W.6), use wasmi. For W.7+, evaluate migration to wasmtime when capOS userspace gains std support and a futures executor, or to WAMR if minimal footprint becomes the dominant constraint and the C build path lands.

ConstraintwasmiWAMRwasm3wasmtime
Pure Rust, drops into capOS workspaceyesC (needs cc/build glue, no libcapos yet)C (same problem)yes
no_std + allocyes, advertised explicitlypartial (embedded, libc-shaped)yes (bare metal)no (needs std and a futures executor)
LicenseApache-2.0 / MITApache-2.0 with LLVM exceptionMITApache-2.0
Footprintsmall register-based bytecode (v0.32 5x speedup)~29 KB AOT, ~58 KB interpreter~64 KB code, ~10 KB RAMlarge (Cranelift JIT)
Sandboxingwasm spec + execution-engine isolationwasm spec + AOT validationwasm specwasm spec + Cranelift verifier
Fuel/gas meteringyes, built-innot advertisedyesyes
Capability transferexternref since 0.24; component model on roadmapreference types yes; component model partialpartial reference typesfull component model (best-in-class)
WASI versionspreview1 stable; preview2 on roadmappreview1 stable; preview2 partialpreview1 partialpreview1 + preview2 + components
Host function interfacemirrors wasmtime APIC API; Rust through wamr-rust-sdkC APIRust + C
Maintenancewasmi-labs, two security audits (2023, 2024)Bytecode Alliance, TSC-governedmaintainer in minimal-maintenance phaseBytecode Alliance flagship
Threadingnot in current scopeyes (wasi-threads)noyes

Why wasmi for v0:

  • Pure Rust drops directly into the capOS workspace. No C build chain required — the same chain libcapos does not yet provide.
  • Genuine no_std + alloc support means no host-side OS abstraction is required for the runtime itself; it sits cleanly on capos-rt.
  • Built-in fuel metering matches capOS’s preference for explicit resource accounting.
  • externref support is sufficient for any future v1 capability-handle experiment that does not block on the Component Model.
  • Mirroring the wasmtime API means that migrating to wasmtime in W.7 is rewiring imports, not rewriting host calls.

Not chosen for v0:

  • wasmtime needs std userspace and a futures executor. capOS userspace is no_std + alloc today; this is the same blocker that keeps the Rust capnp-rpc crate (v0.25) off capos-rt and queues the remote-session-client capnp-rpc rewrite behind an async runtime decision.
  • wasm3 is in maintainer-declared minimal-maintenance phase; not a good fit for a long-horizon capOS substrate.
  • wasmer has similar weight to wasmtime and does not align as cleanly with the Bytecode Alliance Preview 2 trajectory.
  • WAMR is a strong candidate when a C toolchain and libcapos exist and minimal footprint is the goal. It is the migration target for high-density wasm hosting later, but it is not the v0 baseline because the C substrate is not in tree.

WASI Version Stance

  • Preview 1 for v0 (Phases W.1 through W.6). POSIX-shaped, file-descriptor-based, C-friendly. Tier 2 in upstream Rust since 1.78 (May 2024); supported by Go 1.21+ (GOOS=wasip1 GOARCH=wasm), TinyGo, Clang --target=wasm32-wasi, Zig. This is the immediate unlock.
  • Preview 2 / Component Model for W.7+. Resources are first-class typed handles. They are the natural mapping for capOS capabilities — closer in shape to OwnedCapability<T> than to integer fds. WIT interfaces let cap-aware Rust crates export typed APIs that a wasm component on capOS or a native capOS service can consume the same way it consumes a capnp interface.

Skipping Preview 1 entirely and starting at Preview 2 is possible with wasmtime today, but harder with wasmi; doing so would push the entire v0 unlock behind the std-userspace decision. The Preview 1 first / Preview 2 later sequencing is the smaller-step path to running C, Rust, Go, Python, TinyGo on capOS.

Capability Mapping Surface

Preview 1: per-import mapping

Each Preview 1 import is backed by a typed capOS capability the host adapter already holds. POSIX inherits ambient authority through global path namespaces, integer fds, and a process credential table; WASI removes that by requiring preopens, and capOS pushes it further by requiring an explicit per-import cap mapping in the host adapter.

WASI preview1 importcapOS host-adapter implementation
args_get / args_sizes_getRead from a future capOS LaunchParameters cap or per-instance arena. Empty by default until that surface lands.
environ_get / environ_sizes_getRead from a KeyValueScope / ConfigOverlay cap when one exists; empty by default. Open question §6.
clock_time_get(MONOTONIC)Timer.now() over the host’s TimerClient.
clock_time_get(REALTIME)Future wall-clock cap; until then return ERRNO_NOSYS or ERRNO_INVAL.
proc_exit(code)Map to a host-internal “instance exited with code” status. The host process does not exit; the wasm instance does.
random_getThe kernel EntropySource cap (the in-tree CSPRNG capability; see schema/capos.capnp interface EntropySource and KernelCapSource::EntropySource). Refuse with ERRNO_NOSYS when the host adapter was not granted entropy authority.
fd_write(1, ...) / fd_write(2, ...)Pre-opened fd 1 to host’s Console / TerminalSession write path; fd 2 to same or a separate log cap if granted.
fd_read(0, ...)Pre-opened fd 0 to a granted TerminalSession or future StdIO input cap if available; else ERRNO_BADF. No bare in-tree StdinReader cap exists today; non-terminal stdin requires a future input cap.
path_open(preopened_dir_fd, path, ...)Resolve path inside the Namespace cap mounted as that preopen, then open through the namespace’s Store / File capability.
fd_read / fd_write on opened filesTranslate to the typed File capability behind the host-side fd table entry.
fd_closeDrop the typed cap handle (release-on-drop in capos-rt).
fd_seek / fd_tell / fd_filestat_getMethods on the File cap.
fd_prestat_get / fd_prestat_dir_nameEnumerate the host adapter’s preopened-directory table built from manifest grants.
sock_send / sock_recv / sock_shutdownTranslate to typed TcpSocket / UdpSocket cap calls.
poll_oneoffMultiplex over the host’s capability ring; CQEs are the event source. Open question §3.
fd_advise / fd_allocate / fd_renumberStub or ERRNO_NOSYS until needed.
sched_yieldNo-op or single-tick yield through the runtime’s scheduler.

Preview 2: WIT-resource mapping

When the host adapter migrates to Preview 2 (Phase W.7+), the imports become typed capOS capabilities directly through WIT resources:

WIT package / interfacecapOS host-side cap
wasi:io/streams (input-stream, output-stream resources)Wrap one capOS cap per stream (Console / TerminalSession / File / TcpSocket). The resource handle in wasm corresponds 1:1 to a host-side OwnedCapability<T>.
wasi:filesystem/types (descriptor resource)One OwnedCapability<File> or OwnedCapability<Directory> per descriptor. Preopened dirs become resource handles passed at instantiation.
wasi:clocks/{monotonic-clock,wall-clock}Timer / future wall-clock cap.
wasi:random/{random,insecure}EntropySource cap.
wasi:sockets/tcp (tcp-socket resource)TcpSocket cap.
wasi:cli/{stdin,stdout,stderr,environment,exit}Per-instance CapSet projection.
wasi:http/incoming-handler / outgoing-handlerMatch capOS HttpEndpoint / Fetch (drafted in service-architecture-proposal.md).

Components in the same store can pass resources to other components; the host mediates the move. This maps directly to capOS capability transfer semantics — the same shape as the kernel’s result-cap insertion for typed cap returns from a CALL.

Capability Handle Path in the Module

How a wasm module receives and refers to a capOS capability is one of the load-bearing design questions. Three options:

  1. Preview 1 + integer fds, host-side fd table only (recommended for v0). All caps live in the host process. The module sees integer fds. The host adapter maps fds to OwnedCapability<T> slots in its own per- instance table. Works with every existing wasip1 binary unchanged. A wasm module cannot pass a typed cap to another wasm module without going through the host.
  2. Custom externref import (alternative; not recommended). Requires the reference-types proposal (supported by wasmi >=0.24, wasmtime, wasmer; partial in wasm3). The host adapter exports custom imports like cap_call_ref that take an externref typed handle. This is non-standard and locks capOS into a one-vendor WIT dialect that no other host implements; it would also delay Preview 2 adoption because the dialect would need its own mapping policy.
  3. Preview 2 / Component Model resources (target for W.7+). Resources in the Component Model are unforgeable typed handles. Components that import wasi:filesystem/types.descriptor receive a handle that is the host-side OwnedCapability<File>. Components can pass resources to other components in the same store; the host mediates. Direct match to capOS capability transfer semantics.

Recommendation: ship Preview 1 + integer fds for v0; defer rich typed-cap-in-module support to Preview 2 in W.7. Skip the externref custom-import path entirely.

Per-Instance vs Per-Process Model

Two reasonable shapes:

  1. One wasm instance per capos-wasm process (recommended for v0). Faults are isolated at the capOS process boundary. Fuel and budget enforcement are per-process and use the existing capOS resource accounting. Manifest-grant shape stays simple: each manifest entry names one binary and one cap bundle.
  2. Many instances per capos-wasm process (alternative). Better density. Suits hosting many small modules (plugin systems, embedded scripts). Adds host-side scheduling concerns: a runaway instance can starve siblings; fuel/budget enforcement now has to demultiplex; the poll_oneoff reactor question becomes load-bearing.

Recommendation: one instance per process for v0. Revisit when instance count actually matters. The capOS process boundary is already a strong isolation primitive; trading it away for density before density is needed adds complexity for no v0 unlock.

Per-Instance CapSet Plumbing

Each loaded module gets a per-instance capability bundle. The host adapter receives manifest grants and projects them onto WASI imports.

The shape needs to land alongside argv/env passing — argv for wasm modules has the same lifecycle question as argv for native processes. When a future capOS LaunchParameters surface lands it becomes the canonical source for both argv and env. Until then, a small bounded text grant in the host adapter manifest is acceptable for v0 (Open Question §6 / §7).

Sketch of the manifest shape (pre-LaunchParameters):

wasm_host: {
    binary: "thing.wasm"
    args: ["--input", "data"]
    caps: {
        console:   @console
        timer:     @timer
        random:    @random
        // preopen 3 → home namespace; preopen 4 → tmp namespace, etc.
        preopens: [
            { fd: 3, namespace: @home_namespace, name: "/home" }
            { fd: 4, namespace: @tmp_namespace,  name: "/tmp" }
        ]
    }
}

Same authority model the rest of capOS uses: every cap the module sees is named in the manifest and granted by the parent. The wasm sandbox is defence in depth on top of capability scoping, not a replacement.

Trust Boundaries

BoundaryNative capOS serviceWASI host adapter + module
Authority sourceProcess CapSetHost CapSet then per-instance subset
Memory isolationPage tablesWasm linear-memory bounds-check plus page tables (host process)
Code integrityW^X + NXWasm module validation plus immutable WebAssembly.Module
Cap forgeryKernel-owned CapTableHost-owned per-instance fd table or resource-handle table; module sees opaque ints/handles only
Resource limitsKernel quotasWasm fuel + memory cap + host-side per-instance time/byte budgets
Side channelsHardware-level (Spectre etc.)Same hardware level, plus wasm-specific (e.g. timer resolution)

Wasm does not weaken capOS isolation; it adds a second software-enforced boundary that contains an entire untrusted module image. This is exactly the property that makes WASI a good fit for plugin and script loading.

What WASI Does Not Solve

  • fork(): cannot clone wasm linear memory mid-execution. Same reason the browser-wasm proposal documents. POSIX programs that fork-then-exec must use posix_spawn-shaped equivalents, or the host adapter must spawn a new wasm instance.
  • Synchronous signals: no preemption inside a wasm module without cooperative yield points or interrupted execution. Fuel exhaustion is the only deterministic interruption; gross preemption is “host kills the instance”. Acceptable for plugins.
  • Threads without wasm-threads: requires shared-memory + atomics + bulk-memory features and a runtime that supports them. Out of scope for v0.
  • Live mmap of files: wasm linear memory is not file-backed. Workable only for small read-or-write cycles.

Phase Decomposition

Smallest reviewable slices ordered by dependency. Each phase is independently demoable and gates the next.

Phase W.0 — Decision and host runtime selection (planning)

  • Decide runtime: wasmi vs WAMR (recommendation above).
  • Land this proposal and the matching docs/tasks/ task record for the first WASI host-adapter slice.
  • Resolve cross-cutting open questions §1, §3, §6, §7, and §8 below (the §8 vendoring posture decision gates the W.1 scaffold layout).

Deliverable: agreed proposal plus dispatchable task record. No code.

Phase W.1 — capos-wasm host process scaffold (no WASI yet)

Status: host-runtime scaffold landed 2026-05-05 19:12 UTC. Manifest and make run-wasm-host smoke moved into Phase W.2 (see Status note below).

  • New crate capos-wasm/ — userspace process built on capos-rt.
  • Vendor the chosen runtime (wasmi recommended; one local cargo dep patched for no_std + alloc if needed).
  • Host process can WebAssembly.compile(bytes) then instantiate(no imports) then run an empty _start. No imports resolved yet.
  • Manifest: new system-wasm-host.cue boots one host process with one embedded .wasm blob (the smoke binary).
  • Smoke: make run-wasm-host boots, host loads the empty blob, prints [wasm-host] empty module instantiated and exited, host exits cleanly.

Status note (revised 2026-05-06 20:19 UTC): the v0 W.1 slice landed only the host-runtime substrate — the capos-wasm/ standalone crate, the vendored vendor/wasmi-no_std/wasmi-1.0.9/ snapshot, and the make capos-wasm-build target — without a wasm-host binary, system-wasm-host.cue manifest, or make run-wasm-host smoke. The binary/manifest/smoke trio was rolled into Phase W.2 and landed there in W.2 sub-slice 1 (2026-05-06 20:19 UTC) using an inline 8-byte empty wasm module as the payload. Earlier drafts of this status note worried about re-cutting the same host binary twice (once empty, once with a Preview 1 surface) and proposed deferring the empty-module smoke until “hello, wasi” was ready; the actual outcome went the other way: the empty-module regression is its own slice that exercises wasmi’s Module::new + Linker::instantiate_and_start end-to-end on capOS, and later W.2 sub-slices extend the same binary in place with the Preview 1 import resolver and language-level smokes.

Deliverable: a wasm runtime crate compiles and links inside the capOS userspace no_std + alloc build. No imports, no host functions, no WASI. Validates the runtime crate works in no_std + alloc userspace and that the vendored wasmi snapshot exposes Engine and Store<HostState> to a future host binary.

Validation: make capos-wasm-build succeeds against targets/x86_64-unknown-capos.json with no_std + alloc; make fmt-check and the host test gates remain green; the kernel and other userspace crates are untouched (no kernel surface, no schema/capos.capnp change, no init/ change).

Phase W.2 — WASI Preview 1 stdout-only

Inherits from W.1: the wasm-host binary, system-wasm-host.cue manifest, and make run-wasm-host smoke originally listed under W.1 land here in sub-slice 1, so the same binary that future sub-slices extend with the Preview 1 import surface also provides the empty-instantiation smoke.

The phase is landing in four sub-slices, not one big drop, to keep each diff reviewable. random_get production wiring stays owned by Phase W.4 (entropy + clocks production-ready); W.2 leaves it stubbed as ERRNO_NOSYS:

  • W.2 sub-slice 1 (landed): wasm-host binary, system-wasm-host.cue empty-instantiation manifest, make run-wasm-host smoke, and the one-time userspace ABI bump (USER_STACK_BASE etc.) that wasmi’s ~3 MiB BSS forced.

  • W.2 sub-slice 2 (landed 2026-05-07 08:03 UTC): Preview 1 stdout-only imports (args/environ as empty, clock_time_get(MONOTONIC), proc_exit, fd_write(1,…)/fd_write(2,…)); everything else stubs as ERRNO_NOSYS including random_get (Phase W.4 promotes that to production). The wasm-host smoke now drives a 114-byte hand-encoded probe module that calls random_get, stores the returned errno in an exported global, and refuses to print the nosys=52 proof line unless it equals ERRNO_NOSYS.

  • W.2 sub-slice 3 (landed 2026-05-07 09:36 UTC): Rust hello, wasi smoke (demos/wasi-hello-rust/, system-wasi-hello-rust.cue, make run-wasi-hello-rust). The wasm-host binary now optionally reads a BootPackage cap, walks the manifest’s binaries[] for the wasi-payload entry, instantiates it through the same Preview 1 linker, and explicitly invokes the _start export (wasmi’s instantiate_and_start runs the WebAssembly start section, NOT WASI’s _start). The sub-slice 1+2 regression keeps running first; the existing make run-wasm-host smoke continues to pass because it does not grant boot.

  • W.2 sub-slice 4 (landed 2026-05-07 10:53 UTC): C hello, wasi smoke (demos/wasi-hello-c/, system-wasi-hello-c.cue, make run-wasi-hello-c). The wasm-host payload-load path landed in sub-slice 3 carries the C .wasm payload too — sub-slice 4 only added the C toolchain wiring (system clang-18 with --target=wasm32-wasi --sysroot=/usr against the Ubuntu wasi-libc + libclang-rt-18-dev-wasm32 packages), the second manifest, the matching smoke harness, and these closeout stamps. Phase W.2 is done.

  • W.2 sub-slice 1 (landed 2026-05-06 20:19 UTC): the wasm-host userspace binary, system-wasm-host.cue empty-instantiation manifest, tools/qemu-wasm-host-smoke.sh assertion harness, and the userspace-image budget bump that wasmi’s ~3 MiB BSS requires. USER_STACK_BASE moved from 0x60_0000 to 0x100_0000 in capos-config/src/process_layout.rs; RING_VADDR (capos-config/src/ring.rs) and CAPSET_VADDR (capos-config/src/capset.rs) shifted in lockstep, and every linker.ld assertion (init/, capos-rt/, demos/, shell/, capos-wasm/) and the system-spawn.cue stack-overlap-elf fixture were updated to match. No Preview 1 imports yet — the binary instantiates the inline 8-byte empty wasm module and exits cleanly through the existing capos-rt entrypoint.

  • W.2 sub-slice 3 (landed 2026-05-07 09:36 UTC) and W.2 sub-slice 4 (landed 2026-05-07 10:53 UTC): language-level Rust + C hello, wasi smokes plus the manifest-payload load path on the wasm-host binary. Phase W.2 is closed by sub-slice 4.

Sub-slice 1 (landed) delivered:

  • The wasm-host userspace binary built on the W.1 scaffold, instantiating an inline 8-byte empty wasm module through wasmi::Linker::instantiate_and_start.
  • Manifest system-wasm-host.cue (empty-instantiation regression).
  • Smoke make run-wasm-host (asserted by tools/qemu-wasm-host-smoke.sh).

Sub-slice 2 (landed) delivered:

  • capos-wasm/src/wasi/preview1.rs Preview 1 import resolver on top of the existing wasm-host binary, registering 46 wasi_snapshot_preview1 imports against a fixed-arity wasmi::Linker<HostState>.
  • Implemented surface: args_get, args_sizes_get, environ_get, environ_sizes_get (all return zero counts / empty buffers); clock_time_get(CLOCKID_MONOTONIC) via the host’s TimerClient (CLOCKID_REALTIME returns ERRNO_NOSYS until a wall-clock cap exists); proc_exit via capos_rt::syscall::exit; fd_write(1, …) and fd_write(2, …) via the host’s Console.write byte path with a fixed 4 KiB scratch ceiling (oversize total → ERRNO_INVAL); all other Preview 1 imports stubbed as ERRNO_NOSYS (including random_get, which Phase W.4 promotes against EntropySource).
  • Manifest update (system-wasm-host.cue now grants Console + Timer) and smoke harness update (tools/qemu-wasm-host-smoke.sh asserts the new [wasm-host] preview1 imports linked: ...; nosys=52 proof line in addition to the empty-instantiation regression).
  • Probe-driven evidence: a 114-byte hand-encoded probe module imports random_get, calls it once at instantiation, stores the returned errno in an exported global, and the host refuses to print the proof line unless the global reads back as ERRNO_NOSYS = 52.

Sub-slice 3 (landed 2026-05-07 09:36 UTC) delivered:

  • demos/wasi-hello-rust/ standalone crate built against the upstream wasm32-wasip1 target. Source is a single println!; the produced hello.wasm (~40 KiB) imports environ_get, environ_sizes_get, fd_write, and proc_exit from wasi_snapshot_preview1, all of which the sub-slice 2 resolver already implements.
  • capos_wasm::payload helper module: streams the capnp-encoded SystemManifest blob through BootPackage.readManifestChunk (4 KiB chunks) and walks binaries[] via raw capnp readers to return the bytes for a named entry. The wasm-host binary calls this only when the manifest grants the optional boot (BootPackage) cap, so the sub-slice 1+2 make run-wasm-host smoke – which does not grant boot – keeps passing unchanged.
  • system-wasi-hello-rust.cue manifest: lists the wasm-host ELF and the wasi-payload blob, grants Console + Timer + BootPackage to the wasm-host, and reuses the shared cue/defaults package.
  • tools/qemu-wasi-hello-rust-smoke.sh smoke harness: asserts the existing sub-slice 1 + 2 proof lines, the new Hello from WASI on capOS payload stdout (the load-bearing evidence), and the clean process/scheduler exit pair. The wasm-host payload-stage proof line is not asserted because wasi-libc’s _start is allowed to terminate via proc_exit from inside the Preview 1 import handler, in which case the host process exits before the wasm-host can print its post-_start proof line.
  • make wasi-hello-rust-build cargo wrapper that clears RUSTFLAGS/CARGO_ENCODED_RUSTFLAGS so the kernel-target rustflags pinned in the repo .cargo/config.toml do not leak into the wasm build.
  • capos-rt re-export additions: capos_capnp and default_reader_options are now reachable from capos_rt::* so capos-wasm keeps a single direct path-dep on capos-rt and the vendored wasmi tree (adding capos-config directly to capos-wasm triggered an unrelated cargo workspace-inheritance error against the vendored wasmi at vendor/wasmi-no_std/wasmi-1.0.9/).

Sub-slice 4 (landed 2026-05-07 10:53 UTC) delivered:

  • demos/wasi-hello-c/ standalone C smoke (NOT a Cargo crate; built directly with system clang-18 + lld via the Makefile wasi-hello-c-build target). Source is a single printf("Hello, wasi from capOS C\n") main() compiled with --target=wasm32-wasi --sysroot=/usr against the Ubuntu wasi-libc + libclang-rt-18-dev-wasm32 apt packages; the produced hello-c.wasm (~46 KiB) imports five functions from wasi_snapshot_preview1: fd_close, fd_fdstat_get, fd_seek, fd_write, and proc_exit. fd_write and proc_exit reach the host’s granted Console cap and the clean capos-rt exit path implemented in sub-slice 2; fd_close, fd_fdstat_get, and fd_seek return ERRNO_NOSYS = 52 from the same sub-slice 2 stub surface, which is sufficient for wasi-libc’s stdout-only path.
  • system-wasi-hello-c.cue manifest: same shape as system-wasi-hello-rust.cue, lists the wasm-host ELF and the wasi-payload blob, grants Console + Timer + BootPackage to the wasm-host, and reuses the shared cue/defaults package.
  • tools/qemu-wasi-hello-c-smoke.sh smoke harness: asserts the existing sub-slice 1 + 2 proof lines, the new Hello, wasi from capOS C payload stdout (the load-bearing evidence), and the clean process/scheduler exit pair.
  • make wasi-hello-c-build target that runs system clang with RUSTFLAGS/CARGO_ENCODED_RUSTFLAGS cleared (matching the wasi-hello-rust-build shape so the two flows stay symmetric).
  • No host-side change to capos-wasm/: the manifest-payload load path landed in sub-slice 3 carries the C .wasm payload through the same wasm-host binary unchanged.

Deliverable: the first WASI-hosted, sandboxed portable-payload language path lands on capOS. Both Rust (wasm32-wasip1) and C (wasm32-wasi) hello, wasi payloads run inside the wasmi interpreter under the wasm-host capOS process and reach the host’s granted Console cap through Preview 1 fd_write. Native C already boots through the libcapos C-substrate (make run-c-hello) and the POSIX adapter (make run-posix-dns-smoke); this phase specifically adds the WASI-hosted path – in particular, C runs on capOS through the WASI surface without requiring any libcapos/POSIX work in tree, because the wasm-host’s host-side imports cover everything the wasi-libc stdout-only path needs.

Phase W.2 closed 2026-05-07 10:53 UTC. Phase W.3 closed 2026-05-07 18:25 UTC. Phase W.4 closed 2026-05-07 20:09 UTC.

Phase W.3 — Per-instance CapSet plumbing + LaunchParameters

Status: landed 2026-05-07 18:25 UTC. Per-instance CapSet selection keeps using the existing manifest cap-grant block on initConfig.init.caps (no new cap needed for the v0 argv path); the new surface is the bounded-text argv grant on initConfig.init.wasiArgs. The wasm-host pulls it out of the manifest blob through its already-granted BootPackage cap, validates it against the bounds in capos-wasm/src/payload.rs (WASI_ARGS_MAX_COUNT = 32, WASI_ARGS_MAX_ARG_BYTES = 4096, WASI_ARGS_MAX_TOTAL_BYTES = 8192), packs it into a per-instance HostState argv buffer, and reflects it back through Preview 1 args_get / args_sizes_get. A 2026-05-13 successor mirrors the same bounded-text pattern for environment variables through initConfig.init.wasiEnv, validated against WASI_ENV_MAX_COUNT = 32, WASI_ENV_MAX_ENTRY_BYTES = 4096, and WASI_ENV_MAX_TOTAL_BYTES = 8192, with interior NULs rejected before the payload instantiates. Open Question §5 / §6 / §7 status is recorded in the section below; a future capOS LaunchParameters cap is still the migration path for argv and environment together.

  • Per-instance CapSet selection: keeps using the manifest-defined cap-grant block (initConfig.init.caps) the W.2 sub-slice 3 / 4 smokes already exercised. Phase W.3 does not add a new cap; it adds the wasiArgs bounded-text grant alongside the cap list. Future phases (W.4 entropy, W.5 namespaces, W.6 sockets) will extend the same caps block with their respective surfaces.
  • Bounded-text argv grant: initConfig.init.wasiArgs is a CUE text list. Schema/schema/capos.capnp is unchanged because initConfig is already CueValue and unknown sub-fields under initConfig.init are ignored by the existing manifest decoder. The wasm-host walks the field directly through raw capnp readers in capos-wasm/src/payload.rs::read_wasi_args. An absent or empty wasiArgs keeps the W.2 “no argv” behaviour (args_sizes_get reports zero, args_get writes nothing) so the existing make run-wasm-host, make run-wasi-hello-rust, and make run-wasi-hello-c smokes stay unchanged.
  • Bounded-text environment grant: initConfig.init.wasiEnv is a CUE text list of entries such as KEY=value. It uses the same raw capnp reader path as wasiArgs, the same no-schema-change initConfig CueValue extension point, and the same empty-by- default behavior: absent or empty wasiEnv makes environ_sizes_get report zero and environ_get write nothing. Oversized entry count, oversized individual entries, oversized packed total bytes, and interior NUL bytes make wasm-host abort with stable exit codes rather than truncating or corrupting the WASI Preview 1 NUL-terminated layout.
  • Migration to a future LaunchParameters cap: when capOS gains a capability-shaped LaunchParameters surface (the same one envisioned by docs/proposals/userspace-binaries-proposal.md Part 5 and the future shell launch flow), the wasm-host will swap read_wasi_args for a typed LaunchParametersClient lookup and the manifest-side wasiArgs field becomes redundant. The bounds constants stay relevant either way (a typed LaunchParameters cap will still need byte ceilings before it ships argv into wasm linear memory).
  • Smoke: demos/wasi-cli-args/ (Rust, wasm32-wasip1) reads argv[1] and prints it through println! -> fd_write(1, …) -> the host’s Console cap. The harness (tools/qemu-wasi-cli-args-smoke.sh) asserts the existing sub-slice 1 + 2 regression lines plus the load-bearing capos-wasi-cli-args-sentinel line.

Deliverable: per-instance CapSet selection works (commit landed 2026-05-07 18:25 UTC; smoke make run-wasi-cli-args).

Phase W.4 — WASI Preview 1 random + clocks production-ready

Status: landed 2026-05-07 20:09 UTC. The wasm-host looks up an optional per-instance EntropySource cap from the CapSet under the well-known name random. When the manifest grants it, the typed EntropySourceClient is installed on HostState after the W.2 sub-slice 2 probe regression runs (so the probe’s random_get(0, 0) call still observes the closed-fail ERRNO_NOSYS = 52 path byte-identically with the W.2/W.3 proof line). Preview 1 random_get then drains arbitrary wasm-supplied byte ranges into the manifest-granted entropy stream by chunking against the kernel cap’s per-call MAX_ENTROPY_FILL_BYTES = 64 ceiling and walking up to RANDOM_GET_MAX_BYTES = 65_536 total bytes per Preview 1 invocation. Truncated kernel responses, RDRAND unavailable status, and any transport-level error surface as ERRNO_IO; out-of-bounds wasm pointer writes surface as ERRNO_FAULT; oversized requests surface as ERRNO_INVAL. The ungranted-variant manifest still routes Preview 1 random_get through the no-grant refusal branch which never enters the kernel, so an instance without an EntropySource grant cannot leak entropy.

  • Wire the kernel EntropySource cap (the in-tree CSPRNG capability; see EntropySourceClient and KernelCapSource::EntropySource) through the host adapter as the backing for random_get. The same cap is the natural future analogue of the browser’s crypto.getRandomValues surface.
  • Wall-clock support stays deferred until capOS has a typed WallClock / RealTimeClock cap. clock_time_get(CLOCKID_REALTIME) keeps returning the W.2 sub-slice 2 sentinel ERRNO_NOSYS so a Preview 1 guest can distinguish “host refused” from a kernel / transport failure; future phases promote it once the wall-clock cap lands. The monotonic clock keeps using the manifest-granted Timer cap unchanged.
  • Smoke: demos/wasi-random/ (Rust, wasm32-wasip1) reads N=64 bytes via a raw Preview 1 import binding (avoiding wasi-libc’s panic-on-errno wrapper so the ungranted-variant payload can print a refusal sentinel and exit with code 52 rather than aborting). The granted-variant smoke (make run-wasi-random / tools/qemu-wasi-random-smoke.sh) asserts the W.2 sub-slice 1 + 2 regression proof lines, the load-bearing [wasi-random] entropy_bytes=64 entropy_bound_ok=true line, and a clean exit; the ungranted-variant smoke (make run-wasi-random-ungranted / tools/qemu-wasi-random-ungranted-smoke.sh) asserts the same regression lines plus the load-bearing [wasi-random] random_get returned errno=52 (ENOSYS) refusal sentinel and refuses the granted-variant entropy line.

Deliverable: Preview 1 random_get is wired to the kernel EntropySource cap with the closed-fail refusal contract, the clock_time_get(REALTIME) deferral is documented, and the ungranted-variant smoke proves both. A 2026-05-13 compatibility slice also promotes authority-free Preview 1 imports that need no new cap: clock_res_get(CLOCKID_MONOTONIC) returns the monotonic nanosecond resolution, sched_yield returns success as a no-op, fd_fdstat_get for stdout/stderr returns character-device write metadata, and fd_seek for stdout/stderr returns ERRNO_SPIPE. The direct-import make run-wasi-stdio-fd smoke requires all promoted imports to return non-ERRNO_NOSYS results. The remaining non-filesystem / non-socket Preview 1 imports that still return ERRNO_NOSYSpoll_oneoff, proc_raise, fd operations that need file or close-state authority, and the path_* paths – stay future work; promoting each to “honest” needs either the typed capability it would route through (for example a WallClock / RealTimeClock cap for REALTIME or namespace/file caps for storage fds and paths) or an explicit decision to keep the NOSYS refusal as the v0 honest behaviour. Phase W.4 closed 2026-05-07 20:09 UTC.

Harness-hardening landed on 2026-05-13: make run-wasi-preview1-refusals boots a direct-import payload that calls representative blocked filesystem/socket imports with no Namespace/File/Store/socket authority in the manifest and requires each return to equal ERRNO_NOSYS = 52. The initial slice (2026-05-13 08:50 UTC) covered path_open, fd_prestat_get, fd_read, sock_send, sock_recv; a follow-up (2026-05-13 21:15 UTC) extended the harness to also cover fd_pread, fd_pwrite, path_create_directory, and sock_shutdown, bringing the total to nine covered imports. As each filesystem import gains a real implementation its no-preopen errno migrates from ERRNO_NOSYS = 52 to ERRNO_BADF = 8 (path_open / fd_prestat_get / fd_read with Phase W.5; path_create_directory on 2026-05-24 10:09 UTC; fd_pread / fd_pwrite when positional I/O landed – see below); the harness asserts the current errno per import rather than a blanket NOSYS. Only the socket imports (sock_send / sock_recv / sock_shutdown) still return ERRNO_NOSYS = 52. This records fail-closed evidence for the current surface only; it does not implement W.6 behavior.

Phase W.5 — WASI Preview 1 filesystem (landed 2026-05-17 05:42 UTC)

  • Map preopened-dir fds to a manifest-granted root Directory cap from the per-instance CapSet. The v0 surface ships a single preopen at fd 3 named /preopen-0; the manifest CapSet slot name is root (matching the POSIX adapter P1.4 Slice 4 bootstrap). Namespace / Store integration is deferred until a use case requires the content-addressed pseudo-fs shape – the kernel caps remain available for a future slice (storage Phase 3 slice 3 landed them).
  • Implement path_open, fd_read, fd_write, fd_seek, fd_close, fd_filestat_get, fd_prestat_get, and fd_prestat_dir_name against the kernel Directory / File cap interface in capos-wasm/src/wasi/fs.rs. The resolver mirrors POSIX P1.4 Slice 4 (libcapos-posix/src/path.rs): non-leaf segments walk Directory.sub; the leaf mints either an existing or freshly created File via Directory.open(flags=CREATE|TRUNCATE).
  • Preview 1 base and inheriting rights are stored in the host fd table. The single preopen advertises only implemented directory/path rights and inheritable File rights; path_open refuses requested base or inheriting rights outside the preopen’s inheriting set, and opened File fds retain exactly the requested rights. fd_fdstat_get reports those stored rights, and fd_fdstat_set_rights can only attenuate them. fd_read, fd_write, fd_pread, fd_pwrite, fd_seek, fd_tell, fd_filestat_get, and fd_filestat_set_size check the stored File rights before constructing a FileClient; path_create_directory, path_remove_directory, path_unlink_file, path_filestat_get, fd_readdir, and preopen fd_filestat_get check the preopen rights before constructing a DirectoryClient or resolving the path.
  • WASI fd_close only releases the local cap-table slot. The kernel-side File.close() would invalidate the Arc<FileCap> that the parent Directory holds keyed by entry name, breaking re-open of the same path; WASI semantics expect fd_close to release the per-process fd without deleting the underlying file. New path_open calls for the same path mint a fresh local handle against the same kernel-side entry.
  • Preopen sandbox: the resolver refuses absolute paths (leading /) and parent-escape segments (.., .) with ERRNO_NOTCAPABLE = 76. The single preopen has no parent reachable through any path syntax.
  • The make run-wasi-fs smoke (system-wasi-fs.cue, demos/wasi-fs/, tools/qemu-wasi-fs-smoke.sh) completes a full path_open(CREAT+TRUNC) / fd_write / fd_close / re-open / fd_filestat_get / fd_seek / fd_read round trip, asserts both the absolute-path refusal and the parent-escape refusal, and proves narrowed File/preopen rights fail closed with ERRNO_NOTCAPABLE before the underlying File/Directory client call. The make run-wasi-preview1-refusals smoke continues to prove the fail-closed contract for an ungranted manifest: path_open(3, ...), fd_prestat_get(3), and fd_read(3, ...) now return ERRNO_BADF = 8 (no preopen) instead of the pre-W.5 stub ERRNO_NOSYS = 52 (path_create_directory joined this BADF group 2026-05-24 10:09 UTC, and fd_pread / fd_pwrite joined when positional I/O landed – see below); only the socket imports continue to return ERRNO_NOSYS.
  • Kernel authority surface landed 2026-05-14 (RAM-backed File, Directory, Store, and Namespace kernel caps with QEMU smokes run-file-server-smoke, run-directory-server-smoke, run-store-namespace-smoke). W.5 wires the wasm-host adapter to the Directory / File subset of that authority; Store / Namespace integration is deferred until a use case requires it.
  • fd_readdir landed 2026-05-24 08:44 UTC over the existing preopen Directory cap (DirectoryClient::list – no schema or generated-bindings change). fs::fd_readdir_impl enumerates the preopen, rejecting open file fds with ERRNO_NOTDIR = 54 and unknown fds with ERRNO_BADF = 8; preview1::fd_readdir serializes the fixed 24-byte little-endian Preview 1 dirent records (d_next, zero d_ino, d_namlen, d_type from DirEntry.is_dir) followed by name bytes, with cookie-based resume and a short-buffer truncation contract that never writes past buf_len. The make run-wasi-fs smoke now also enumerates the smoke.txt it created (readdir_found_smoke=true) and proves the short-buffer truncation.
  • fd_tell and fd_filestat_set_size landed 2026-05-24 09:34 UTC, completing the File-cap method triad (no schema or generated-bindings change – File.truncate already shipped). fs::fd_tell_impl is a pure host-side read of the maintained FileEntry::position (symmetric with fd_seek’s SET/CUR branches); fs::fd_filestat_set_size_impl calls FileClient::truncate_wait and leaves the file offset unchanged per the WASI contract. preview1::fd_tell returns ERRNO_SPIPE = 70 on a stdio fd (mirroring fd_seek) and writes the position as LE-u64; preview1::fd_filestat_set_size rejects a negative size with ERRNO_INVAL = 28 and maps non-file fds to ERRNO_BADF = 8. The make run-wasi-fs smoke now asserts fd_tell reports the post-write position (tell_ok=true) and fd_filestat_set_size shrinks the file (truncate_size=4), plus the stdio refusals for both imports.
  • path_create_directory and path_remove_directory landed 2026-05-24 10:09 UTC over the preopen Directory cap (DirectoryClient::mkdir / remove – no schema or generated-bindings change; Directory.mkdir/remove already shipped). fs::path_create_directory_impl / path_remove_directory_impl reuse the path_open resolve-parent-and-leaf path and the same preopen sandbox, so absolute paths and .. segments are refused with ERRNO_NOTCAPABLE = 76 before any kernel call; the mkdir result-cap (a fresh Directory handle the WASI layer does not retain) is released immediately to avoid leaking a cap-table slot. The make run-wasi-fs smoke now creates subdir, confirms it via fd_readdir (directory d_type), removes it, confirms it is gone, and asserts the directory-write sandbox refusals (mkdir_ok=true rmdir_ok=true dir_escape_refused=true). Implementing path_create_directory moves its no-preopen errno from ERRNO_NOSYS = 52 to ERRNO_BADF = 8 (the base-fd preopen lookup precedes the path), so the make run-wasi-preview1-refusals harness now asserts it in the BADF group.
  • fd_pread and fd_pwrite landed 2026-05-30 14:49 UTC as positional I/O over the host File cap (no schema or generated-bindings change – the kernel File.read / File.write methods already carry an explicit byte offset, and fd_read / fd_write already drive them). fs::fd_pread_impl / fs::fd_pwrite_impl mirror fd_read_impl / fd_write_file_impl but use the WASI-supplied offset and, per the WASI Preview 1 contract, leave FileEntry::position untouched – the defining positional-I/O invariant. preview1::fd_pread / fd_pwrite reuse the same guest-memory iovec gather/scatter helpers fd_read / fd_write were refactored onto (one walker, not two), reject a negative offset with ERRNO_INVAL = 28, and return ERRNO_SPIPE = 70 on a stdio fd (mirroring fd_seek / fd_tell). The make run-wasi-fs smoke now writes “ABCD” at offset 2, reads it back at offset 2, and asserts the fd’s stream position is unchanged (pwrite_pread_ok=true pos_unchanged=true), that a negative offset is refused (pread_neg_offset_inval=true), and that a stdio fd surfaces a non-ERRNO_NOSYS error (ppos_stdio_refused=true). The make run-wasi-preview1-refusals harness moves both imports into the BADF group (fd 3 is a bad descriptor against an absent preopen).
  • path_filestat_get and path_unlink_file landed 2026-05-30 as path-resolved metadata/removal over the host File / Directory caps (no schema / generated-bindings change). fs::path_filestat_get_impl resolves the leaf under the preopen, opens a transient read-only File (flags = 0), runs File.stat, and releases the transient cap before returning the size; fs::path_unlink_file_impl deletes the named entry through Directory.remove (the same void-result op path_remove_directory uses, which removes file leaves). Both enforce the absolute/.. ERRNO_NOTCAPABLE sandbox in resolve_parent_and_leaf before any kernel call; preview1::path_filestat_get accepts and ignores the lookupflags symlink-follow bit (no symlinks in v0) and writes the 64-byte filestat via write_filestat. The make run-wasi-fs smoke stats smoke.txt by path (size 4, regular-file type) and unlinks it, and make run-wasi-preview1-refusals moves both imports into the BADF group. The remaining ERRNO_NOSYS returns are the deliberately deferred surfaces (fd_advise, fd_allocate, the sync family, the path timestamp/symlink/link family (path_filestat_set_times, path_symlink, path_readlink, path_link, path_rename), poll_oneoff, proc_raise, and the W.6-blocked socket family).

Deliverable: a wasm module can read and write files inside a preopened capOS directory.

Phase W.6 — WASI Preview 1 sockets (gated on userspace network stack)

  • sock_send, sock_recv, etc. against TcpSocket / UdpSocket caps when the userspace network stack lands.
  • Until then, an HTTP client over Fetch / HttpEndpoint is a reasonable shim for HTTP-only use.
  • make run-wasi-preview1-refusals proves representative socket imports (sock_send, sock_recv, sock_shutdown) fail closed with ERRNO_NOSYS = 52 when no socket cap is present. This is current refusal evidence only; W.6 remains blocked until the networking authority exists.

Deliverable: a wasm module can serve HTTP requests inside a capOS process.

Phase W.7 — Move to wasmtime or migrate to WASI Preview 2 / Component Model

  • If the runtime selected in W.0 was wasmi, decide whether to swap to wasmtime once std/futures runtime is available in capOS userspace.
  • Or instead promote wasmi to wasip2 / Component Model support (wasmi roadmap covers components, but maturity is behind wasmtime).
  • Map WIT resources to typed OwnedCapability<T> slots. This is the natural place to bridge capOS capabilities into wasm as first-class typed handles. Capability transfer between wasm components becomes a host-mediated resource handoff.
  • Component-Model support enables cap-aware Rust crates to export their typed interfaces as WIT, which a Rust capOS service can consume the same way it consumes a capnp interface.
  • Schema serial-surface coordination: this phase will likely add new variants under schema/capos.capnp for component-model resource bridging. Serialise with other schema-touching plans (docs/backlog/index.md Concurrency Notes).

Deliverable: a wasm component on capOS exports a typed interface that a native capOS process can call.

Phase W.8 — TinyGo / Go-on-WASI integration for CUE

  • Build a CUE evaluator binary against TinyGo or upstream Go’s GOOS=wasip1. Run it in the host adapter against a CUE source blob granted as a ScriptPackage (future package-cap surface, same shape as the planned LaunchParameters work).
  • Reuses existing CUE workflows; capOS just hosts the evaluator.

Deliverable: capOS can evaluate CUE manifests at runtime without the host toolchain. Bridges to the eventual native Go track (go-runtime-proposal.md).

Languages Targeting WASI

What capOS gets “for free” once the host adapter exists, ranked by how mature each language’s WASI target is. This is the leverage argument: one host adapter unlocks every row at once.

LanguageWASI statusToolchainNative capOS alternativeWhen WASI wins
Rustwasm32-wasip1 Tier 2 since 1.78; wasm32-wasip2 Tier 2 since 1.82cargo build --target wasm32-wasip2targets/x86_64-unknown-capos.json (implemented)Untrusted Rust plugins. Cross-compiled tools.
C / C++wasi-libc + Clang --target=wasm32-wasi; wasi-sdk packagedclang --target=wasm32-wasifuture libcaposAny C/C++ tool needing portability before libcapos lands. CPython-on-WASI today is the canonical example.
Go (upstream)GOOS=wasip1 since Go 1.21 (Aug 2023). Single-thread, blocking I/O, no goroutine parallelism.GOOS=wasip1 GOARCH=wasm go buildfuture GOOS=capos (go-runtime-proposal.md)CUE evaluation, go run style tools, single-goroutine compute.
TinyGowasip1 supported; wasip2 supported in dev branchtinygo build -target=wasip2n/aSmaller Go binaries; Component Model export of typed interfaces.
Python (CPython)wasm32-unknown-wasip1 Tier 2 (PEP 11)Upstream CPython buildfuture native CPython through POSIX adapterSandboxed Python plugins, configuration scripts.
AssemblyScriptDesigned for wasm; WASI host integration via runtimeascn/aLightweight typed scripting. Less interesting on capOS than Lua.
ZigNative wasm32-wasi target; no runtime overheadzig build-exe -target wasm32-wasin/aZig systems code in a sandbox.
Lua / interpreters in generalA Lua interpreter compiled to wasi runs Lua scripts in a wasm sandboxCompile any C interpreter to wasm32-wasiLua piccolo runner (lua-scripting-proposal.md)When Lua scripts are untrusted. The piccolo native-Rust runner remains the right answer for trusted capOS scripting.
JavaScriptQuickJS-on-wasi works todayCompile QuickJS to wasm32-wasiQuickJS native runner (future)Untrusted JS plugins; portable JS without writing a native QuickJS runtime.
.NET (mono-wasi)Experimentaldotnet wasi-experimentaln/aIf a port of a .NET tool is required. Low priority.

When WASI vs Native

These are complementary tracks, not competitors.

  • Native wins for foundational services, performance-critical code, anything calling typed capOS caps directly, anything needing real threads, full async I/O, or first-class participation in the cap graph.
  • WASI wins for portability or untrusted code execution, for any existing C/C++ program with wasi-libc support that cannot wait for libcapos, for CPU-bound CUE evaluation before native Go lands, and for sandboxed user-submitted scripts.

The browser-wasm proposal captures the same intuition: the cap-ring layer is the only stable interface that survives substrate swaps. The WASI host adapter is another substrate swap, this time at the language level instead of the hardware level.

Validation

The first implementation is not complete until it has QEMU evidence:

  • A wasm module prints through a granted Console / TerminalSession.
  • The same module cannot use fd_write to a fd it was not granted, cannot open a path outside its preopened namespaces, and cannot call an unimplemented WASI function without receiving ERRNO_NOSYS.
  • A missing or wrong-interface cap lookup returns the appropriate WASI errno (not a host-side panic, not silent success).
  • An owned result cap is released deterministically when the instance exits.
  • The host adapter exits cleanly and does not wedge the kernel.

Host tests should cover WASI value conversion and import-resolver generation once those pieces are pure enough to test outside QEMU. Do not claim “WASI works” from host tests alone; the useful behavior is authority-shaped wasm execution in capOS.

Open Questions

  1. Per-instance vs per-process. One wasm instance per capos-wasm process (recommended) or many? Affects fuel/budget enforcement and the manifest shape. Resolved 2026-05-13 16:46 UTC — one wasm instance per capos-wasm process. Phases W.2–W.4 shipped on top of this shape: capos_wasm::Runtime owns exactly one wasmi::Engine and one Store<HostState>, and HostState aggregates the per-instance Console / Timer / RingClient / optional EntropySource / optional BootPackage clients plus the per-instance WasiArgs / WasiEnv bundles. That host state IS the per-instance state; there is no second instance to demultiplex against. The decision aligns with capOS capability discipline: the per-process CapSet is the authority boundary, manifest grants are scoped one binary at a time (docs/capability-model.md), and the capOS process boundary already provides the fault, fuel/budget, and audit isolation a multi-tenant wasm host would otherwise need to rebuild inside the runtime. Preview 2 / Component Model migration in Phase W.7 inherits the same per-process shape — one capos-wasm process per top-level component — and gains nothing from packing many components into one process while the OS-level isolation is free. A future multi-instance host (plugin sandboxes, embedded scripts) is allowed but must come back as a separate proposal that names the density target, the fuel and poll_oneoff reactor design, and the audit/observability shape; it does not block any current phase.

  2. Capability handle path: extension import or pure WASI-only? Custom externref imports lock capOS into a non-portable WIT dialect. Working answer: skip the custom-import path entirely; jump straight to Preview 2 / Component Model in Phase W.7.

  3. poll_oneoff semantics over the capOS ring. Block the host process’s cap_enter (simple, scales to one instance per process), or run a single-thread reactor that drives multiple instances in round-robin (scales to many instances per process)? Coupled to Q1. Resolved 2026-05-13 16:46 UTC — blocking cap_enter against the single per-process instance, with the surface expanded one subscription kind at a time as the underlying caps land. v0 keeps the W.2 sub-slice 2 ERRNO_NOSYS stub already in capos-wasm/src/wasi/preview1.rs: there is no portable subset of poll_oneoff we can answer correctly without Namespace / File / TcpSocket / UdpSocket caps, and the existing make run-wasi-preview1-refusals harness proves the refusal closes cleanly. Phase W.5 (filesystem) is the first phase that consumes a real subscription kind — eventtype_clock against monotonic time plus eventtype_fd_read / eventtype_fd_write against preopen-fd File handles — and will implement those subscription kinds by walking the subscription array, demultiplexing each subscription onto a single blocking cap_enter over the per-process ring, and returning the events the kernel completes. Phase W.6 adds the socket subscription kinds against TcpSocket / UdpSocket once the userspace network stack lands. A multi-instance reactor stays out of scope: §1 resolves to one wasm instance per capos-wasm process, so poll_oneoff only ever has to demultiplex one instance’s subscription set, and the kernel ring is already a completion-queue primitive that fits that shape directly. Realtime clock subscriptions remain ERRNO_NOSYS until a typed WallClock cap exists (same ceiling as clock_time_get(CLOCKID_REALTIME)).

  4. Fuel budget defaults and exhaustion semantics. wasmi exposes fuel; what is the default budget per instance, and what is the exhaustion behaviour (instance traps and exits, or instance pauses pending refill from a FuelGrant cap)? Affects the cap surface. Working answer: trap-and-exit default; defer the FuelGrant cap until long-running plugins exist.

  5. Typed result-cap from a host call into a wasm module. Preview 1 has no externref. How does the host hand a typed cap back to the instance after a CALL that returns a transferred result cap? Working answer: v0 reifies result caps as integer fds in the per-instance fd table; the host returns fd numbers from capability-issuing imports. Defer typed caps in wasm imports to Preview 2 / Component Model in Phase W.7, where WIT resources match the shape directly. Phase W.3 status (2026-05-07 18:25 UTC): unchanged. W.3 does not introduce any capability-issuing import, so no result-cap reification path landed; the working answer carries forward into W.5 (filesystem) / W.6 (sockets), which are the first phases that will exercise it.

  6. environ_get source. Empty-by-default, or backed by a KeyValueScope / ConfigOverlay cap? Resolved by Phase W.3 (2026-05-07 18:25 UTC) and the 2026-05-13 follow-up — bounded manifest-provided text grant, empty when absent. Migration to a future LaunchParameters cap remains the open path. Original working answer: empty for v0 unless the manifest supplies a bounded text environment grant; bind to whatever environment cap a future capOS LaunchParameters surface produces (no in-tree plan owns this yet; the shell proposal sketches the broader launch-args/environment discussion). Phase W.3 decision (2026-05-07 18:25 UTC): kept empty-by-default and shipped the argv text grant only. 2026-05-13 update: the same bounded manifest-text pattern now exists as initConfig.init.wasiEnv, a CUE text list under the existing initConfig CueValue field (no schema/capos.capnp change). Capacity bounds in capos-wasm/src/payload.rs:

    • WASI_ENV_MAX_COUNT = 32 environment entries.
    • WASI_ENV_MAX_ENTRY_BYTES = 4096 per entry (NUL terminator not included).
    • WASI_ENV_MAX_TOTAL_BYTES = 8192 for the packed environment buffer including per-entry NUL terminators. Interior NUL bytes inside an entry are rejected. The decoder tolerates an absent or empty wasiEnv, in which case Preview 1 environ_get / environ_sizes_get report zero entries (the W.2 behavior). A future LaunchParameters cap remains the migration path for argv and environ together.
  7. args_get source. Reuse a future capOS LaunchParameters surface (not yet in tree), or ship a wasm-host-specific text grant in the manifest until that surface lands? Resolved by Phase W.3 (2026-05-07 18:25 UTC) — bounded manifest-provided argv text grant on initConfig.init.wasiArgs, migrating to the future LaunchParameters cap once it exists. Original working answer: ship a small bounded text grant for v0; migrate to the future LaunchParameters surface once it exists. Phase W.3 decision (2026-05-07 18:25 UTC): shipped as initConfig.init.wasiArgs, a CUE text list under the existing initConfig CueValue field (no schema/capos.capnp change). Capacity bounds in capos-wasm/src/payload.rs:

    • WASI_ARGS_MAX_COUNT = 32 argv entries.
    • WASI_ARGS_MAX_ARG_BYTES = 4096 per entry (NUL terminator not included).
    • WASI_ARGS_MAX_TOTAL_BYTES = 8192 for the packed argv buffer including per-entry NUL terminators. Interior NUL bytes inside an argv entry are rejected (would corrupt the WASI Preview 1 NUL-terminated layout). Each violation surfaces through a stable wasm-host exit code so harnesses can distinguish them from generic decode failures. The decoder tolerates an absent or empty wasiArgs, in which case Preview 1 args_get / args_sizes_get report zero entries (W.2 behaviour). Migration to the future LaunchParameters cap stays the open path per the original working answer.
  8. Vendoring posture for wasmi. vendor/wasmi-no_std/ (forked, patched) or a cargo-vendor-style mirror of upstream default-features = false? Same question as the piccolo Lua track. Resolved 2026-05-05 19:12 UTC: mirror-as-is. The vendored snapshot at vendor/wasmi-no_std/wasmi-1.0.9/ is a static-pinned copy of upstream v1.0.9 with no source patches; cargo default-features = false strips std/wat cleanly out of the box. Provenance and refresh procedure are recorded in vendor/wasmi-no_std/VENDORED_FROM.md. This posture is independent of what the Lua track chooses; if the two tracks diverge, document the divergence in each track’s VENDORED_FROM.md.

  9. WASI module distribution and versioning. Shipped inline in a manifest blob (today), or via a future Store/Namespace? Working answer: inline blobs for v0; revisit after the storage proposals land.

  10. Component-Model adoption timeline. Skip Preview 1 entirely and target Preview 2 from day one? Possible with wasmtime, harder with wasmi today. Working answer: ship Preview 1 first because it unlocks Rust, C, Go, Python, TinyGo immediately; layer Preview 2 on once wasmi’s component support hardens or migrate to wasmtime.

  11. Out-of-tree wasm packaging. Will capOS ship pre-built .wasm binaries from the boot manifest only, or will operators bring their own? Same scoping question as the future LaunchParameters / package-cap surfaces. Working answer: in-tree only for v0–v6; out-of-tree once a Store cap can hold blobs.

  12. Audit cap shape for wasm instance lifecycle events. Same open question as Lua scripting Phase 4. Component-Model paths benefit from per-instance audit because resource handoffs are interesting events to record. Working answer: defer until the userspace audit cap surface exists.

Progress 2026-05-13 16:46 UTC: §1 (per-instance vs per-process) and §3 (poll_oneoff semantics) resolved. §1 is locked at one wasm instance per capos-wasm process, matching the per-process Runtime + Store<HostState> shape shipped through Phases W.2–W.4 and the per-process CapSet authority boundary; future multi-instance hosting must come back as a separate proposal. §3 keeps the W.2 sub-slice 2 ERRNO_NOSYS poll_oneoff stub for v0 and pre-commits Phases W.5 / W.6 to extend it one subscription kind at a time (monotonic clock + fd read/write in W.5 against Namespace/File caps, sockets in W.6 against TcpSocket/UdpSocket caps), demultiplexed onto a single blocking cap_enter over the per-process ring; multi-instance reactors remain out of scope. §6 (environ_get) and §7 (args_get) reclassified as resolved by Phase W.3 (2026-05-07 18:25 UTC) with the bounded manifest-text grants on initConfig.init.wasiEnv / initConfig.init.wasiArgs; the migration path to a future LaunchParameters cap is preserved.

Relationship to Other Proposals

  • Userspace Binaries owns the broader native-binary, language, and POSIX-adapter roadmap. This proposal supersedes Part 5 of that proposal with the full WASI host adapter design.
  • Programming Languages is the reader-facing summary of language support; the WASI row points at this proposal.
  • Browser/WASM is the separate browser-hosted wasm experiment. Both proposals share wasm-runtime insight but target different substrates.
  • Lua Scripting is the trusted capability-scoped script runner using a native (likely piccolo) Lua VM. WASI-hosted Lua is the untrusted alternative.
  • Go Runtime is the native GOOS=capos alternative to Go-on-WASI. Go-on-WASI is the v0 path for CUE evaluation; native Go is the path for full Go runtime semantics.
  • Storage and Naming defines the Directory / File / Store / Namespace surfaces that Phase W.5 consumes.
  • Networking defines the TcpSocket / UdpSocket surfaces that Phase W.6 consumes.
  • Service Architecture defines Fetch / HttpEndpoint, useful as the v0 networking shim before the full userspace network stack lands.

Proposal: POSIX Compatibility Adapter

How capOS should host POSIX-shaped C software without recreating the ambient authority that makes POSIX hard to confine, and which two ports validate the adapter for the first time.

Problem

capOS is not POSIX and is not trying to become POSIX. But useful software – DNS resolvers, line-editing libraries, shells, archivers, compilers, network clients – assumes a POSIX surface. Rewriting each of these in capability- native Rust would forfeit decades of debugging, security review, and performance work for no isolation gain: a POSIX program whose only authority is a typed capability set is already as confined as an equivalent native one.

The risk pattern is the one POSIX historically gets wrong: a translation layer that synthesises ambient authority (a global /, an inherited credential table, a kernel-managed file descriptor map) rebuilds the property capOS is trying to leave behind. A useful adapter must do the opposite – every POSIX call must be backed by a typed capability the calling process already holds, or it must fail closed with a documented errno.

Two upstream programs are the natural first validators of that adapter:

  • A POSIX shell exercises the broadest surface (process, pipe, file, env, signal stubs, stdio).
  • A DNS resolver exercises the smallest network surface (UDP socket, one-shot poll-equivalent, time, log).

Both are already small, mature, and BSD/MIT-licensed. Picking the smallest representative of each category makes the adapter’s first job a real port, not a synthetic test.

Scope

In scope:

  • A two-layer C substrate: libcapos (thin Rust staticlib, capability ring + CapSet + raw syscalls + heap, C ABI) and libcapos-posix (POSIX shape on top: fd table, errno, path resolution, posix_spawn shim, signal stubs, pthread mapping).
  • A first POSIX shell port that builds against libcapos-posix with no hidden ambient authority.
  • A first DNS resolver port that builds against libcapos-posix with no hidden ambient authority.
  • Phase decomposition (P1.1, P1.2, P1.3) that defers the adapter’s biggest dependencies (Namespace + File caps for the shell file path; UDP cap for the resolver) into clearly-named gating phases.
  • Validation through QEMU smokes that prove granted and ungranted paths.

Out of scope for the first implementation:

  • Binary compatibility with Linux ELFs. Both ports are sources-on-disk recompiled against libcapos-posix.
  • Full POSIX compliance. The adapter ships exactly the surface dash and dns.c exercise, plus any free additions that fall out.
  • Real fork() (parent state inheritance, COW, sibling address-space surgery before exec). Only fork() followed promptly by execve() is supported, via a posix_spawn-shaped shim.
  • Real signal delivery. signal()/sigaction() accept the call, store the handler, never invoke it. kill(2) requires a future ProcessHandle cap.
  • Job control, process groups, sessions, controlling terminals.
  • musl, glibc, or any other host libc. The substrate is Rust-authored and exposes a C ABI; it is not a libc port.
  • Hosted C++. ABI decisions for C++ remain tracked in docs/proposals/userspace-binaries-proposal.md.

Current Manual Pages

  • Programming Languages summarizes POSIX adapter status relative to Rust, C/C++, Python, Go, Lua, and WASI tracks; the C row records the shipped libcapos.a + libcapos_posix.a surface, and the POSIX-shaped software row records P1.1/P1.2/P1.3 closeouts plus the in-progress P1.4 dash-port phase shape over the bootstrap-granted root Directory cap surface, including the signal/time stub closeout.
  • Userspace Binaries Part 4: POSIX Compatibility Adapter sketches the POSIX adapter at a higher level. This proposal supersedes that sketch with the full design surface; the userspace-binaries proposal continues to own the broader native-binary, language, and adapter roadmap.
  • Userspace Runtime documents the implemented capos-rt surface that libcapos mirrors for C consumers.
  • Networking defines NetworkManager, TcpListener, and TcpSocket and explicitly defers UdpSocket until DNS / userspace-network work needs it. The DNS resolver port in this proposal defines the UDP cap surface; the TCP cap surface is reused unchanged.
  • Storage and Naming defines the Namespace, Directory, File, and Store cap shape; these gate the shell port’s filesystem surface (Phase 2/3 of that proposal).
  • Service Architecture frames the future Resolver cap as the long-term consumer of the resolver process built in this track.
  • Shell covers the native capos-shell. The POSIX shell port (dash) is for porting validation, not as a replacement for the native shell.
  • WASI Host Adapter is the parallel untrusted-portable execution path; both proposals share fd-table and per-import authority insight, but target different substrates.

Research Grounding

Relevant research and external references:

  • POSIX shell candidates surveyed: dash (Debian Almquist Shell, ~13 kSLOC, BSD; the canonical small POSIX-strict shell); busybox ash; OpenBSD ksh (oksh); toybox toysh. Source repositories cited inline in the candidate comparison table.
  • DNS resolver candidates surveyed: dns.c by William Ahern (single-file MIT, ~10 kSLOC, no dependencies); c-ares; GNU adns; udns; SPCDNS; musl’s embedded res_query; trust-dns-resolver. Source repositories cited inline in the candidate comparison table.
  • libcapos prior art: this proposal builds on the libcapos shape sketched in Userspace Binaries “Future: C via libcapos” / “Future Phase: libcapos for C”. The C substrate is designed as a Rust staticlib with a C ABI rather than musl, redox relibc, or a hand-rolled libc. Fuchsia’s fdio + musl pattern and Redox’s relibc pattern are the comparable points; capOS deliberately picks neither.
  • POSIX surface translation: Cygwin’s fork() emulation is the closest prior art for fork-for-exec semantics on top of a non-fork substrate; the capOS shim inverts the default (capOS cannot fork; the shim emulates the useful case) but uses the same call-pattern recognition.

In-tree research grounding:

  • Genode – per-session typed service interfaces and resource accounting are the closest precedent for routing every POSIX wrapper through a typed cap rather than through an ambient kernel syscall table. POSIX adapter wrappers should follow the same pattern at the library boundary instead of the kernel boundary.
  • OS Error Handling – cross-OS comparison of error-model surfaces. Informs the bidirectional mapping between CapError / CapException and POSIX errno (Open Question §4) and the decision to keep one shared mapping table at the C boundary rather than per-wrapper bespoke mappings.
  • LLVM Target – target triple, calling convention, and bare-metal toolchain options for capOS C consumers. Informs Open Question §11 on the linker / toolchain choice (clang --target=x86_64-unknown-none-elf -nostdlib -static).

This proposal also lifts the capability-mapping shape and the “every translation has authority backing” property from the WASI host adapter proposal, and the libcapos staticlib shape from the userspace-binaries proposal Part 2. It deliberately does not adopt the musl + __syscall hook pattern noted in the userspace-binaries proposal “musl as a Base (Optional, Later)” section, because the layered Rust staticlib shape is preferred over a libc port for the v0 surface.

External:

  • dash – Debian Almquist Shell, ~13 kSLOC, Debian’s /bin/sh since Squeeze (2011).
  • busybox ash – alternative Almquist port, embedded.
  • oksh – portable OpenBSD ksh, public domain, larger surface.
  • toybox toysh – 0BSD, currently incomplete.
  • c-ares – modern async DNS resolver, MIT, larger.
  • dns.c – single-file non-blocking DNS, MIT, no deps.
  • GNU adns – async DNS resolver, GPL-2.0+.
  • musl resolver – embedded in musl libc; not available without linking musl.
  • udns – small async stub-only resolver, LGPL-2.1.

Design Principles

  1. POSIX is not a kernel feature. The kernel sees ordinary userspace processes with a CapSet and a capability ring. libcapos and libcapos-posix are static libraries linked into those processes.
  2. Two layers, one C ABI per layer. libcapos is the C-ABI mirror of capos-rt: capability ring, CapSet, raw syscalls, heap. It has no errno, no fd table, no open/read/write. libcapos-posix builds the POSIX shape on top. Programs that do not need POSIX semantics may link only libcapos.
  3. Authority is per-process, granted at spawn. Every fd a POSIX program sees was granted to its parent process at spawn time and projected onto an fd by libcapos-posix. There is no ambient /, no inherited credential table, no global signal source.
  4. Schema-first, not POSIX-first, at the boundary. Each POSIX wrapper is backed by a typed capability call with a documented errno mapping. POSIX-shaped integer fds and POSIX-shaped errno are an ABI requirement of the C substrate, not a capability-model concession.
  5. Fail closed. Any unimplemented POSIX call returns ENOSYS and sets errno. Any cap lookup that fails returns the documented errno. Programs cannot probe absent caps for ambient behaviour.
  6. No fork without exec. Only fork() followed by execve() is supported. The shim turns the pair into posix_spawn(). Bare fork() used to clone state in-process fails on the next non-trivial syscall.
  7. No real signals. Handlers are accepted and stored, never delivered. kill(2) requires a future ProcessHandle cap and even then is limited to SIGKILL. Programs that depend on SIGCHLD job control are out of scope.
  8. The C substrate is Rust. libcapos and libcapos-posix are Rust crates with crate-type = ["staticlib"], all symbols #[no_mangle] extern "C". This is not musl, not a hand-rolled libc.

Architecture

flowchart TD
    Shell["POSIX shell binary<br/>(e.g. dash)"]
    Resolver["DNS resolver binary<br/>(e.g. dns.c)"]
    Posix["libcapos-posix<br/>(POSIX adapter, Rust staticlib, C ABI)"]
    PosixDetail["fd table per process<br/>path resolver over Namespace + Store<br/>errno mapping (TLS cell)<br/>posix_spawn over ProcessSpawner<br/>signal stubs<br/>pthread over ThreadSpawner"]
    Posix --> PosixDetail
    Capos["libcapos<br/>(thin Rust staticlib, C ABI)"]
    CaposDetail["cap_call / capset_get / capset_iter<br/>sys_exit / sys_cap_enter<br/>heap (malloc/free over capos-rt allocator)<br/>typed wrappers for Console / Terminal / etc."]
    Capos --> CaposDetail
    Rt["capos-rt<br/>(no_std + alloc Rust)"]
    Ring["capability ring"]
    Kernel["kernel CapObject dispatch"]
    Services["userspace services"]

    Shell -->|"open/read/write/exec/..."| Posix
    Resolver -->|"socket/sendto/recvfrom"| Posix
    Posix -->|"extern C"| Capos
    Capos -->|"Rust FFI re-export"| Rt
    Rt --> Ring
    Ring --> Kernel
    Ring --> Services

libcapos is the C-ABI projection of capos-rt. libcapos-posix is the POSIX projection on top. Every POSIX call ultimately resolves to either a capability invocation through the ring or a synthetic answer (errno, ENOSYS) computed without authority.

libcapos: C-Facing Substrate

Headers expected to ship under include/capos/:

// capos.h -- capability primitives only
typedef struct cap_ring cap_ring_t;
typedef uint32_t        cap_id_t;
typedef uint64_t        iface_id_t;

cap_ring_t *capos_ring(void);                     // process ring handle
int  cap_call(cap_ring_t *ring,
              cap_id_t cap, uint16_t method,
              const void *params, size_t plen,
              void *result, size_t rlen,
              size_t *out_len);
int  capset_get(const char *name,
                cap_id_t *out_cap, iface_id_t *out_iface);
size_t capset_iter(void (*cb)(const char*, cap_id_t, iface_id_t,
                              void*), void *ud);
_Noreturn void sys_exit(int code);
uint32_t       sys_cap_enter(uint32_t min_complete, uint64_t timeout_ns);

// Heap (backed by capos-rt fixed heap; grow-on-demand later if needed)
void *capos_malloc(size_t);
void  capos_free(void*);
void *capos_calloc(size_t, size_t);
void *capos_realloc(void*, size_t);

There is no errno here, no open/read/write. Those live one layer up. libcapos is the C-ABI mirror of capos-rt: startup, ring, CapSet, raw syscalls, heap.

Build artifact: target/.../libcapos.a plus headers. Naming for the C library is intentionally just libcapos, mirroring how the Rust runtime crate is capos-rt. The C library name libcapos is distinct from any Rust service framework that may carry a similar name; this proposal owns the C-substrate name and treats Rust-framework naming as out of scope.

libcapos-posix: POSIX Surface

Headers under include/capos/posix/: unistd.h, fcntl.h, errno.h, sys/socket.h, netdb.h, sys/stat.h, dirent.h, string.h, stdlib.h (subset), sys/types.h, pthread.h (subset), signal.h (stub).

Implementation language: Rust, same crate-type pattern as libcapos, but linked separately so a binary that does not need POSIX can omit it.

Errno bridge: per-thread errno cell stored in TLS slot owned by libcapos-posix; populated by every wrapper that maps a Rust CapError to a POSIX errno value. See “errno Convention” below.

File descriptor table

Per-process userspace state inside libcapos-posix. Not a kernel object – neither libcapos nor the kernel know anything about fds.

#![allow(unused)]
fn main() {
// libcapos-posix/src/fd.rs (sketch)
struct FdEntry {
    backing: FdBacking,       // Console / Stream / Listener / File / Dir
    flags:   i32,             // O_NONBLOCK, FD_CLOEXEC, ...
    cursor:  u64,             // for seekable backings
}

enum FdBacking {
    Stdin,                    // Console / TerminalSession (read side)
    Stdout,                   // Console (write side)
    Stderr,                   // Console (write side)
    File   { file: Cap<File>, dirty: bool },
    Dir    { dir:  Cap<Directory>, iter: usize },
    Tcp    { sock: Cap<TcpSocket> },
    Udp    { sock: Cap<UdpSocket> },
    Listener { l: Cap<TcpListener> },
}

static FD_TABLE: Mutex<BTreeMap<i32, FdEntry>> = ...;
static NEXT_FD:  AtomicI32 = AtomicI32::new(3);
}

dup/dup2/close operate on this table. dup increments a refcount on the underlying cap; close releases when the last fd holding the cap drops. Cap drop runs through capos-rt owned-handle release. The fd table is a strict per-process userspace structure; it is not shared with the kernel and is never serialised on the wire.

Standard fds wired at _start:

  • fd 0: stdin cap from CapSet (TerminalSession, Console, or future StdinReader-shaped cap, whichever is granted).
  • fd 1: stdout Console cap.
  • fd 2: stderr Console cap (or distinct Log cap if granted).

Process model: fork-for-exec only

capOS process creation is ProcessSpawner.spawn(name, binaryName, grants) (kernel/src/cap/process_spawner.rs). There is no fork(), no exec()-in-place.

Decision matrix (working answers; the policy choice is Open Question §6 and is not settled until that question is confirmed):

OptionWhat it providesCostWorking answer
Emulate fork() as posix_spawn with inherited cap-set, recording inter-call dup2/close as posix_spawn file actionsExisting fork+exec and fork+dup2+exec pipeline patterns work with one patch siteDaemonisation and arbitrary COW state inheritance between fork and exec still breakRecommended primary for the shell, with documented “fork-for-exec only” semantics. Whether the shim records inter-call file actions or requires the port to call posix_spawn with explicit file actions is Open Question §6.
Return ENOSYS for any fork()HonestEvery POSIX program that uses fork must be patchedRecommended safety net when fork-for-exec is misused
Process-shadow: a “POSIX process” wraps a capOS processGeneralLarge kernel + runtime change; doubles process accountingRecommended reject for v0; revisit only if a real POSIX program needs it

Working answer: fork-for-exec, with hard-fail as the safety net (subject to Open Question §6 confirmation before P1.3 begins). Two libcapos-posix shim variants are on the table; §6 selects between them:

  • Variant A – recording shim. libcapos-posix exposes fork() and execve() as a coupled shim that:
    1. fork() records “next exec is the real spawn” in TLS and returns 0 unconditionally. Only the if (pid == 0) branch ever executes; the legacy else branch is unreachable because pid is always 0. Porters MUST move the parent flow (drop unused write end, drain read end, waitpid) to AFTER the if-block, with the synthetic pid handed off via child = execve(...); near the end of the if-body. Pictorially:
      pid_t child = fork();          // returns 0 unconditionally
      if (child == 0) {
          dup2(...); close(...);     // recorded into TLS
          child = execve(...);       // returns synthetic_pid > 0
          if (child < 0) {           // surface to error path
              goto exec_failed;
          }
      }
      /* parent flow runs here, NOT in an else branch */
      close(...);
      read(...);
      waitpid(child, ...);
      
      There is no else branch in the v0 contract, only the post-if parent flow.
    2. dup2() / close() calls between fork() and execve() are recorded as posix_spawn file actions on the pending spawn rather than mutating the parent’s fd table.
    3. execve(path, argv, envp) consumes the recorded intent, calls ProcessSpawner.spawn() with attenuated grants and the recorded file actions, and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX). The pseudo-child branch is still the original parent process, so porters MUST NOT call _exit() on failure: _exit() would terminate the actual shell. The recommended pattern surfaces the failure to the caller’s normal error path:
      int spawn_pid = execve(...);
      if (spawn_pid < 0) {
          /* execve() failed before any spawn; recording state is
           * already cleared and the parent fd table is unchanged.
           * Return up to the caller with the matching errno. */
          goto exec_failed;     /* or equivalent error-recovery path */
      }
      child = spawn_pid;        /* parent flow: waitpid(child) */
      
      On failure execve() returns -1 with errno set; callers MUST surface the failure to their normal error path rather than calling _exit(), because the pseudo-child branch is still the parent process and _exit() would terminate the actual shell.
    4. Any fork() not followed by execve() before a syscall outside the recorded-action allowlist (e.g. setsid) returns -1 / ENOSYS on that downstream call.
  • Variant B – patched-port shim. libcapos-posix exposes only posix_spawn() with explicit file actions, plus stub fork() / execve() that return -1 / ENOSYS. Each port (dash and successors) is patched to translate its fork+dup2+exec sequence into a single posix_spawn() call with the equivalent file actions.

posix_spawn() is the preferred primitive in either variant and gets a direct mapping to ProcessSpawner.spawn(). The choice between Variant A and Variant B is Open Question §6.

fd-backing-cap inheritance (kernel precursor). For a fork/execve child to inherit a parent fd that is backed by an opened Directory/File cap, that cap must be forwardable through ProcessSpawner.spawn. Read-only Directory/File caps are now minted Copy/SameSession (directory::transfer_result_cap, readonly_fs, installable_image, and the kernel:directory/kernel:file bootstrap sources), so the shim can forward an opened read-only directory or file to the spawned child as a Raw spawn grant; the child looks it up by name from its CapSet and projects it back onto the inherited fd. The disk-backed writable filesystem stays NonTransferable (single-writer policy), so a writable fd cannot be inherited this way. The kernel handoff is proven in isolation by make run-spawn-grant-directory; see Capability Model “Read-only filesystem caps are forwardable”. The recording shim emits these grants. As of posix-recording-shim-full-fd-inherit (done 2026-05-27) inheritance is full-fd-table by default, matching POSIX fork+execve: execve forwards every open parent slot – not only dup2/close-touched ones – as a stdio_<child-slot> spawn grant, with the recorded actions applied as edits on top of that baseline. Per backing: Directory/Console/File/TerminalSession forward as SpawnGrantMode::Raw over the Copy-transferable cap (the parent keeps its own fd; an aliased slot Copy-shares to several child slots), and a Pipe end forwards as a single Move (leaving the parent slot a Moved sentinel). A slot marked FD_CLOEXEC/O_CLOEXEC is dropped from the child unless an explicit recorded dup2 named that child slot (POSIX dup2 clears close-on-exec). A non-forwardable backing inherited implicitly (Udp, an already-moved slot, or a shared Pipe) is skipped non-fatally; an explicit dup2 of one fails closed. The child’s posix_inherit_stdio() reconstructs each grant into the matching fd slot by interface id, wrapping an inherited directory fd through fdopendir(). End-to-end proofs make run-posix-fd-inherit-default (parent inherits stdio + directory by default with no stdio dup2; CLOEXEC fd excluded; terminal retained via Raw; Copy-share alias) and make run-posix-execve-inherit-smoke (the explicit-dup2 parent, now redundant but still correct). Because the v0 POSIX open surface mints only Copy/SameSession File/Directory caps, the disk-backed writable NonTransferable filesystem cannot enter the fd table here; if a future writable open path mints one, full inheritance needs a pre-spawn transferability check to skip it (today it would surface as the whole-spawn ENOEXEC). An inherited File resets to offset 0 (the parent’s seek position is userspace state that does not travel with the cap).

The recording-shim execve(path, argv, envp) path also forwards argv without changing the generated ProcessSpawner.spawn(name, binaryName, grants) schema: the parent validates the C argv vector, writes a bounded binary argv record into a private Pipe, and grants only the read end to the child as posix_argv. Child code opts in with posix_args(), which prefers posix_argv when present and otherwise falls back to manifest initConfig.init.posixArgs through the boot BootPackage cap. The pipe payload is capped by the existing 4 KiB Pipe transport, so direct large manifest posixArgs remain the wider PID-1 channel. Malformed or over-budget execve argv fails before fd-action replay; the focused proof asserts this does not mutate the parent’s fd table.

Signals

Stubbed. capOS has no signal mechanism today and the cap model disagrees with ambient asynchronous interrupts.

  • signal() / sigaction() accept the call, store the handler in a per-process table, never invoke it. Return success.
  • kill(pid, sig) returns -1 / EPERM unless the caller has a ProcessHandle cap for the target – and even then the only signal honoured would be SIGKILL, which maps to a future ProcessHandle.kill() outside this v0 POSIX surface.
  • raise(sig) returns -1 / ENOSYS. Self-delivery is still signal delivery, and capOS v0 intentionally does not fake it.
  • sigemptyset / sigfillset / sigaddset / sigdelset / sigismember are real bit operations on the caller’s sigset_t (a uint64_t). sigprocmask keeps a per-process blocked mask so ports can save and restore it during job control, honours SIG_BLOCK / SIG_UNBLOCK / SIG_SETMASK, and force-clears SIGKILL / SIGSTOP per POSIX – but the mask is stored, never enforced, because there is no delivery to block. sigpending always reports an empty set for the same reason.
  • pause() / sigsuspend() / sigwait() block forever (or with timeout) via sys_cap_enter(0, timeout); they never wake from a signal.
  • SIGPIPE is never delivered. Writes on a closed connection return -1 / EPIPE.

This is acceptable for a shell + DNS resolver. Anything that depends on real signals (job control with Ctrl-Z, Ctrl-C across pipelines, real SIGCHLD) is out of scope for the first port. Job control in the shell must be reimplemented over typed control caps, not signals.

errno convention

Per-thread errno cell in TLS owned by libcapos-posix. Mapping table (libcapos-posix/src/errno_map.rs):

capOS CapError / CapExceptionPOSIX errno
CapError::NotFoundENOENT
CapError::PermissionDeniedEACCES
CapError::DisconnectedECONNRESET
CapError::TimeoutETIMEDOUT
CapError::ResourceExhaustedENOMEM / EMFILE (context dependent)
CapError::InvalidArgumentEINVAL
CapError::WouldBlockEAGAIN
(fall-through)EIO

Wrappers always: clear errno, call, on error set errno + return -1 (int) or NULL (pointer). Same convention as glibc / musl.

Threading

pthreads -> capOS in-process threading. Substrate already exists in the kernel: ThreadSpawner, ThreadControl, ThreadHandle, per-thread FS-base, ParkSpace.

Mapping:

  • pthread_create -> ThreadSpawner.spawn + start-routine trampoline.
  • pthread_exit -> ThreadControl.exitThread.
  • pthread_join -> ThreadHandle.join (block via cap_enter).
  • pthread_self -> TLS slot or ThreadControl.currentId.
  • pthread_mutex_* -> ParkSpace-backed mutex (futex-style park / unpark).
  • pthread_cond_* -> ParkSpace + bounded waiter queue.
  • pthread_key_* -> fixed-size TLS slot table per thread.

This is in scope but not on the critical path for the shell or DNS resolver – both can run single-threaded for v0. The pthread shim is deferred to a v1 successor.

First Port: POSIX Shell

Candidate survey

ShellLicenseSizeDepsPOSIX coverageVerdict
dash (upstream)BSD~13 kSLOC, ~134 KBtiny libc subset; no readline; no termcapStrict POSIX, no extensionsRecommended primary
busybox ash (upstream)GPL-2.0~8 kSLOC of shell/ash.c + busybox infraDesigned for embedded, modularPOSIX + selectable extensionsHeavier framework cost; useful later when capOS wants a coreutils set
toybox toysh (upstream)0BSDcurrently incompleteDesigned for self-contained ELFPOSIX + Bash compat target, not finishedSkip – explicitly described upstream as still under development
oksh (upstream)Public domain~308 KB binary, 0 depsOptional ncurses for clear-screen onlyKorn-shell superset of POSIXBigger surface than v0 needs to validate libcapos-posix
Custom Rust shelln/an/an/an/aReject – defeats the purpose of porting C. Native shell already exists at shell/ (capos-shell).

Recommended primary: dash.

Reasons:

  1. Smallest established POSIX-strict shell. ~13 kSLOC is small enough for the porting team to read the entire codebase.
  2. No readline / termcap dependency. The shell talks to whatever fd 0 gives it. This is exactly what libcapos-posix provides through TerminalSession or Console.
  3. Strict POSIX means the port does not accidentally validate Bash extensions that libcapos-posix does not implement.
  4. Already proven as a porting target on Linux from Scratch, OpenWrt, and Alpine. Patterns for replacing the libc layer (__syscall, stubbed sigaction) are well documented.
  5. Debian uses it as /bin/sh since Squeeze (2011), so any “POSIX shell only” script base in the wild is dash-compatible.

Open Question §1 below records this candidate as the final decision (Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC)).

Required POSIX surface (v0)

What a dash instance actually exercises before printing a prompt and running ls | grep foo:

GroupCalls (minimum set)Backed by
Process startup_start shim, argv/envp parsing, exitlibcapos _start, sys_exit
Stdioread(0,...), write(1,...), write(2,...)Console / TerminalSession cap
Allocationmalloc/free/calloc/realloclibcapos heap
String/formatprintf/fprintf/memcpy/strlen/strcmp/strchr/strncpy/…libcapos-posix string/printf subset
File I/Oopen/close/read/write/lseek/stat/fstat/access/unlinkNamespace + File caps
Directoryopendir/readdir/closedirDirectory cap
Pipespipe(), dup2(), close() on fdsNEW Pipe capability (P1.3)
Processfork+execve (fork-for-exec only), posix_spawn, wait/waitpidProcessSpawner + ProcessHandle.wait
Envgetenv/setenv/putenvPer-process env vector in libcapos-posix; populated from a future LaunchParameters cap when one lands
Signalssignal/kill/sigaction (stubs)TLS-stored handlers, never delivered
Timetime/gettimeofday/nanosleepTimer cap
Control flowsetjmp/longjmp over jmp_buflibcapos x86_64 SysV global_asm (<setjmp.h>); no sigsetjmp
Miscgetpid/getuid/getgidgetpid from capos-rt bootstrap pid; uid/gid hardcoded for v0

The control-flow row was absent from the original minimum set above; dash’s exception/interpreter control flow is built on setjmp/longjmp over a real jmp_buf (pervasive in error.h/main.c/eval.c/parser.c/trap.c), so it is a hard precursor for the dash build pipeline. It landed via the libc-setjmp-longjmp task: the x86_64 SysV primitive in libcapos/src/setjmp.rs with a <setjmp.h> header, re-exposed under libcapos-posix/include/capos/posix/, and proven in QEMU by make run-posix-setjmp. sigsetjmp/siglongjmp are intentionally absent (dash uses only the plain primitive; the v0 signal layer has no asynchronous delivery and thus no signal mask to save).

Like the control-flow row, the table above also understated the header layout and breadth of the libc surface a program of dash’s size needs. A -nostdinc compile/link probe of the full vendored dash TU set (2026-05-25 21:40 EEST) showed dash uses bare POSIX includes (<unistd.h>, <fcntl.h>, …) — not the capOS capos/posix/*.h namespace — so it requires a -nostdinc capOS POSIX sysroot plus a missing surface. This landed (2026-05-25 22:23 UTC, libc-dash-sysroot-surface): libcapos-posix/sysroot/include/ is the bare-header sysroot forwarding into the capos/posix/* namespace, and the surface was completed — strerror/qsort/umask/abort/setlocale/getrlimit/times/tcgetattr/ strtoll/strtoull/sig_atomic_t/NSIG/sigsuspend, the str* set, the <termios.h>/<sys/resource.h>/<sys/times.h>/<locale.h>/<sys/types.h> headers, and further items the table still understated: the C/POSIX-locale multibyte layer (<wchar.h>/<wctype.h>, mbrtowc/wctype/iswctype/…) that expand.c uses unconditionally, strpbrk, lstat, getgroups, wait3, vfork, byte-order helpers, environ, and the sys_siglist array. The full vendored dash TU set now compiles -nostdinc against the sysroot with no unresolved libc symbols; proof make run-c-libc-surface. The dash build pipeline (posix-p1-4-dash-build-pipeline) landed on top of it (2026-05-26 05:11 UTC): make dash builds and links target/dash/dash.elf. See docs/backlog/posix-adapter-dash-port.md Slice 12.5.

Critical gap: pipe(). The shell pipeline ls | grep foo requires fd 1 of ls to feed fd 0 of grep. capOS has no pipe capability today. This is the first-port-blocking item; see Phase P1.3.

What dash will not get in v0

  • Job control (Ctrl-Z, bg, fg, & background): requires real SIGCHLD/SIGTSTP. Skip; documented as out of scope.
  • Process groups, sessions, controlling terminals: same reason.
  • trap for signals other than EXIT: handlers stored, never fired.
  • read -t (timeout): doable via Timer cap; defer to v1.
  • ulimit: returns 0 / ENOSYS. Quotas are kernel-side capability ledgers, not POSIX rlimits.

Validation smoke

make run-posix-shell-smoke:

  1. Boot a manifest that grants dash a TerminalSession (stdio), a read-only bootstrap-granted Directory cap rooted at a tiny in-rodata pseudo-fs (the resolver remains Namespace-shaped for forward parity with the future userspace Namespace service; the v0 manifest grants a Directory because that is what Storage Phase 3 slice 2 ships as a kernel CapObject today), a ProcessSpawner narrowed to one allowed binary (ls-shim), and a Timer cap.
  2. Pipe a heredoc into stdin: ls; echo done.
  3. Assert kernel log shows done and clean exit.

Stretch goal smoke: cat foo | grep bar end-to-end (depends on the pipe primitive landing).

First Port: DNS Resolver

Status update (post-smoltcp). The original v0 DNS smoke (posix-dns-resolver, Phase P1.2 Phase B) drove a hand-rolled A query through a raw kernel UdpSocket cap; that smoke is retired with the qemu-only kernel UDP owner. Name resolution now goes through a typed system DnsResolver capability (network-system-dnsresolver-cap-local-proof), and libcapos-posix exposes the standard POSIX surface over it: getaddrinfo / freeaddrinfo / gai_strerror (src/netdb.rs, include/capos/posix/netdb.h) resolve one IPv4 A result through a granted dns_resolver endpoint and map the typed resolver status onto addrinfo / EAI_*, with no ambient UDP fallback (a process without the cap gets a deterministic EAI_FAIL). A read-only /etc/resolv.conf projection is materialized at open() time from the resolver status (writes fail closed with EACCES; absent without the cap). Proof: make run-posix-getaddrinfo. The candidate survey below is retained as the original design rationale; vendored dns.c is no longer on the critical path for the resolver bridge. AAAA / sockaddr_in6, AI_* flags, and /etc/services remain follow-ups (each fails closed: EAI_FAMILY / EAI_BADFLAGS / EAI_SERVICE).

Candidate survey

LibraryLicenseSource sizeDepsAsync styleVerdict
musl res_query (upstream)MIT~2 kSLOC for resolver coreEmbedded in muslSynchronous (parallel queries internally)Available only if the build links musl; capOS does not. Skip.
c-ares (upstream)MIT, C89~30+ kSLOC, multi-file, configure-drivenPOSIX sockets, optional threadsNative async (callbacks + select/poll/event loop)Largest surface, most mature, most invasive port
dns.c (wahern) (upstream)MITsingle-file C, ~10 kSLOC, no depsNone – caller provides socket I/O via three pluggable patterns (pollfd / events / timeout)Non-blocking, no required callback shapeRecommended primary
GNU adns (upstream)GPL-2.0+Multi-file, ~10-15 kSLOCPOSIX, no event-loop integrationAsync, opaque stateLicense is GPL-2.0+, not BSD/MIT. Skip unless capOS accepts a GPL component in the demo path.
udns (upstream)LGPL-2.1smallPOSIXAsync stub-onlyLGPL plus older project; skip unless dns.c blows up
SPCDNSLGPLsmallencode/decode only, no socketn/aSkip – provides no resolver loop
trust-dns-resolver in RustApache-2 / MITlargeTokioasyncReject – defeats the purpose of porting C. Native Rust resolver is a separate path.

Recommended primary: dns.c by William Ahern.

Reasons:

  1. Single-file, zero deps. Drops into the build with a minimal cc rule. The build avoids configure scripts, pkg-config, optional feature matrices, and multi-file build orchestration.
  2. No fixed I/O model. dns.c is designed around three common methods (pollfd, events, timeout). The host adapter plugs capability-backed socket I/O without rewriting the resolver core, replacing socket()/sendto()/recvfrom()/poll() with libcapos-posix wrappers that return fd-shaped results backed by UdpSocket / TcpSocket caps.
  3. MIT license is capOS-compatible.
  4. ~10 kSLOC means port review can read it end-to-end.
  5. C89, no threading assumption, no global state surprises (resolver handle is opaque per-instance) – fits a single-process v0 design.

Open Question §2 below records that the candidate is a recommendation, not a final decision.

Required POSIX surface (v0)

The DNS resolver port exercises a very narrow POSIX subset:

GroupCallsBacked by
Stdio (logs only)write(2,...)Console cap
Allocationmalloc/free/calloc/realloclibcapos heap
Timeclock_gettime/gettimeofdayTimer cap
Sockets (UDP)socket(AF_INET, SOCK_DGRAM, 0), sendto, recvfrom, bind, close, setsockopt (subset)NetworkManager + UdpSocket cap
Pollingpoll(fds, nfds, timeout_ms)Synthesised: each fd carries its underlying cap; libcapos-posix uses cap_enter(min_complete=1, timeout_ns) with one CQE per ready fd. No new kernel surface needed for v0 if dns.c uses one fd per query.
Resolv configOne in-rodata bounded text blob inlined into libcapos-posix (single nameserver entry; v0 ships before any storage cap exists)No open / Namespace cap required for v0

No pipes, no fork, no exec, no signals, no /etc/resolv.conf-by-path, no Namespace or File caps required. The DNS resolver is strictly easier than the shell.

The v0 surface intentionally omits TCP fallback for truncated responses and intentionally omits any path-based config file. The optional TCP fallback row uses socket(SOCK_STREAM), connect, send, recv through the existing NetworkManager + TcpSocket cap, but only on a later iteration once the v0 UDP-only smoke is green; see “What dns.c will not get in v0” below.

Critical gaps:

  • UdpSocket capability. The networking proposal Phase B implements TCP + listener only; UDP “is deferred until the userspace network stack or DNS work needs it; it is not part of the Telnet Shell Demo contract” (networking-proposal.md). The resolver port creates the UDP path; it does not consume an existing one.
  • The future Resolver cap concept (in service-architecture-proposal.md “DNS resolver – consumes a UdpSocket, exports Resolver”) is a target once the UDP path exists. The first port produces the exported shape.

What dns.c will not get in v0

  • DNSSEC validation: dns.c supports it, depending on /etc/resolv.conf trust anchor config. Defer.
  • TCP fallback for truncated responses: implement on a second iteration once the TCP capability path is reusable.
  • mDNS: out of scope.
  • Recursive mode (acting as a recursive resolver): out of scope; v0 ships stub-only.

Validation smoke

make run-posix-dns-smoke:

  1. Boot a manifest that grants the resolver process a NetworkManager (or future narrowed UdpSocket-only authority), a Console cap, and a Timer cap. The single-nameserver resolv config is the in-rodata bounded text blob compiled into libcapos-posix; no Namespace or File cap is needed for v0.
  2. The resolver opens a UDP socket, sends a query for a known A record to QEMU’s user-mode 10.0.2.3 (slirp’s built-in DNS) or to an in-host test resolver.
  3. Resolver prints the resolved IPv4 address.
  4. Assert kernel log line matches.

Trade-offs and Ordering

Smallest-deps comparison

PortC surface neededNew capOS infrastructure requiredDifficulty
DNS resolver (dns.c)malloc, time, socket subset, write(2), open RO file, poll-equivalentUDP socket cap + NetworkManager exposure of UDP; otherwise reuses Phase B TCP path infraSmaller – strictly additive (UDP is missing today but the kernel-side smoltcp stack supports it)
POSIX shell (dash)malloc, full stdio, file I/O, directory iteration, pipe(), fork-for-exec, exec, wait, env, time, signals (stub)Pipe primitive (new), Namespace+File cap surface, ProcessSpawner sidecar work to honour fd-action grants, env-vector handoffLarger – touches storage / IPC / process surfaces

Which blocks which

  • Both ports can run in parallel at the libcapos / libcapos-posix layer level: each pulls a disjoint subset of POSIX surfaces.
  • DNS resolver blocks on a new capOS surface (UDP cap exposure) but does not block on pipe(), fork(), or exec().
  • Shell blocks on (in order of probable cost): pipe primitive, ProcessSpawner fd-action support for stdin / stdout redirection, Namespace+File cap availability, env vector / LaunchParameters.
  • The library substrate (libcapos staticlib + libcapos-posix scaffold) blocks both. Once the substrate exists, the two ports proceed in parallel.
  1. libcapos staticlib v0 (Phase P1.1). The thin Rust .a with cap_call, capset_get, sys_exit, sys_cap_enter, heap. Plus a “C hello world” smoke that calls console_write_line() (mirrors the userspace-binaries proposal “Future Phase: libcapos for C”). This phase is the prerequisite for both P1.2 and P1.3.
  2. libcapos-posix scaffold – fd table, errno cell, stdio wrappers for fd 0/1/2, stub signals, _start glue that registers argv / envp from LaunchParameters (or empty arrays if that surface has not landed), basic malloc/free re-export.
  3. dns.c port (Phase P1.2). The schema half of P1.2 (the UdpSocket interface and NetworkManager.createUdpSocket method) landed in Phase A and released the shared schema serial surface; Phase B (kernel UDP path, libcapos-posix, dns.c vendoring, demo) does not re-acquire the surface and so does not contend with P1.3 on the schema half.
  4. dash port (P1.3 lays the pipe + fork-for-exec primitives; Storage Phase 3 slices 1-3 land the kernel-side File / Directory / Store / Namespace CapObjects and KernelCapSource grant sources that back the dash v0 smoke’s read-only in-rodata pseudo-fs; the actual dash vendoring is a successor task that owns the libcapos-posix file / dir / stdio / env / printf surface and the smoke harness rather than new kernel surface). P1.4 does not touch schema/capos.capnp and so does not contend on the shared schema serial surface.

Critical path

The DNS resolver is the smaller-deps first slice only because of the shell’s pipe / file dependencies. With P1.3 (pipe + fork-for-exec) and Storage Phase 3 slices 1-3 (RAM-backed File / Directory / Store / Namespace CapObjects) both landed, the shell-first prerequisite gates are closed; the remaining P1.4 work is dash vendoring + per-call-site patching, the multi-translation-unit C build, and the smoke harness.

What this slice does not promise

  • Not a path to running glibc-built binaries unchanged. Both ports are sources-on-disk recompiled against libcapos-posix. Binary compatibility with Linux ELFs is not in scope.
  • Not job control, not signals, not full POSIX session/pgrp model.
  • Not a libc – the POSIX surface ships just enough for dash and dns.c. printf family lands in libcapos-posix only because both ports need it; this is not a <stdio.h> for general use.
  • Not a reason to skip the native Rust paths – capos-shell (Rust shell/ crate) remains the default capOS shell. dash is for porting validation, not as the system shell.
  • Not a foundation for hosted C++. C++ requires explicit ABI decisions tracked separately in docs/proposals/userspace-binaries-proposal.md.

Phase Decomposition

Phases are dispatch-ready. P1.1 closed 2026-05-05 13:28 UTC at merge fe5f5208. P1.2 splits into Phase A (closed 2026-05-05 18:02 UTC, schema additions + open questions + capos-rt typed client) and Phase B (open, kernel UDP path + dns.c demo). P1.2 Phase B does not touch schema/capos.capnp and so does not contend with P1.3 on the shared schema serial surface; P1.3 still adds a Pipe interface and must queue on the surface per docs/backlog/index.md Concurrency Notes when selected.

Phase P1.1 – libcapos C-substrate v0 + C hello-world smoke

Closed 2026-05-05 13:28 UTC at merge fe5f5208 (initial slice b2e09bce, transfer-record helper 81a88fab). Delivered scope:

  • New crate libcapos/ with crate-type = ["staticlib"] (cargo [lib].name = "capos" so the archive lands as libcapos.a) exposing the capos-rt syscall, ring CALL, CapSet lookup, typed Console.writeLine wrapper, and malloc/free/calloc/realloc heap shims through extern "C".
  • Public C header at libcapos/include/capos/capos.h.
  • make c-hello builds the C smoke directly with clang + lld using the shared demos/linker.ld, links against libcapos/target/.../libcapos.a, and reuses capos-rt’s _start through libcapos’s capos_rt_main trampoline.
  • Demo demos/c-hello/ (single .c file calling console_write_line).
  • Manifest system-c-hello.cue.
  • No POSIX surface, no errno, no pthreads.
  • Validation: make run-c-hello boots; the C binary prints [c-hello] hello from c-hello (the marker tools/qemu-c-hello-smoke.sh greps) and exits cleanly.

Phase P1.2 – UDP cap surface + dns.c stub resolver smoke

P1.2 splits into two dispatch waves so the kernel-side wave can serialise behind the active DDF hostile-smoke work on kernel/src/cap/network.rs and kernel/src/virtio.rs without holding the schema-only wave.

Phase P1.2 Phase A – schema + open questions + capos-rt client

Closed 2026-05-05 18:02 UTC. Delivered scope:

  • Open questions §2 (DNS resolver = dns.c by William Ahern), §4 (errno via per-thread TLS cell exposed through __errno_location()), §5 (static-array fd table in libcapos-posix, 32-fd cap for v0), and §8 (four-method blocking UDP shape with the wait deadline owned by the ring client, not a per-method timeoutNs parameter) resolved in this proposal.
  • Schema additions to schema/capos.capnp: new UdpSocket interface (sendTo, recvFrom, close) plus the new NetworkManager.createUdpSocket method. Generated bindings refresh verified via make generated-code-check.
  • New UDP_SOCKET_INTERFACE_ID constant in capos-config/src/lib.rs.
  • New typed UdpSocketClient in capos-rt/src/client.rs, mirroring the existing TcpSocketClient shape (create/send_to/recv_from/ close).
  • Schema serial-surface release: this slice held the surface during schema additions and released it at merge.

Phase P1.2 Phase B – kernel UDP path + dns.c + demo

Closed 2026-05-05 21:21 UTC. Delivered scope:

  • Kernel: extended kernel/src/cap/network.rs with the UDP path mirroring the existing TCP path (UdpSocketCap, handle_create_udp_socket/handle_udp_send_to/handle_udp_recv_from/ handle_udp_socket_close, deferred-recv parking via PendingUdpRecv), and added UDP runtime methods on the existing scheduler-polled smoltcp runtime in kernel/src/virtio.rs (create_udp_socket/send_udp/recv_udp/close_udp_socket over a bounded MAX_PUBLIC_UDP_SOCKETS slot table with generation-bumped handles).
  • New standalone Rust staticlib crate libcapos-posix/ (NOT a workspace member, mirrors the libcapos pattern) producing libcapos_posix.a. Provides:
    • per-process static-array fd table (MAX_FDS = 32), per Open Question §5;
    • single-thread errno cell exposed via __errno_location(), per Open Question §4;
    • socket(AF_INET, SOCK_DGRAM, 0) / sendto / recvfrom / close() over UdpSocket and clock_gettime(CLOCK_MONOTONIC, ...) / gettimeofday(&tv, NULL) over Timer (single-shot Timer.now() calls in v0; long retry loops handled by the consumer).
    • C headers under libcapos-posix/include/capos/posix/: errno.h, sys/socket.h, time.h, unistd.h.
    • Reuses libcapos’s installed runtime through a renamed extern crate libcapos_::runtime::with(...) (the underscore avoids colliding with libcapos’s C-side capos_* exports). libcapos was promoted to crate-type = ["staticlib", "rlib"] to support this.
  • Vendored vendor/dns-c-wahern/ (William Ahern dns.c at rel-20160808, commit 4ec718a77633c5a02fb77883387d1e7604750251, MIT). Mirror-as-is; only src/dns.c and src/dns.h retained alongside LICENSE and README.md per the WASI W.1 vendoring discipline. See vendor/dns-c-wahern/VENDORED_FROM.md.
  • New C smoke demos/posix-dns-resolver/main.c that links against libcapos.a + libcapos_posix.a and drives a hand-rolled DNS A query for example.com to QEMU slirp DNS at 10.0.2.3:53. The binary uses the vendored dns.c as a reference but does NOT compile dns.c whole into the smoke. Rationale: dns.c expects a POSIX header set (signal.h, fcntl.h, poll.h, netinet/in.h, arpa/inet.h, netdb.h, sys/select.h, sys/un.h) substantially wider than the v0 libcapos-posix surface. Compiling dns.c whole would require either patching the vendored tree or shipping a much larger POSIX header surface than this slice scopes; documented as follow-on work in VENDORED_FROM.md.
  • New focused-proof manifest system-posix-dns.cue (own CUE package, imports the shared capos.local/cue/defaults package per the slice-3 defaults pattern) granting the smoke console, network_manager, and timer.
  • New Makefile target run-posix-dns-smoke and harness tools/qemu-posix-dns-smoke.sh. The smoke prints [posix-dns-resolver] resolved example.com -> <addr> (an arbitrary IPv4 dotted-quad slirp returns from upstream resolution) and exits cleanly. Verified at 2026-05-05 21:21 UTC: make run-posix-dns-smoke returns 0 with resolved example.com -> 104.20.23.154 in the kernel log; make run-net regression keeps S.11.2.7 / S.11.2.8 hostile-smoke proof lines green.

Depended on Phase P1.1 and Phase P1.2 Phase A.

Phase P1.3 – Pipe capability + fork-for-exec scaffolding

Closed 2026-05-07 09:55 UTC. make run-posix-pipe-smoke is the load-bearing gate; it drives the dash-shaped pipeline pattern end to end through the kernel Pipe capability and the recording-shim fork+execve path.

What landed:

  • Schema: new Pipe interface (read / write / close / isClosed) and ProcessSpawner.createPipe(bufferBytes). The generated tools/generated/capos_capnp.rs baseline was refreshed through the canonical capnpc step and make generated-code-check passes.
  • Kernel: kernel/src/cap/pipe.rs ships the bounded SPSC byte ring with EOF-on-close semantics, kept symmetric with the UDP recv ceiling (4 KiB). Each cap half stores an Arc<PipeShared> plus a direction; close on one side flips the shared closed flag and the per-tick poll completes the peer. kernel/src/cap/mod.rs and kernel/src/sched.rs integrate the new poll alongside the existing network poll.
  • Kernel: kernel/src/cap/process_spawner.rs gains handle_create_pipe, mirroring the UDP-socket result-cap transfer pattern. The existing spawn Move-grant path is reused; no changes to the spawn ABI.
  • Userspace runtime: capos-rt/src/client.rs exposes typed PipeClient (read/write/close/isClosed and matching *_wait) plus ProcessSpawnerClient::create_pipe / create_pipe_wait and the CreatePipeResult projection of the two transferred halves.
  • libcapos-posix: new pipe.rs and process.rs modules. The fd table grows a FdBacking::Pipe variant; dup_for_dup2() clones the OwnedCapability<Pipe> so an aliased fd does not release the underlying cap until the last fd drops. pipe, read, write, dup, dup2, fork, execve, waitpid, _exit, and posix_inherit_stdio are exposed via C ABI. dup2 and close inside a fork-recording window route through process::maybe_record_dup2 / maybe_record_close rather than mutating the parent fd table; execve consumes the recorded actions as stdio_<N> spawn grants – Pipe/TerminalSession forwarded Move, Console/Directory/File forwarded Raw over their Copy-transferable caps – and returns the synthetic child pid as its own return value so the user pattern becomes int spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the caller; do NOT _exit because the pseudo-child branch is still the parent process */; child = spawn_pid; (no setjmp / longjmp involved – earlier iterations longjmp’d back to the fork() call site, which dropped back into a returned-and-deallocated stack frame and was undefined behaviour). After a successful spawn, each Move-granted source fd slot is replaced with a FdBacking::Moved sentinel and the underlying OwnedCapability is forgotten so the parent does not queue a stale CAP_OP_RELEASE for the moved cap_id; a subsequent close(src) on the parent side (the dash-shaped pattern’s “I no longer hold the write end”) removes the sentinel without a kernel round trip. A Raw/Copy grant (Console/Directory/File) is non-destructive: the parent’s own fd is restored intact, since the kernel handed the child a separate alias. The child side adopts each stdio_<N> grant back into slot N by interface id (fd::inherit_stdio_grants), wrapping an inherited directory fd through fdopendir(); proof make run-posix-execve-inherit-smoke.
  • libcapos-posix successor surface: direct posix_spawn and posix_spawn_file_actions_init / destroy / adddup2 / addclose reuse the same action-replay helper behind the recording-shim execve path. Recording-shim execve now delivers argv through the private posix_argv Pipe grant described above. Direct posix_spawn still accepts argv and envp for source compatibility but does not deliver them to the child yet; direct-spawn argv/environment remain empty until a typed LaunchParameters / environment grant exists.
  • libcapos-posix stdio successor: landed at commit aa6a56d7 (2026-05-13 11:03 UTC). fd 1 and fd 2 initialize to the granted Console cap when present, but only after any stdio_<N> recording-shim grants have been adopted into their slots. fd 0 is not synthesized from Console; read(0, ...) stays closed unless a real stdin backing is granted. make run-posix-stdio-smoke prints distinct stdout/stderr markers through POSIX write and proves the no-stdin refusal path.
  • Demo: demos/posix-pipe-shim/main.c (parent) and demos/posix-pipe-child/main.c (child). The parent pipes, forks, the child-pseudo-context dup2()s the write end onto STDOUT_FILENO, closes both pipe fds, and execve()s the child; the child calls posix_inherit_stdio(), writes “hello via pipe” to fd 1, closes it, and exits 0; the parent drains the read end through read() until EOF, waitpid()s, and emits [posix-pipe] read 14 bytes: hello via pipe.
  • New manifest system-posix-pipe.cue (own CUE package, imports the shared capos.local/cue/defaults package). New Makefile target run-posix-pipe-smoke and harness tools/qemu-posix-pipe-smoke.sh. Verified 2026-05-07 09:55 UTC: make run-posix-pipe-smoke returns 0 with the proof line in the kernel log; make run-smoke and make run-spawn regressions stay green.
  • Schema serial-surface coordination: held the surface for the P1.3 schema additions and released on merge.

Open Question §6 closed: Variant A (recording shim) is the adopted answer. fork() records “next exec is the real spawn” in TLS and returns 0; the shim translates inter-call dup2/close into spawn-grant Move actions; and execve() performs the spawn and returns the synthetic child pid as its own return value (the caller forwards the pid to the parent flow’s waitpid via int spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the parent's normal error path; the pseudo-child branch is still the parent process so do NOT _exit */ ; child = spawn_pid;). Earlier iterations used setjmp / longjmp to fake the fork-return-twice semantic; that approach was replaced because the longjmp jumped back into fork()’s already-returned (and deallocated) stack frame, which is undefined behaviour. Variant B (patched-port posix_spawn only) is rejected for v0. Variant A still requires a small dash-side patch – the four-line “capture spawn_pid; bail on -1; assign back to child” snippet at each fork-exec site – because successful execve() now returns the synthetic pid where unmodified dash assumes execve only returns on failure. That patch surface is much narrower than Variant B’s “consolidate every fork+dup2+exec into a single posix_spawn call with explicit posix_spawn_file_actions” rewrite, which is why Variant A is the chosen v0 path. A 2026-05-13 successor exports the direct posix_spawn() surface over the same code path. Recording-shim execve argv now travels through a private posix_argv Pipe grant; direct posix_spawn argv/envp remain ignored until LaunchParameters / environment support lands.

Open Question §9 closed: kernel-allocated bounded SPSC ring with EOF-on-close, exposed as two cap halves sharing Arc<PipeShared>. Reader-closed surfaces bytesWritten = 0 to the writer (the EPIPE-equivalent chosen to avoid expanding the kernel ExceptionType vocabulary). Writer-closed surfaces eof = true to the reader after the buffered bytes drain. The shared MemoryObject + userspace ring alternative is rejected because EOF across process exits and bounded waiter wake semantics need kernel-side state anyway.

Depended on Phase P1.1.

Phase P1.4 – dash vendoring + libcapos-posix file/dir/stdio/env/printf surface

Status (2026-05-23 07:52 UTC): in flight. Slice 3 (libcapos-posix FdBacking File / Directory / Terminal variants + smoke) closed at commit ae58f936; Slice 4 (absolute-path resolver over a bootstrap-granted root Directory cap plus functional open()/opendir()) landed at commit 94b29177; the posix-file-directory-client-capos-rt closeout at commit f97d9833 (2026-05-23 06:23 UTC) adds functional lseek(), lazy readdir() over Directory.list, and the focused make run-posix-file proof. Slice 7 adds the focused printf/string C library subset and proves it with make run-posix-printf. Slices 8/9 add signal-registration stubs plus Timer-backed time() / nanosleep() / sleep() and prove them with make run-posix-signal-time. The kernel-side capability surface required for the v0 dash smoke landed under Storage and Naming Phase 3 slices 1-3: RAM-backed File (kernel/src/cap/file.rs), Directory (kernel/src/cap/directory.rs), and Store / Namespace (kernel/src/cap/store.rs, kernel/src/cap/namespace.rs) CapObjects, plus the matching KernelCapSource::file / directory / store / namespace manifest grant sources, are sufficient backing for the “read-only Namespace cap rooted at a tiny in-rodata pseudo-fs” the smoke described in §Validation smoke needs. Earlier proposal drafts called Phase P1.4 “blocked on the Namespace + File cap surface”; that framing is stale – the open work has moved out of the kernel and into the libcapos-posix userspace surface, the dash port itself, and the smoke harness. A userspace Store / Namespace service over a real backing store (the remaining Phase 3 item in the storage proposal) is not a prerequisite for the v0 dash smoke; the kernel bootstrap-grant Directory cap is the v0 backing.

The concrete checklist lives in docs/proposals/posix-adapter-proposal.md Task 4 and the long-form decomposition is in docs/backlog/posix-adapter-dash-port.md. This proposal records the phase shape and the substantive outstanding work groups; the backlog file owns per-step ordering.

Current closed surfaces and outstanding work groups, all in userspace and userspace-adjacent harness surface (no further kernel cap work needed for the v0 smoke):

  • dash vendoring + patch. Closed (posix-p1-4-dash-vendor, 2026-05-24 19:40 UTC). dash v0.5.13.4 is vendored mirror-as-is (full upstream tree, byte-identical) under vendor/dash/ with vendor/dash/VENDORED_FROM.md. The per-call-site Variant A patch (capture execve()’s synthetic pid return value, bail on -1, assign back to child) – the shape recorded in Open Question §6 and the Decisions §6 entry – lives under vendor/dash/patches/ as two .patch files: 0001-execve-return-synthetic-pid.patch propagates the synthetic pid up through tryexec()/shellexec() (the execve() call site), and 0002-vforkexec-adopt-synthetic-pid.patch adopts it at the vforkexec() fork-exec site. Cumulative diff 45 changed lines (< 50). dash’s inter-call dup2 / close between fork and execve already records through libcapos-posix and needs no per-call patching. Design evidence only: nothing compiles/runs at this slice; the C-build and shell-smoke slices below prove the behavior.
  • C-build pipeline for vendored multi-file C sources. Landed (posix-p1-4-c-multifile-build). The existing c-build helper compiles single-file demos/*/main.c smokes against libcapos.a + libcapos_posix.a. dash is a multi-translation-unit C codebase; the Makefile gained the reusable capos-c-multitu-elf define (instantiated with $(eval $(call ...))) that compiles a list of vendored .c files each to an object and links them with libcapos_posix.a + libcapos.a into a userspace ELF without dragging in an external libc. Toolchain remains clang --target=x86_64-unknown-none-elf -nostdlib -static per Open Question §11 and the libcapos C-substrate plan. Proven by the two-TU demos/c-multifile/ demo and make run-c-multifile, which asserts a cross-TU computed line.
  • dash build pipeline (autotools config.h + host table generators). Landed (posix-p1-4-dash-build-pipeline, 2026-05-26 05:11 UTC). The generic multi-TU rule runs no configure and no host generators, so the dash-specific prerequisites live under vendor/dash/capos/: a pinned config.h (derivation + host-table caveat in vendor/dash/VENDORED_FROM.md) and gen-tables.sh, which stages a patched source copy (keeping vendor/dash/src byte-identical) and runs dash’s six host generators (mktokens, mksyntax, mknodes, mksignames, mkbuiltins, mkinit). The Makefile dash target funnels dash_CFILES + the five generated tables through capos-c-multitu-elf against libcapos_posix.a + libcapos.a in the -nostdinc sysroot mode, producing target/dash/dash.elf (static, 0 undefined symbols, both Variant A fork-exec patches compiled in). make dash proves build + link; the runtime QEMU proof is the dependent shell smoke below.
  • File / directory I/O surface in libcapos-posix. Typed FileClient and DirectoryClient wrappers landed in capos-rt/src/client.rs at commit 747a8611 (2026-05-16 20:07 UTC); FILE_INTERFACE_ID / DIRECTORY_INTERFACE_ID constants are already in capos-config/src/lib.rs. Slice 3 added the FdBacking::File / FdBacking::Directory / FdBacking::Terminal variants in libcapos-posix/src/fd.rs at commit ae58f936 and the matching smoke. The current surface implements open, close, read / write (joining the existing pipe/UDP read/write dispatch), lseek, opendir, readdir, and closedir; make run-posix-file proves these through a live POSIX C process. File-backed fds now store the POSIX access mode from open(): read rejects O_WRONLY, write rejects O_RDONLY, ftruncate requires a write-capable fd, and O_RDONLY | O_TRUNC is denied before the resolver can reach Directory.open. dup / dup2 preserve the stored mode, and the recording-shim execve path grants a private posix_fd_rights metadata pipe so inherited File fds reconstruct the same attenuation in the child fd table. make run-posix-open-smoke and make run-posix-file carry the same-process denial checks; make run-posix-execve-inherit-smoke proves the recording-shim inheritance path preserves read-only and write-only File fd modes.
  • Path resolver over a root Directory cap. A resolver in libcapos-posix/ walks a path through a bootstrap-granted root Directory cap and returns File / Directory result caps via existing IPC cap-transfer machinery. A v0 per-process current-working- directory string (getcwd / chdir, libcapos-posix/src/cwd.rs) plus cwd-relative resolution for open / opendir / stat / access / unlink / mkdir landed (make run-posix-cwd); chdir stores only the normalized path string and drops the validated cap, so cwd inheritance across spawn is still deferred. .. is not collapsed: escape is prevented by the kernel Directory cap’s lack of a parent edge, not a resolver clamp. The Namespace / Store resolver shape remains documented for a future real filesystem service.
  • Remaining file metadata calls. stat, fstat, access, and unlink remain fail-closed stubs until a dash call site requires the stable struct stat and remove-contract shape.
  • Stdio over TerminalSession. FdBacking::Terminal adopting the bootstrap-granted TerminalSession cap as fd 0 / fd 1 / fd 2 when the manifest supplies one. Implements Open Question §7’s decision (canonical fd 0 backing = TerminalSession). The existing pipe-backed inheritance path stays in place for posix_spawn-driven pipeline children. posix_inherit_stdio() becomes a one-shot adopter for the terminal grant too.
  • Env vector + getenv / setenv / putenv. Per-process env vector in libcapos-posix, populated at startup from manifest rodata (a bounded env grant on initConfig.init, mirroring the wasiEnv :Text bounded grant the WASI host adapter already uses for Preview 1 environ_get). The eventual typed LaunchParameters cap remains a follow-on; the v0 env source is the manifest rodata grant.
  • printf / string subset. Implemented in libcapos-posix: printf / fprintf / vprintf / vfprintf / snprintf / vsnprintf; memcpy / memmove / memset / memcmp; strlen / strcmp / strncmp / strchr / strrchr / strcpy / strncpy / strcat / strncat / strdup; atoi / strtol / strtoul; and the ctype subset (isspace / isdigit / isalpha / isalnum / tolower / toupper). Formatted output is bounded to the documented v0 integer / string conversions and width/precision caps; floating-point, fopen, stream buffering, and locale stay out of scope. make run-posix-printf proves the surface from a live capOS C process. libcapos already exports malloc / free / calloc / realloc for C consumers.
  • Signal stubs. Implemented in libcapos-posix: signal / sigaction validate and store handlers in a per-process table but never deliver them; kill fails closed with EPERM because this POSIX surface has no target ProcessHandle authority; raise fails closed with ENOSYS because self-delivery is not implemented. make run-posix-signal-time proves the documented behavior from live capOS C process output. Real SIGCHLD / SIGTSTP delivery and job control remain out of scope.
  • Time additions. Implemented in libcapos-posix: time(2), nanosleep, and sleep reuse the existing Timer cap path already used by clock_gettime / gettimeofday. make run-posix-signal-time proves monotonic-since-boot time() output, bounded nanosleep(), and one-second sleep() from live capOS C process output.
  • Identity stubs. Implemented: getpid returns the stable capos-rt bootstrap pid for the current process, while the recording-shim child pid allocator stays above the caller’s pid for the waitpid table; getuid / getgid return the hardcoded single-identity uid/gid 0. make run-posix-identity proves a parent and fork/exec child observe distinct process-visible pids from live capOS C code.
  • isatty / getppid (closed 2026-05-24 08:47 UTC). Both are pure-userspace dash prerequisites over the existing fd table – no kernel, cap, IPC, or schema change. isatty(fd) returns 1 for an FdBacking::Terminal slot, 0 with errno = ENOTTY for any other live backing, and 0 with errno = EBADF for an empty/closed slot. getppid() returns the v0 single-identity parent constant (1); no kernel parent handoff exists yet, so it is an honest stub alongside the getpid single-identity path. make run-posix-isatty proves isatty(0/1/2)=1 over bootstrap-granted TerminalSession stdio, isatty(pipe_fd)=0 errno=ENOTTY, and getppid=1 from live capOS C process output.
  • fcntl (closed 2026-05-24 09:23 UTC). A pure-userspace dash prerequisite over the existing fd table – no kernel, cap, IPC, or schema change. F_DUPFD/F_DUPFD_CLOEXEC duplicate into the lowest free slot >= arg over the same dup_for_dup2 alias path dup/dup2 use; F_GETFD/F_SETFD round-trip a per-fd FD_CLOEXEC byte; F_GETFL reports a stable access mode (O_RDWR for Console/Udp/Pipe/Terminal, the stored open() mode for File, O_RDONLY for the read-only Directory); F_SETFL fails closed with EINVAL when the argument carries O_NONBLOCK (the v0 ring calls block with WAIT_FOREVER, so there is no non-blocking mode to switch into), except on UDP socket fds, where it is accepted-and-ignored for the vendored dns.c snapshot whose documented contract already drives deadlines from userspace; other status bits (e.g. O_APPEND) stay accept-and-ignore. Unknown cmd yields EINVAL and a closed/out-of-range fd yields EBADF. CLOEXEC is enforced at recording-shim execve time: the full-fd-table inheritance walk skips a slot whose flags byte carries FD_CLOEXEC unless an explicit recorded dup2 named that child slot. make run-posix-fcntl proves the F_DUPFD,10 relocation, the FD_CLOEXEC round-trip, F_GETFL=O_RDWR for a pipe, and the EBADF/EINVAL error paths from live capOS C process output.
  • Manifest + smoke harness (landed 2026-05-27 09:36 UTC). system-posix-shell.cue grants dash a TerminalSession (stdio), a bootstrap RAM Directory (root), a ProcessSpawner, and a Timer. New demos/ls-shim/ one-binary listing helper wraps the inherited directory fd with fdopendir() (the smoke’s only allowed spawn target). make run-posix-shell-smoke + tools/qemu-posix-shell-smoke.sh feed a heredoc into the shell’s fd 0 – the shell creates two RAM-root entries, opens the directory as fd 3 (exec 3< /), runs /ls-shim, and prints done – and assert the alpha/beta entry lines, done, two clean-exit log lines, the scheduler halt line, and clean QEMU exit. The ls-by-bare-name vs /ls-shim PATH-stat workaround uses the slash-bearing path, which the recording-shim spawn maps to the manifest binary name by basename. Stretch: extend the smoke to cat foo | grep bar end-to-end, exercising the P1.3 Pipe primitive through a shell pipeline. Stretch closed (2026-05-27, posix-dash-pipeline-exec-reconcile): dash patch 0004-pipeline-evexit-recording-shim.patch reconciles the EV_EXIT in-place shellexec path with the recording shim (every evalpipe element takes that path, which the original patch set had left unreconciled), and libcapos-posix gained wildcard waitpid(-1)/wait3 reaping. make run-posix-shell-smoke now drives the pipeline (match bar here filtered through, four clean child exits). See docs/backlog/posix-adapter-dash-port.md Slice 14 and vendor/dash/VENDORED_FROM.md.
  • read builtin over fd 0 (landed 2026-05-31 20:35 UTC, posix-dash-read-builtin-terminal-line). Proves dash’s read VAR builtin consuming interactive input off its fd 0 TerminalSession cooked-mode line discipline – the one stdin path every prior smoke skipped (run-posix-shell-smoke feeds no stdin). No dash patch or libcapos-posix change was needed: dash’s tcgetattr(0)-derived canonical buffering takes the plain read(0, ...) branch, which the FdBacking::Terminal adapter satisfies one line at a time. make run-posix-read-builtin (system-posix-read-builtin.cue + tools/qemu-posix-read-builtin-smoke.sh) echoes back the harness-fed lines got=[hello world] / raw=[raw\back\slash] (the second under read -r, proving the no-escape path). The harness handshakes each feed on dash’s own terminal output because the kernel line discipline has no inter-read input buffer and the UART carries no EOF. See docs/backlog/posix-adapter-dash-port.md Slice 18.
  • Open question closures (Slice 1, closed 2026-05-24 00:53 UTC). Open Question §1 (dash 0.5.13.x candidate) and §7 (fd 0 backing = TerminalSession) are promoted to final decisions in this proposal’s “## Open Questions” section ahead of vendoring.

Recommended dispatch ordering: P1.1 -> P1.2 Phase A (schema + client, landed) -> P1.2 Phase B (kernel UDP path + dns.c, landed) and P1.3 (Pipe cap + fork-for-exec, landed) in either order, since they no longer contend on the schema serial surface -> P1.4 dash-port successors. P1.4 itself does not touch schema/capos.capnp and so does not contend on the shared schema serial surface.

Trust Boundaries

BoundaryNative capOS servicePOSIX-shaped C binary on capOS
Authority sourceProcess CapSetProcess CapSet projected through libcapos-posix fd table
Memory isolationPage tablesPage tables (no wasm-style sandbox; libc has no extra runtime check)
Code integrityW^X + NXW^X + NX
Cap forgeryKernel-owned CapTableSame; the fd table is per-process userspace state, not authority
Resource limitsKernel quotasKernel quotas; ulimit is ENOSYS
Side channelsHardware-level (Spectre etc.)Same hardware level

A POSIX binary on capOS is more constrained than on Linux, not less. The adapter provides familiar function signatures, not familiar authority.

Validation

The first ports are not complete until they have QEMU evidence:

  • A POSIX binary prints through a granted Console / TerminalSession.
  • The same binary cannot use write to a fd it was not granted, cannot open() a path outside its preopened namespaces, and cannot call an unimplemented POSIX function without receiving ENOSYS.
  • A missing or wrong-interface cap lookup returns the documented errno (not a host-side panic, not silent success).
  • An owned result cap is released deterministically when the binary exits.
  • Each demo binary exits cleanly and does not wedge the kernel.

Host tests should cover errno mapping and the per-process fd table once those pieces are pure enough to test outside QEMU. Do not claim “POSIX adapter works” from host tests alone; the useful behavior is authority- shaped POSIX execution in capOS.

Open Questions

The following design decisions are documented as open questions because the planning phase recommends an answer but has not yet committed to one.

  1. POSIX shell candidate. Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC): dash 0.5.13.x, vendored at a pinned tag under vendor/dash/. Rationale: smallest established POSIX-strict shell (~13 kSLOC, readable in full by the porting team), no readline/termcap dependency (it talks to whatever fd 0 gives it), and a single-purpose /bin/sh posture that does not accidentally validate Bash extensions libcapos-posix does not implement. Rejected: busybox ash (heavier embedded framework cost), oksh (ksh-superset, larger surface than v0 needs), toysh (incomplete upstream), and a custom Rust shell (it defeats the purpose of porting a real C program; the native shell/ capos-shell already exists). Vendoring, the Variant A patch, the multi-TU C build, and the shell smoke are later P1.4 slices (11-14).
  2. DNS resolver candidate. Decided (P1.2 Phase A, 2026-05-05 18:02 UTC): dns.c (William Ahern), vendored at a pinned tag under vendor/dns-c-wahern/. Rationale: single-file MIT C (~10 kSLOC .c plus header), no Cargo/CMake build system, no configure script, no required I/O model (caller plugs the socket layer), and a track record as a reusable resolver core in production software outside libc. The license is capOS-compatible and does not force a transitive libc port. Rejected: musl libresolv – tied to the rest of musl’s headers, build, and __syscall shape; pulling it in either drags musl as a transitive dependency or forces a per-symbol carve-out that defeats the “single .c plus header” cost profile. Rejected: c-ares (configure-driven, ~3x larger, more invasive port). Rejected: GNU adns (GPL-2.0+ license question). Rejected: pure-Rust trust-dns (defeats the C-port purpose).
  3. libcapos versioning and naming. The C library is just libcapos (mirrors the Rust capos-rt). Open question: should the POSIX layer be libcapos-posix (current recommendation), or a different name that avoids any Rust-side framework name collision? The C-side naming is settled; the POSIX-layer name remains an open question pending confirmation that no Rust framework will reuse the libcapos-posix identifier. Working answer: keep libcapos-posix for the POSIX layer.
  4. POSIX errno representation. Decided (P1.2 Phase A, 2026-05-05 18:02 UTC): per-thread errno cell exposed via __errno_location() – the standard POSIX shape. Storage lives in libcapos-posix, owned by a thread-local cell accessed through a stable extern "C" int *__errno_location(void); function so vendored ports (dns.c, dash, future C software) compile against errno exactly as on Linux/musl. Rust internals keep the typed CapError/CapException shape; one bidirectional mapping at the C boundary writes the int value into the TLS cell so internal callers cannot invent unmapped values. Rejected: per-fd error field – breaks source compatibility with every POSIX program that reads errno after read/recvfrom/open, requires every vendored port to be patched, and provides no isolation gain over the per-thread cell that the cap layer already exclusively writes.
  5. File descriptor table location. Decided (P1.2 Phase A, 2026-05-05 18:02 UTC): static-array fd table in libcapos-posix with a small fixed cap (target: 32 open fds per process for v0). Rationale: the lookup is one bounds-check + one array index in userspace with no syscall; the kernel keeps zero knowledge of fds, so capOS authority remains exactly the per-process CapTable and is not duplicated in a parallel kernel-side fd map. The fixed cap matches the surfaces dns.c (single fd) and a v0 shell port (a handful of stdio + pipe fds) actually exercise. Rejected: capability-table-backed fd map that resolves fd numbers through the process cap table – larger blast radius (fd churn would touch the kernel cap table on every dup/close), and the cap-table object id is already a userspace-visible handle through OwnedCapability, so a separate dense fd index in userspace is the right layer. The 32-fd cap can grow later (or migrate to a sparse representation) if a real consumer needs more, without changing the kernel surface.
  6. Fork policy. Decided (P1.3, 2026-05-07 09:55 UTC; refined 2026-05-07 10:30 UTC to drop setjmp/longjmp): Variant A – the recording shim. fork() records “next exec is the real spawn” in TLS and returns 0 unconditionally. dup2() and close() calls between fork() and execve() route through process::maybe_record_dup2 / maybe_record_close and are not applied to the parent fd table. execve() consumes the recorded actions, dispatches ProcessSpawner.spawn() with the matching pipe halves moved into the child as stdio_<dst> grants, parks the resulting OwnedCapability<ProcessHandle> in a per-process table, and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX, where execve only returns -1 on failure). The user pattern becomes int spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the parent's error path; do NOT _exit because the pseudo-child branch is still the parent */ ; child = spawn_pid;. After a successful Move-grant spawn the parent’s source fd slot is replaced with a FdBacking::Moved sentinel so a subsequent close(src) (the dash-shaped pattern’s “I no longer hold the write end”) removes the sentinel without a kernel round trip. The earlier setjmp/longjmp design longjmp’d back to fork()’s call site after execve() had returned – the saved jmp_buf RSP/RIP pointed into fork()’s stack frame, which was deallocated when fork() first returned, so the longjmp resumed inside a stale frame whose memory had already been reused by dup2/close/execve. A targeted dash patch is still required for the v0 contract: execve() returns the synthetic pid on success, where unmodified dash assumes execve() only returns on failure (and falls into its post-exec error path). Variant A keeps that patch surface narrow – the change is the four-line “capture spawn_pid; bail on -1; assign back to child” snippet shown above per fork-exec call site, not a wholesale rewrite of the fork-dup2-exec pattern – and dash’s inter-call dup2/close still record into the spawn grants without per-call patching. Rejected: Variant B (patched-port posix_spawn only) requires the port to consolidate every fork+dup2+exec sequence into a single posix_spawn call with explicit posix_spawn_file_actions, a much wider patch surface. A 2026-05-13 successor now exports direct posix_spawn() over the same execve-backed action replay. Recording-shim execve argv now travels through a private posix_argv Pipe grant; direct posix_spawn argv/envp remain ignored until LaunchParameters / environment support lands.
  7. fd 0 backing for the shell. Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC): the canonical fd 0 / 1 / 2 backing for the v0 dash smoke is TerminalSession – the natural mapping (read line + cooked-mode line discipline already exists in kernel and migrates to userspace at networking Phase C). For the DNS resolver fd 0 is unused and stays unmapped. The backing is realized by the FdBacking::Terminal variant in libcapos-posix/src/fd.rs plus posix_inherit_stdio() adopting the bootstrap-granted TerminalSession cap, mirroring the existing pipe-inheritance path; that implementation already shipped under P1.4 Slice 5 and is proven by make run-posix-stdio-terminal-smoke. This slice only records the backing choice as final.
  8. UDP cap surface scope. Decided (P1.2 Phase A, 2026-05-05 18:02 UTC): four-method blocking shape that mirrors the existing TCP cap pattern, with the wait deadline owned by the ring client (not the method parameter list). Methods:
    • NetworkManager.createUdpSocket(localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16) – bind a UDP socket to the given local (addr, port) (localAddr empty selects the configured interface; localPort = 0 selects an ephemeral port). The result cap is transferred via socketIndex in the CQE result-cap list, matching connectTcp.
    • UdpSocket.sendTo(addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32).
    • UdpSocket.recvFrom(maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data)blocking, no in-method timeout. Same CQE-on-completion shape as TcpSocket.recv: the kernel parks the SQE until a datagram arrives. The caller bounds the wait through the existing RingClient::wait(call_id, timeout_ns) mechanism; dns.c-style retry/deadline loops drive that bound from userspace. If the caller wants to abort a parked recvFrom early, it issues close() on the socket; the parked completion then returns a Disconnected-class CapException. The v0 surface deliberately does not introduce a new Timeout exception class, since none exists today (ExceptionType covers only failed, overloaded, disconnected, unimplemented) and inventing one for a single method would expand the kernel error surface ahead of any consumer that needs to distinguish wait-expiry from generic disconnect.
    • UdpSocket.close() -> (). Rationale: the blocking shape maps directly onto dns.c’s existing retry/timeout loop (dns.c does its own resend and deadline tracking, then issues a bounded blocking read backed by the ring wait), so the v0 port plugs in without a separate readiness/poll surface. The shape also reuses every primitive already present for TCP – ring-side cap_enter parking, transferred result caps, client-side RingClient::wait deadline – so the kernel UDP path in P1.2 Phase B is a near-mirror of the TCP path. Rejected (deferred): readiness/poll-style recvFrom – the cap surface decision (one-shot wait vs an event stream over an Endpoint) is itself unsettled, has no live consumer, and adding a second wait shape now would force every port to choose. Add a separate readiness method (or a generic Pollable cap) when a real consumer needs it, not before. Rejected: per-method timeoutNs parameter – creates two competing deadlines (the in-method timeout and the ring wait) that race on the same call, would require either inventing a new Timeout exception class or overloading Disconnected ambiguously, and is redundant with the ring wait the client already issues.
  9. Pipe cap design. Decided (P1.3, 2026-05-07 09:55 UTC): kernel-allocated bounded SPSC ring (4 KiB ceiling, default to the maximum) with EOF on close. The two halves share an Arc<PipeShared> and store their direction; close on one side flips the matching closed flag and the per-tick poll completes the peer. Both halves implement the same Pipe interface (read / write / close / isClosed); the kernel rejects wrong-direction calls with a failed exception. Reader-closed surfaces bytesWritten = 0 to the writer (the EPIPE-equivalent chosen to avoid expanding the kernel ExceptionType vocabulary). Writer-closed surfaces eof = true to the reader after the buffered bytes drain. **Rejected: shared MemoryObject
    • userspace ring** because EOF across process exits and bounded waiter wake semantics need kernel-side state anyway, and the userspace path would still need a kernel cap to coordinate close races.
  10. argv / envp source. This proposal assumes a future LaunchParameters cap delivers argv / envp through a typed cap. Until that cap lands, libcapos-posix can carry argv / envp via a fixed well-known cap or rodata blob. Confirm gate-on-LaunchParameters versus ship-stub.
  11. Linker / toolchain for C consumers. Recommended: clang --target=x86_64-unknown-none-elf -nostdlib -static, link against libcapos.a (and optionally libcapos-posix.a), reuse the existing capos-rt linker script. Confirm clang vs gcc and whether the track ships a shared cc-glue Cargo crate or a Make rule invoking cc directly.
  12. Vendoring policy. In-tree vendor/dash/, vendor/dns-c-wahern/ versus out-of-tree submodule versus separate repo. Working answer: in-tree vendoring with pinned tags, mirroring the planned vendor/piccolo-no_std/ shape from the Lua track.
  13. Audit / measure-mode interaction. The libcapos-posix wrappers must not break measure mode (the measure feature). Most wrappers only call libcapos, which only calls capos-rt, which is already measure-mode-clean, so this should be free; confirm whether the track adds a make run-measure smoke for one libcapos-posix binary as a regression gate.

Relationship to Other Proposals

  • Userspace Binaries owns the broader native-binary, language, and POSIX-adapter roadmap. This proposal supersedes Part 4: POSIX Compatibility Adapter of that proposal with the full POSIX adapter design.
  • Programming Languages is the reader-facing summary of language support. The C row records the shipped libcapos.a + libcapos_posix.a surface (P1.1 + P1.2 + P1.3, plus the 2026-05-13 posix_spawn successor and Console-backed stdio slice). The POSIX-shaped software row cross-links this proposal as the long-form design source and records the P1.4 dash-port block on Namespace + File caps.
  • Networking defines NetworkManager, TcpListener, and TcpSocket and defers UDP. The DNS resolver port in Phase P1.2 adds the UdpSocket cap surface; the TCP cap surface is reused unchanged.
  • Storage and Naming defines the Directory / File / Store / Namespace surfaces that the shell port consumes. Phase 2/3 of that proposal gates the dash file I/O surface.
  • Service Architecture defines the future Resolver cap that the resolver port eventually exports.
  • Shell covers the native capos-shell. The POSIX shell port is for porting validation and does not replace capos-shell.
  • WASI Host Adapter is the parallel untrusted-portable execution path. POSIX adapter targets trusted source-recompiled C; WASI adapter targets sandboxed wasm modules. Both share the per-process fd-table and per-import authority pattern.
  • Lua Scripting is the capability-scoped trusted-script path; PUC Lua’s native build assumes a C substrate, so it eventually consumes libcapos.

Proposal: POSIX fork/execve Full-fd-table Inheritance

Make the capOS POSIX fork+execve recording shim inherit the parent’s full live fd table by default, honoring close-on-exec, so unmodified POSIX software (the dash port is the headline case) gets working stdin/stdout/stderr and an inherited cwd in its children without the application explicitly dup2-ing every descriptor. This reverses the v0 explicit-grant-only default, which is the inverse of real POSIX semantics, while keeping the capability model’s no-ambient-authority guarantee.

Why this is needed

capOS has no real fork (no address-space copy, no shared open-file descriptions). fork+execve is emulated by a recording shim (libcapos-posix/src/process.rs): fork() opens a recording window and returns 0; dup2/close between fork and execve are recorded as deferred fd actions; execve() replays them against a virtual child fd-view and forwards the resulting fds as CapGrants through ProcessSpawner.spawn. The child reconstructs its fd table from the named stdio_<N> grants (libcapos-posix/src/fd.rs inherit_stdio_grants).

The v0 contract is explicit-grant-only: in spawn_path_with_actions, only fd slots a recorded dup2/close touched become grants; untouched live slots are deliberately not inherited (the touched array gate). This is the inverse of POSIX, where a child inherits the parent’s entire fd table across fork+execve – every descriptor not marked O_CLOEXEC/FD_CLOEXEC – sharing the underlying open file descriptions.

The consequence is decisive for arbitrary POSIX software. Vanilla dash compiled JOBS=0 does not dup2 stdio for a foreground external command – only the FORK_BG path in vendor/dash/src/jobs.c (forkchild) manipulates fds. So dash -> ls-shim replays an empty action list and hands the child an empty CapSet: no stdout to print to, no Directory to list. This is not a dash bug; it is correct POSIX behavior (the child is expected to inherit dash’s stdio). The v0 shim’s inverted default breaks every POSIX program that relies on inheritance, which is essentially all of them.

The project directive is explicit: do not solve this with per-app dash patches (posix-p1-4-dash-shell-smoke). A fd-inheritance fix that must be re-applied to every POSIX program is not POSIX compatibility. The correct fix is to make the recording shim inherit the full fd table by default, like real POSIX, reconciled with the capability model.

Current state vs target

AspectRealized (done 2026-05-27)Notes
Inheritance defaultfull-table: every open slot forwards unless FD_CLOEXEC or a non-forwardable backingspawn_path_with_actions walks every open parent slot; recorded dup2/close are edits on the baseline
FD_CLOEXECenforced: an implicitly-inherited CLOEXEC slot is dropped at execve forward time; open(O_CLOEXEC) sets the bytean explicit recorded dup2 keeps its child slot (POSIX dup2 clears close-on-exec)
Terminal stdoutnon-destructive: the recording shim forwards TerminalSession via SpawnGrantMode::Raw (process.rs Terminal arm) over the Copy/SameSession bootstrap capparent keeps its terminal across the spawn (proof make run-posix-fd-inherit-default); kernel mint proven by make run-posix-terminal-forward
Writable File/DirectoryNonTransferable -> kernel rejects grant -> whole-spawn ENOEXECdocumented divergence (single-writer policy). v0 POSIX open mints only Copy/SameSession RAM/read-only caps, so none enters the fd table; a future writable-fs open path needs a pre-spawn transferability check to skip it non-fatally (follow-up)
Directory fd (open("/"))EISDIR; forwardable dir fd via dirfd(opendir()) (inherits by default under full-table)open(dir, O_RDONLY) -> FdBacking::Directory landed (§5, posix-open-directory-fd); non-O_RDONLY stays EISDIR

Target design

1. Full-fd-table inheritance default

execve() should forward the parent’s entire live fd table to the child, not only touched slots. The recording shim already builds a virtual child_view seeded from every open parent slot (spawn_path_with_actions); the change is to remove the touched-only gate so the forward list is built from every child_view[slot] == Some(parent_slot) entry, then apply the recorded dup2/close actions as edits on top of that baseline. The replay order is:

  1. Seed child_view[k] = Some(k) for every open parent slot k (already done).
  2. Apply recorded Dup2(src, dst) / Close(fd) actions in order (already done).
  3. New: skip any slot whose parent fd carries FD_CLOEXEC (see §2).
  4. Build a forward for every remaining child_view[child_slot] == Some(parent) entry – not only touched ones.

This makes the dash-> child case work: dash’s open stdio fds (0/1/2) flow to the child by default, exactly as POSIX requires, with no dup2 from dash.

A subtlety the v0 forward list already half-handles: the one-parent-slot-per- forward rule. Under full inheritance multiple child slots can legitimately map to the same parent fd (e.g. dash’s fd 0 and a child’s inherited fd 0 are the same open description). For non-destructive (Copy/Raw) backings this is fine – the parent keeps its cap and each child slot gets an independent Copy. For destructive (Move) backings (Pipe), the existing unique-owner / one-forward rule must hold: a single Move’d Pipe end cannot legitimately appear under two child slot names. The forward builder must therefore Copy-share where the backing permits and reject only the genuine Move-aliasing case, rather than the v0 blanket “one parent slot per forward for every backing type” rule. This is the main behavioral subtlety to get right and test.

2. close-on-exec enforcement

FD_CLOEXEC is currently stored per-fd (fd.rs FD_FLAGS) but never acted on, because the v0 explicit-grant model has no full-table walk to enforce it against. Under full inheritance there is now a walk: at execve forward-build time, a parent slot whose FD_FLAGS byte has FD_CLOEXEC set is not forwarded (equivalent to the recorded-Close path for that child slot). This needs a small read API on the fd module (e.g. fd::is_cloexec(slot)); the FD_FLAGS array already exists. O_CLOEXEC passed to open() must set the same byte at open time so the two surfaces agree. This is the POSIX-correctness half: inherit-all without CLOEXEC enforcement would leak descriptors a correct program expects closed (e.g. a listening socket dash opened for itself).

3. The TerminalSession-stdout problem (core decision)

Real POSIX dup-inherits the controlling terminal to all children non-destructively: a shell keeps its tty while every child writes to the same tty. The kernel precursor for this is now landed: the bootstrap TerminalSession cap is minted Copy/SameSession (boot_cap_hold, kernel/src/cap/mod.rs) and forwards non-destructively via SpawnGrantMode::Raw, proven by make run-posix-terminal-forward (a parent forwards its terminal to a child and both write distinct lines; the parent’s post-spawn write proves it kept its cap). The remaining gap is on the POSIX side: the recording shim still forwards a Terminal fd via destructive Move (process.rs Terminal arm) and must switch to Raw under posix-recording-shim-full-fd-inherit. Until then, forwarding fd 1 when it is a TerminalSession would still strip the parent under the shim path.

Decision (kernel mint landed): mint TerminalSession Copy/SameSession, matching Console, so it forwards via SpawnGrantMode::Raw non-destructively. This is safe because TerminalSessionCap (kernel/src/cap/terminal_session.rs) is a stateless unit struct – it carries no per-session ownership state; write/writeLine dispatch onto the shared kernel terminal, and readLine resolves caller context at call time (call_with_context). The Move/ServiceRegrantOnly choice was a policy default, not a state-ownership requirement. Minting it Copy/SameSession lets the parent keep its terminal cap while each child receives an independent Copy to the same shared terminal – which is exactly the POSIX all-children-share-the-tty semantics, realized through the capability model rather than against it.

Security/scope: Copy/SameSession keeps the cap from escaping the session (the same scope Console already uses); a child gains no authority the parent did not already hold (a write/read view of the same terminal it was already attached to). requires_live_caller_session stays true, so the child’s readLine still resolves against the child’s own live session context. This must be confirmed in the kernel slice’s security review, including that a forwarded terminal cap cannot outlive the session improperly and that line-discipline interleaving of two writers (parent + child) is acceptable for the research surface (it is: the shared kernel terminal already serializes writes; cooked-mode interleaving at sub-line granularity is a known, documented research-surface limitation, not a capability leak).

Alternative considered and rejected: a separate narrower TerminalWrite write-only cap (interface-is-the-permission). This is cleaner long-term but introduces a new interface, a new bootstrap source, a new FdBacking variant, and child-side adoption – disproportionate for v0 when the existing TerminalSession write surface is already the right shape and can be shared by a mint-mode change alone. Recorded as future work if a write-only child terminal view is later wanted.

4. Writable File/Directory single-writer tension

Real POSIX shares writable fds across fork (parent and child write to the same open description). capOS’s disk-backed writable filesystem enforces a fail-closed single-writer policy: writable File/Directory caps are minted NonTransferable (writable_fs::transfer_result_cap), so the kernel rejects the spawn grant and execve surfaces ENOEXEC.

Decision: keep writable File/Directory NonTransferable; document the divergence. Under full inheritance this means a child does not inherit a parent’s writable disk fd – execve must treat a NonTransferable backing as a non-fatal skip (drop that one fd from the child, like CLOEXEC) rather than a fatal ENOEXEC for the whole spawn. The v0 path made it fatal because the fd was explicitly dup2’d (the app asked for it); under full inheritance the fd is inherited implicitly, so failing the entire spawn because one incidental writable fd cannot transfer would break unrelated programs. The honest divergence: capOS shares the read path of the filesystem across fork (read-only caps are Copy/SameSession) but not the write path, because the single-writer policy is a deliberate capOS guarantee that has no POSIX analog. RAM scratch Directory/File (the kernel:directory/kernel:file sources) are Copy/SameSession and do inherit, matching the common shell-scratch case.

A future revocation-aware writable share (refcounted or session-scoped) is possible but out of scope; recorded as a follow-up. v0’s stance is: writable disk fds are not inheritable, skipped non-fatally, documented.

5. cwd Directory representation and inheritance

A shell’s children should be able to list/open the cwd without the app doing anything special. A forwardable directory fd is obtainable both via dirfd(opendir()) and, since posix-open-directory-fd, via open(dir, O_RDONLY) (libcapos-posix/src/file.rs). Two parts:

  • cwd as an inheritable Directory fd. Under full inheritance, if the shell holds an open FdBacking::Directory fd for its cwd, it forwards to the child by default (read-only RAM/readonly_fs dirs are Copy/SameSession). The child’s libc cwd resolution can then target the inherited dir fd. This is the primary mechanism and needs no new surface beyond full inheritance.
  • open(dir, O_RDONLY) -> Directory fd (landed, posix-open-directory-fd). open on a directory now installs a FdBacking::Directory fd instead of failing: read returns EISDIR, write returns EBADF, lseek returns EISDIR, and fdopendir consumes it. A non-O_RDONLY directory open stays EISDIR; a missing path keeps its original error (ENOENT). This covers the N</dir redirection path (dash redir uses sh_open -> open) without the bespoke dirfd(opendir()) dance. Proof make run-posix-open-dir-fd. It was decoupled from the headline path, which never depended on it.

6. Backward compatibility and re-verification

Changing the default from explicit-grant-only to full-inherit interacts with the just-landed explicit-grant contract and existing smokes. What must be re-verified when the behavior slice lands:

  • make run-posix-pipe-smoke – relies on explicit pipe-end Move grants. Under full inheritance the parent’s other open fds (e.g. its terminal stdio) would now also forward. The pipe child must still see EOF when the parent closes the write end, and the parent must not lose its own terminal (fixed by §3). The recorded close(write_end) still drops that child slot. Re-verify.
  • make run-posix-spawn-smokeposix_spawn with explicit file actions. The file-actions path must still honor explicit dup2/close; full inheritance is the baseline the actions edit on top of. Re-verify.
  • make run-posix-execve-inherit-smoke – the bespoke parent that explicitly dup2s a Directory/Console. Under full inheritance the explicit dup2s become redundant (the fds would inherit anyway) but must remain correct. Re-verify.
  • make run-posix-stdio-smoke / run-posix-stdio-terminal-smoke – stdio backing selection. Re-verify.

The capability-purity argument is unchanged: full-inherit is not ambient authority. The child inherits exactly the capabilities in the parent’s fd table (the same caps under the same slots), nothing more. There is no global namespace, no inherited credential, no kernel-side fd knowledge – the kernel still only sees an explicit List(CapGrant) from ProcessSpawner.spawn. The shim now computes that list from the full table instead of the touched subset; the kernel’s transfer-mode enforcement (process_spawner.rs) still gates every grant. A child can receive only caps the parent already holds and that are transferable; NonTransferable writable caps are skipped, not smuggled.

Implementation path (decomposed)

The work splits into a kernel cap-mode slice and a libcapos-posix behavior slice, with one optional narrow slice, all gating the dash shell smoke. See the ready task records:

  • posix-terminal-session-forwardable (behavior, kernel, done 2026-05-27) – mint TerminalSession Copy/SameSession so it forwards non-destructively via SpawnGrantMode::Raw. Precursor for the terminal-stdout half of §3. Proven by make run-posix-terminal-forward.
  • posix-recording-shim-full-fd-inherit (behavior, libcapos-posix, done 2026-05-27) – full-table inheritance default (§1), FD_CLOEXEC enforcement (§2), non-fatal skip of non-forwardable backings (Udp / already-moved / shared Pipe) when implicitly inherited (§4), and Copy-share of multi-aliased non-destructive backings (§1 subtlety). The recording-shim Terminal arm now forwards Raw (non-destructive). Proven by make run-posix-fd-inherit-default. A NonTransferable writable backing stays a documented whole-spawn ENOEXEC boundary; the v0 POSIX open surface mints no such cap, so the §4 non-fatal skip is realized for the backings that can actually arise.
  • posix-open-directory-fd (behavior, libcapos-posix, done) – open(dir, O_RDONLY) -> FdBacking::Directory (§5); non-O_RDONLY stays EISDIR, missing path keeps ENOENT. Proof make run-posix-open-dir-fd. Was off the headline critical path.

posix-p1-4-dash-shell-smoke (docs/tasks/) depends on the first two; once they land it can run with no per-app dash patch (only the generic, already- landed Variant A fork-exec patch set and the slash-bearing /ls-shim invocation to skip dash’s PATH stat, which is a documented dash-config choice, not a capOS workaround).

Per-app patch stance

The directive forbids per-app dash patches that would have to be repeated for every POSIX program. This design needs none: full inheritance is a generic capOS-side fix in the shim. The only acceptable vendored-dash touch is a generic POSIX-correctness item (the existing Variant A fork-exec coupling under vendor/dash/patches/, owned by posix-p1-4-dash-vendor), not a per-app inheritance workaround. The EV_EXIT in-place-exec residual (posix-p1-4-dash-shell-smoke) is the one remaining dash-specific item; it is a recording-shim “exec without prior fork” limitation, handled in the shell-smoke slice (disable the optimization or a bounded generic patch), not by this proposal.

Design grounding

  • libcapos-posix/src/process.rs (spawn_path_with_actions, fork, execve, the recording-shim contract), libcapos-posix/src/fd.rs (FdBacking, FD_FLAGS/FD_CLOEXEC, inherit_stdio_grants), libcapos-posix/src/terminal.rs, libcapos-posix/src/directory.rs, libcapos-posix/src/file.rs.
  • kernel/src/cap/mod.rs boot_cap_hold (Console and TerminalSession both Copy/SameSession since 2026-05-27), kernel/src/cap/terminal_session.rs (TerminalSessionCap stateless unit struct), kernel/src/cap/process_spawner.rs (validate_spawn_transfer_scope, transfer-mode enforcement).
  • schema/capos.capnp ProcessSpawner.spawn(... grants :List(CapGrant)), CapGrant, CapGrantMode.
  • docs/proposals/posix-adapter-proposal.md (recording-shim Variant A, fd-backing-cap inheritance), docs/capability-model.md (interface-is-the-permission, transfer modes/scopes).
  • docs/tasks/done/2026-05-27/posix-execve-capability-inheritance.md and docs/tasks/done/2026-05-26/spawn-grant-forwardable-readonly-directory.md (the landed explicit-grant inheritance this proposal generalizes), posix-p1-4-dash-shell-smoke (the premise conflict this resolves).

Design Proposal: Installable capOS System

This is a design proposal with its bounded local/QEMU path landed. The persistent data-region mount, config-overlay schema + init compose/merge with fail-closed fallback (make run-installable-overlay), generation/rollback machinery (make run-installable-generation), integrated installable disk (make run-installable-disk), target-disk install (make run-installable-install), first-boot provision (make run-installable-provision), and update/rollback flow (make run-installable-update) are implemented. The storage and disk-image prerequisites it builds on have also landed (see Build-On Relationship): block-device-backed read-only and writable filesystems, a persistent content-addressed Store, reboot-surviving writable persistence, and a hybrid BIOS+UEFI disk image. This proposal has been reconciled against those landed contracts and is decomposed separately (see Closeout And Decomposition).

Throughout, landed behavior is written in the present tense and planned behavior in the future/conditional tense. The installed-system proof remains a bounded local/QEMU result: it does not claim secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, userspace smoltcp/L4 readiness, or a persistent Namespace.

Problem

The baseline capOS boot path is a boot-from-image research system. The build packs a Cap’n Proto SystemManifest (compiled from system.cue) plus the userspace binaries into an ISO; Limine loads the manifest as a module; the kernel parses it, builds init’s bootstrap caps, and enters the single initConfig.init process. The boot-binary ISO layout (behind the boot_iso feature) can instead read ELFs on demand from /boot/bins/ so the manifest carries names only. Without the installable-system path, the system that boots is still exactly the image that was built: the next boot re-reads the same immutable manifest and rebuilds the same capability graph.

That baseline is correct for a research image and insufficient for an installed system. An installed capOS is one that:

  • boots from a local disk rather than a re-imaged ISO each time;
  • carries mutable system configuration – installed services, local accounts, network/runtime settings – that persists across reboot and is not baked into the image; and
  • can be updated to a new system generation and rolled back to a known-good one.

The hard question is not the disk format. It is how persistent, mutable system configuration composes with the immutable boot manifest without reintroducing ambient authority or a single mutable blob that can brick the system. That composition is the center of this proposal.

Non-Goals

  • Designing the block device, filesystem, or Store persistence mechanisms. Those are owned by Storage and Naming and the storage tracks in Hardware, Boot, and Storage. This proposal composes them and must not redesign them.
  • Defining the local-account schema. That is Local Users, Storage, and Policy; the account store is a consumer of the persistent-config region designed here.
  • Secure boot, image signing, and manifest trust. Those are tracked as storage-proposal Open Question #5 and the security/verification track; this proposal notes where a signature check would attach but does not specify the cryptography.
  • Any cloud-image or non-ATAPI boot-binary loader work; see the Cloud Device Tracks backlog.

On-Disk Layout

The installed system needs three regions with distinct mutability and authority: a read-only boot region, an immutable-per-generation system region, and a mutable data region. How those regions map onto physical disks is the first reconciliation point, because the landed building blocks already fix part of it.

Landed shape:

  • make image produces a single hybrid BIOS+UEFI raw image with one GPT ESP (FAT32) carrying Limine + the kernel + manifest.bin. That is the boot region (tools/mkdiskimage.sh, make run-disk / make run-disk-bios).
  • The persistent content-addressed Store (CAPOSST1) and writable filesystem (CAPOSWF1) are co-located in the data-region image produced by tools/mkstore-image --writable. Focused storage and early data-region smokes can still attach that image as a separate virtio-blk device.
  • The installable-system disk path has folded those regions onto one bootable disk: GPT partition 1 is the ESP and GPT partition 2 carries the co-located CAPOSST1 + CAPOSWF1 data region at a fixed base LBA read through cap::data_region_base_lba (no GPT parser in the kernel). make run-installable-disk proves boot from that integrated disk.
  • capos-system-install writes the same layout to a manifest-selected target BlockDevice and then boots the installed disk standalone (make run-installable-install). Provisioning and update operate on the installed data region after that floor exists.

Region placement decision (reconciled). The storage model is the co-located CAPOSST1 Store + CAPOSWF1 writable filesystem data region, not a persistent Namespace and not three independent mutable partitions. The separate data-region disk remains a focused proof packaging, while the installed system packages the ESP and data region onto one target disk. The kernel relies on the fixed tool/kernel data-region LBA contract for the installable path.

flowchart TD
    InstallDisk[Installed disk] --> Boot[GPT partition 1: ESP with Limine + kernel + boot manifest]
    InstallDisk --> DataRegion[GPT partition 2: fixed-LBA data region]
    DataRegion --> System[CAPOSST1 content-addressed Store: immutable generation objects]
    DataRegion --> Data[CAPOSWF1 writable filesystem: config/account state and markers]
    Boot -.init mounts data region when present.-> DataRegion
    Data -.active and known-good marker files name hashes.-> System

The system and data regions share the co-located data region: immutable generation objects live in the persistent Store (CAPOSST1), and mutable config/accounts plus the active/known-good pointers live in the writable filesystem (CAPOSWF1). The overlay read/validate/merge that composes them at boot has landed for the installable-system path (see below).

Boot region (read-only at runtime)

The single GPT ESP carrying Limine, the kernel, and the immutable boot manifest – the same SystemManifest shape that exists today. This region is what make image produces now (one hybrid BIOS+UEFI image, one ESP). At runtime it is treated as read-only. The landed update proof stages and commits generation objects in the data region; production boot-region rewrite, rollback, signing, and release policy remain future work.

The boot manifest stays the root of trust for topology: it pins the kernel, the init binary, and the minimum kernel-sourced caps init needs to bootstrap. In the installable-system path, which system/config generation to activate is named by writable-filesystem marker files and persistent Store hashes (see Generations And Rollback); the SystemManifest still carries no generation field. The boot manifest does not grow to hold installed-service config or accounts.

System region (immutable per generation)

The landed persistent content-addressed Store (CAPOSST1, put/get/has/ delete keyed by SHA-256, durable across reboot) is the durable substrate for immutable generation objects. The installable-system proofs exercise config and account generation objects (SystemConfigOverlay plus related records) and the marker-selected hashes that choose them. Package-manager-style system payload generations – service binaries, default software configuration, and release payload roots written into CAPOSST1 by a production updater – remain future work.

The target system region is the system of record for what software is installed, not a POSIX /usr, once those software-payload generation roots exist. capOS has capabilities, not paths: a service is “installed” when the generation root object binds its name to the content hash of its manifest fragment and binary. Because the landed Namespace cap is RAM-only and does not survive reboot, persistent name-to-hash bindings live inside generation capnp objects in the Store and in writable-filesystem marker files, not in a persistent Namespace cap (none exists). A Namespace may still be populated in RAM at boot from those persistent bindings, and a StoreFS adapter (storage proposal “Bridging the Two Models”) can expose a generation as a directory tree for POSIX/WASI consumers, but the durable installable record is the Store objects plus writable-filesystem markers.

Data region (mutable)

The landed writable filesystem (CAPOSWF1, full Directory mutation set + File.write/truncate/sync/close under a fail-closed single-writer policy, co-located with the persistent Store). It holds everything that legitimately changes at runtime and must survive reboot:

  • Persistent system configuration – the central subject below: capnp overlay objects in the persistent Store plus writable-filesystem marker files for the active/known-good pointers.
  • Local account store – the provision proof writes an operator account record as a persistent Store object and config overlay input; broader durable account policy remains owned by local-users-management.md Gate 3.
  • Per-account home/config/cache subtrees and service state.

The data region is mutated under capability authority only. There is no global filesystem root and no ambient path-based access: a service receives a writable-filesystem Directory cap (or a Store cap) scoped to its own subtree, exactly as Storage and Naming describes for attenuated grants.

Why not “a filesystem is the system of record”

A traditional install makes / the source of truth and layers config files, package databases, and /etc over it. That is ambient authority through paths, which capOS rejects by design (storage proposal, “The Problem with Filesystems”). Here the capability object graph is authoritative; the durable installable-system record is the persistent Store objects plus writable-filesystem marker files, and a filesystem view is an adapter over that authority rather than ambient authority itself. The on-disk bytes may be a filesystem for tooling convenience, but the system model is capability-native.

Beyond-Boot-Manifest Configuration (Central Decision)

This is the core of the proposal. Today the system is fully described by the static boot manifest. An installed system needs persistent, mutable configuration that the boot manifest cannot carry, while keeping the boot manifest’s fail-closed guarantees.

The model is a two-layer composition resolved by init at boot:

  1. Base layer – the immutable boot manifest. Pins the kernel, the init binary (the init mandate from Run Targets, Init Mandate, and Default-Run Integration Gate B still applies: initConfig.init.binary must be init), the kernel-sourced bootstrap caps, and the floor of services and policy the system always runs. This layer is authoritative and cannot be overridden by persistent state – it is the recovery anchor.

  2. Overlay layer – the persistent config generation. A capnp-encoded configuration object naming additional installed services, local network/runtime settings, and account bindings. The object is content-stored in the persistent Store (CAPOSST1); a well-known writable-filesystem path (proposed system/config/, a CAPOSWF1 directory) holds the small marker files that name the current and known-good generation by content hash. The landed Namespace is RAM-only, so this is filesystem-path + content-hash grounding, not a persistent Namespace root. This is what capos-system-install, capos-system-provision, and capos-system-update write in the landed local proofs.

Precedence and merge model

init reads the base manifest from BootPackage (as it does today). The overlay step landed in 2026-05 (installable-config-overlay-schema-and-merge): when the data region mounts, init reads system/config/overlay.bin, decodes the SystemConfigOverlay capnp object, and – only if it validates against the base’s declared SystemManifest.extensionPoints – composes its additional services over the base plan (SystemConfigOverlay::compose_onto, proof make run-installable-overlay). Generation selection also landed (installable-system-generation-rollback): writable-filesystem marker files select the active/known-good object hashes and provide failed-boot fallback. The merge rules are deliberately conservative:

  • Base pins win. Anything the base manifest declares (kernel caps, the init binary, floor services, policy floors) is not overridable by the overlay. The overlay may only add services and fill in settings the base marks as overlay-supplied. This prevents a tampered or buggy overlay from dropping a recovery service or widening authority.
  • Overlay adds, within declared extension points. The base manifest declares named extension points (e.g. “additional services”, “network config”, “account store location”). The overlay may bind only those. An overlay key that does not match a declared extension point is rejected, not silently applied – closed by default, mirroring the existing “omitted cap sources fail closed” invariant (manifest-startup.md).
  • No new authority classes. The overlay can request services be started with caps the base manifest already authorizes init to delegate. It cannot mint kernel-source authority that the base did not grant init. The interface is the permission: an overlay names which already-authorized caps a service gets, never a new kernel cap source.

This is layering, not free-form override: the base manifest is a contract and the overlay fills declared holes in it.

Where persistent config physically and logically lives

  • Physically: the data region – the landed persistent Store (CAPOSST1) for the immutable per-generation capnp objects, and the landed writable filesystem (CAPOSWF1) for the small active/known-good marker files.
  • Logically: a CAPOSWF1 directory tree, e.g. system/config/, holds one marker file naming the current generation object by content hash (plus the retained known-good marker); the generation objects themselves are content-stored in the persistent Store. Account records live under a sibling system/accounts/ tree the same way (consumed by local-users-management.md). There is no persistent Namespace cap; the RAM Namespace is repopulated from these bindings at boot if needed.
  • Authority: only a narrow system-config authority (held by init and the dedicated install/provision/update proof services) may write the system/config writable-filesystem subtree and the system-config Store objects. Ordinary services receive read-only scoped views or nothing. This is the writable-filesystem Directory / Store attenuation model, not a new mechanism.

Detecting and recovering from a bad persistent layer

The overlay is the most dangerous new surface: a corrupt or hostile overlay must never prevent boot. The design is fail-safe by construction:

  • Validation before merge (landed). init validates the overlay against the base manifest’s declared extension points and the schema before applying any of it: a schema-invalid, version-mismatched, content-hash-mismatched, stale-epoch, or extension-point-violating overlay is rejected whole (SystemConfigOverlay::from_capnp_bytes + compose_onto). Validation reuses the existing capos-config discipline rather than a parallel checker.
  • Monotonic generation + integrity. No landed Store, Namespace, or SystemManifest schema carries a system-generation/epoch field (other caps such as AccountRecord and the DDF revocation generations do, but not the installable-system path). The overlay instead carries a monotonic epoch and a SHA-256 contentHash inside its own SystemConfigOverlay capnp object (both landed in track item 3): the epoch is checked against the base’s minOverlayEpoch floor and the content hash is a self-consistency check. The writable-filesystem marker files that record which hash is active/known-good landed in the generation/rollback path. This mirrors the stale-write and monotonic-version rules already required for the account store (local-users-management.md Gate 3) and the managed-cloud store (storage proposal “Managed Cloud Backing”) without extending Store or Namespace. A stale overlay (epoch below the floor) is rejected.
  • Boot-with-base fallback. If the data region does not mount, or the active overlay fails validation, init boots from the base manifest alone and surfaces the failure (serial diagnostics / audit). The system always reaches at least the floor configuration, which by construction includes a recovery path.
  • Known-good generation pointer. The active overlay pointer is advanced only after a generation is proven to boot (see Generations And Rollback); a failed new generation leaves active on the prior known-good one.

Install / Provision / Update / Rollback Flow

The local/QEMU install, provision, update, and rollback flows have landed. They prove the authority and durability shape over capOS capabilities; they do not claim a production release/update service, secure boot/signing, public ingress, or live multi-provider deployment readiness.

Install

The capos-system-install userspace service takes the packaged image source from the booted CD-ROM /boot/bins/ tree and writes the installable layout onto a manifest-selected target disk. It holds only the read-only installable_image_source Directory and the target-scoped block_device_target BlockDevice; it cannot reach the boot disk through that target cap.

The service writes the boot-region head (BOOTHEAD.BIN: protective MBR, primary GPT, FAT ESP with Limine + release kernel + base manifest), writes the backup GPT (BOOTGPT.BIN) at the LBA named by the primary GPT header, and initializes the empty data region (DATAIMG.BIN: empty CAPOSST1 Store + CAPOSWF1 filesystem with system/config) at the fixed cap::data_region_base_lba. It validates every sector range and verifies the read-back before treating the install as complete. The empty data region is the install floor; the first non-empty config generation is provisioning.

make run-installable-install proves the flow in two passes: pass 1 installs into the target virtio-blk disk, and pass 2 boots that disk standalone with no CD-ROM and reaches the base service with the data region mounted.

Provision

First-boot provisioning writes the initial persistent config: the operator’s first local account record, network/runtime settings, and any additional services to start. capos-system-provision runs as PID 1 over an installed system’s persistent data region with only Console, writable_fs_root, and persistent_store caps. On the empty install floor, it writes the first non-empty SystemConfigOverlay generation, commits the generation object to the Store, writes system/config/overlay.bin, and advances the gen-active marker. Until provisioning runs, the system boots on the base-manifest floor.

make run-installable-provision boots the same empty-config disk twice: pass 1 provisions the account/settings/additional service, and pass 2 re-reads the active generation and account record from the data region to prove they survived reboot.

Update

The landed update flow applies a new generation over the same persistent Store + writable system/config region used by provisioning. The local proof does not rewrite a production boot region or ship a signed release updater; it proves staged generation commit, failed-candidate fallback, and base-overlay revalidation.

  1. Write the new generation into the content-addressed Store as a new root hash; the old generation’s objects remain (content-addressing dedups shared objects).
  2. Stage a new active-candidate pointer; do not advance active yet.
  3. Reboot into the candidate. If it reaches a health checkpoint, commit by advancing active. If not, the boot-with-known-good fallback keeps the prior generation (see below).

Persistent config (the overlay and accounts in the data region) is carried across updates: the data region’s config/account generations persist across candidate staging, commit, and fallback. Where a new base no longer admits an overlay’s declared authority, the overlay is re-validated against that base and falls back to the base floor with a surfaced error rather than applying partially.

make run-installable-update boots the same empty-config disk three times: boot 1 provisions known-good generation 1, rejects an overlay against a revoked-cap base, and stages a healthy generation 2; boot 2 commits generation 2 across reboot and stages a failing generation 3; boot 3 auto-falls back from generation 3 to known-good generation 2 while preserving the data region.

Generations and rollback

The active system/config generation is named by writable-filesystem marker files (CAPOSWF1) carrying a content hash and monotonic pointer epoch – not by a SystemManifest field, since the manifest schema carries no system-generation field. The generation objects themselves are immutable content-addressed roots in the persistent Store. Rollback is:

  • System rollback: point the active system-generation hash back to the prior known-good generation. Because generations are immutable content-addressed roots, the prior generation’s bytes are still present; rollback is a pointer move plus reboot, not a re-extraction.
  • Config rollback: point the active overlay binding back to the prior overlay generation, retained for a bounded number of generations.
  • Failed-boot auto-fallback: a generation is promoted to known-good only after it reaches a defined health checkpoint. A boot that does not reach the checkpoint (kernel panic, init failure, overlay validation failure) is detected on the next boot via a “boot attempt count vs confirmed” marker, and the init/generation logic reverts to the last confirmed generation. This is the standard A/B-generation pattern, expressed over content-addressed Store roots rather than two fixed partitions.

make run-installable-generation proves this machinery before the full update flow: it stages a candidate, records a boot attempt before applying it, rejects a stale pointer, proves config rollback to a retained generation, and auto-falls back to the known-good generation across a fresh reboot when a candidate is left unconfirmed.

Build-On Relationship To Landed And Planned Work

This proposal is an integration design over existing tracks. It must not redesign them. Current state of each piece it builds on:

Building blockOwning trackStatus today
Persistent content-addressed StoreStorage and Naminglanded: CAPOSST1 superblock at LBA 0, put/get/has/delete keyed by SHA-256, durable across reboot (persistentStore grant source; reboot proof make run-storage-persist). RAM-backed Store CapObject + userspace RAM Store service also landed
Namespace modelStorage and Naminglanded but RAM-only: resolve/bind/list/sub, not persistent (namespace grant source). No persistent Namespace cap exists
BlockDevice boundaryHardware, Boot, and Storage “Reusable Block-Device Path” / “Local Disk Storage”landed: readBlocks/writeBlocks/info/flush over a real cfg(qemu) virtio-blk device (blockDevice grant source; proof make run-virtio-blk)
Read-only filesystem over BlockDevice“Local Disk Storage Milestone”landed: CAPOSRO1 superblock, Directory.list/open/sub + File.read/stat, mutating methods fail closed (readOnlyFsRoot; proof make run-storage-fs)
Writable persistence across reboot“Writable Local Storage Milestone”landed: CAPOSWF1 writable filesystem at LBA 256, full Directory mutation set + File.write/truncate/sync/close, fail-closed single-writer (writableFsRoot; reboot proof make run-storage-writable). Co-located with CAPOSST1 via tools/mkstore-image --writable
Bootable disk image (make image, make run-disk)“Bootable Disk Image”landed: single hybrid BIOS+UEFI raw image with one GPT ESP carrying Limine + kernel + manifest.bin; make image/run-disk/run-disk-bios; GCP/AWS provider packaging. The boot-binary ISO layout’s on-demand reads also landed behind boot_iso
Boot manifest / SystemManifest / init mandateManifest and Service Startup, Run Targets, Init Mandate, and Default-Run Integrationlanded: static manifest, init-owned service graph, name-only boot-ISO path. The installable path additionally reads and validates a persistent overlay only when the data region is mounted and the base manifest declares matching extension points (make run-installable-overlay)
Local account store (a consumer)Local Users, Storage, and Policy Gate 3partially landed for installable proof: capos-system-provision writes and re-reads one operator account record through persistent Store/writable-filesystem state; full durable account policy remains future

The storage and disk-image prerequisites have landed, and the bounded installable-system composition has landed on top of them: overlay read/validate/merge, generation marker files, install, provision, and update/rollback flows all have local QEMU evidence. The decomposition task (installable-system-decomposition) required ddf-blockdevice-boundary-virtio-blk-smoke, storage-readonly-fs-over-blockdevice, storage-persistent-store-reboot-proof, storage-writable-persistence-reboot-proof, and disk-image-provider-packaging to be done before emitting implementation tasks; they are. Because some prerequisites landed with contracts that differ from this proposal’s original projections (single hybrid ESP rather than three boot/system/data partitions, RAM-only Namespace rather than a persistent one, no system-generation field on the Store/Namespace/SystemManifest path), this proposal has been reconciled to the landed shapes above so the track does not encode a stale contract. Production hardening remains separate: secure boot/signing, authorized release publication, public ingress, broader cloud-provider coverage, direct-remapping production hardware, and full durable local-account policy are not implied by the local installable-system evidence.

Milestone Framing

installable-system is its own milestone: “an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots.” It is a distinct, user-visible product outcome from the storage and bootable-disk image milestones it builds on, even though it depends on them – a user can have block devices, a filesystem, and a bootable disk image without having an installed, self-configuring, updatable system.

This framing is recorded in Roadmap. The milestone became the selected milestone after Device Driver Foundation closed and is now closed for the bounded local/QEMU installable-system contract by the structural docs reconcile and the landed install/provision/update/rollback evidence. The successor selected milestone is the GCE self-hosted Web UI path; public ingress and TLS remain approval-gated follow-ups under that track.

Design Grounding

Closeout And Decomposition

This proposal is reachable from docs/SUMMARY.md, and the installable-system milestone framing is recorded in docs/roadmap.md.

Turning this design into actionable backlog + implementation tasks is a separate task, installable-system-decomposition, which decomposed the track against the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts in Installable System. The behavior track then landed the data-region mount, overlay compose, generation/rollback machinery, integrated disk packaging, target-disk install, first-boot provision, and update/rollback flows. This proposal has now been structurally reconciled to those landed shapes: integrated installed disk packaging over an ESP plus fixed-LBA data region, writable-filesystem + content-addressed Store grounding for persistent naming and generation markers, RAM-only Namespace, and no system-generation field on the Store/Namespace/SystemManifest path. The proposal text and backlog track therefore describe the same bounded local/QEMU contract.

Proposal: Resource Accounting and Quotas

Cross-cutting resource profiles, ledgers, reservation semantics, and verification gates for bounded capOS sessions, services, drivers, storage, networking, tests, and future language runtimes.

  • Authority Accounting records the current transfer and resource-accounting invariants.
  • Memory Management documents the current frame-grant and MemoryObject accounting baseline.
  • Go VirtualMemory Contract provides the first concrete virtual-reservation versus physical-commit ledger split for a future language runtime.

Problem

capOS already has several resource limits: cap slots, frame grants, timer waiters, thread and kernel-stack quotas, ring scratch, and spawn preflight checks. Those are useful but fragmented. Local accounts, guests, anonymous callers, external sessions, service accounts, drivers, storage services, network stacks, tests, and future runtimes all need the same rule:

No workload receives implicit unlimited consumption of finite system resources.

This proposal defines the common model. It extends the Security Verification Track S.9 authority graph and per-process ResourceLedger design rather than replacing it.

Principles

  • A ResourceProfile is a policy template, not authority.
  • Actual enforcement happens through ledgers, capability wrappers, brokers, supervisors, and kernel/resource-service admission checks.
  • Every resource class has one ledger of record. Mirrors for status, metrics, or audit are derived views and must not be used for enforcement.
  • Reservation happens before side effects. Commit publishes the resource. Release and rollback are mandatory on all success, failure, timeout, revocation, and process-exit paths.
  • Identity metadata selects policy. It never consumes, releases, or bypasses quota by itself.
  • Quota donation is explicit. A caller may donate budget to a service call, but a service cannot silently spend the caller’s unrelated budget.

Resource Profiles

Resource profiles are named templates selected by account records, manifest seed data, service policy, external admission rules, or test manifests. A profile should contain policy intent, not raw authority:

struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  memoryVirtualReservationLimitBytes @20 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}

The profile is evaluated by a broker or supervisor. The result is a set of ledger limits, wrapper caps, service-specific budgets, and spawn constraints. Changing a profile does not change a running workload until a trusted service issues new limits, revokes old caps, or starts a replacement workload.

Current kernel coverage includes manifest profile decoding, spawn-time profile resolution, per-process ring and reply scratch sizing, endpoint queue and in-flight call limits for profile-created endpoint caps, child cap-table slot limits, and per-process thread table limits. The QEMU proof make run-resource-profile covers an in-limit spawn, an over-cap spawn rejection before result authority escapes, rollback after that rejection, and a thread-limit rejection through ThreadSpawner.create.

Ledgers of Record

The ledger of record depends on the resource owner:

ResourceLedger of record
Capability slotsProcess CapTable / process resource ledger
Processes and child subtreesSupervisor or ProcessSpawner ledger
Threads and kernel stacksProcess-owned thread/kernel-stack ledger
Anonymous virtual reservationsAddress-space or VM service reservation ledger
Anonymous committed memoryAddress-space or VM service ledger
Physical frames and frame grantsFrame allocator / holder ledger
MemoryObject mappingsPer-process frame-grant ledger plus address-space tracking
Endpoint queuesEndpoint object ledger
In-flight calls and result capsCaller/callee transport ledger
Ring submissionsFixed ring depth and per-dispatch budget; no profile ledger
Ring scratch and request buffersProcess ring/resource ledger
Timer sleeps and waitersTimer service waiter ledger
Log bytesLog service token bucket / retention ledger
Storage bytes and namespace entriesStore/Namespace service ledger
Temporary, cache, and home storageStore/Namespace scoped sub-ledgers
Network listeners, sockets, and bytesNetwork service or socket cap ledger
CPU share and runtime budgetScheduler or scheduling-context ledger
DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissionsDevice-manager ledgers, later
Model tokens, provider calls, tool callsProvider/agent gateway ledgers, later

No second module should maintain an independent enforcement counter for the same resource. A status service may cache values for display only if it treats the ledger owner as authoritative and never grants or rejects based on stale cache state.

Relationship To Tickless And Realtime CPU Authority

The CPU terms in Tickless and Realtime Scheduling reuse this resource-accounting model:

  • ResourceProfile.cpuBudgetUsPerWindow: coarse policy template only. Selecting a profile does not mint executable CPU-time authority.
  • ResourceLedger CPU budget: coarse best-effort accounting before realtime contexts exist, and the ledger of record for non-realtime CPU share/runtime limits.
  • SchedulingContext: spendable CPU-time object for realtime or admitted execution. It carries budget, period, relative deadline, priority/criticality, CPU mask, and overrun policy.
  • CpuIsolationLease: CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget and must charge consumed runtime to a SchedulingContext or scheduler ResourceLedger.
  • NoHzEligibility / NoHzActivation: reviewed eligibility plus scheduler-proven current CPU state. They do not grant resource credit.
  • RealtimeIsland: admitted bundle consuming SchedulingContexts plus memory, device, ring, and optional CpuIsolationLease reservations.

Do not create a second CPU budget system under nohz, SQPOLL, or realtime terminology. Those features select placement and execution mode; CPU time is still charged through scheduling-context or scheduler-ledger authority.

Reservation Lifecycle

Every resource allocation follows the same lifecycle:

reserve(request, limits, expected_state)
  -> reserved(token)
  -> denied(reason)

commit(token)
  -> committed(resource)
  -> rollback(token, reason)

release(resource)
  -> released

Rules:

  • reserve validates structure, bounds, ownership, and available quota before any externally visible mutation.
  • commit publishes exactly the resource that was reserved.
  • rollback restores all ledgers touched by the reservation.
  • release is idempotent from the caller’s perspective but changes ledger state at most once.
  • Process exit and cap revocation bulk-release all resources owned only by the exiting process or revoked hold edge.
  • Stale handles, exhausted quotas, malformed limits, and unknown profile versions fail closed with typed errors or denials, not panics.

The Security Verification Track S.9 transfer transaction is the concrete model for cap transfer and spawn. Other services should reuse the same preflight, reservation, commit, rollback, and audit vocabulary.

Donation and Shared Services

Shared services handle many sessions in one process. They need bounded server-side state without treating caller identity as authority.

Donation is a lease from one ledger to another for a named operation:

Donation {
  donorSessionId
  donorLedgerId
  receiverServiceId
  resourceClass
  amount
  expiresAtMs
  callId
}

A donation can pay for queue entries, scratch bytes, temporary storage, outbound bytes, model tokens, or CPU budget needed to serve one request. It does not grant unrelated authority to the service and does not let the caller spend the service’s own management budget. When the call finishes, times out, is cancelled, or the session exits, unused donation is returned and used donation is charged to the donor’s accounting record.

Services may also have their own base budgets for resident state. Per-client budgets and service base budgets are separate ledger entries so a single client cannot hide consumption inside the service account.

Profile Binding

Profiles are selected by policy inputs:

  • manifest-seeded operators and recovery identities,
  • local account records,
  • service account records,
  • guest and anonymous admission rules,
  • external identity bindings,
  • test manifests and QEMU smoke profiles,
  • future driver, storage, network, and runtime launch policies.

The broker or supervisor translates those profiles into concrete limits at session creation, spawn, service start, or cap minting time. The translation must record:

  • profile ID, version ID, and policy epoch,
  • ledger owner and resource class,
  • hard limit and optional token-bucket window,
  • source policy and approving broker/supervisor,
  • audit record ID for the grant,
  • expiry or revocation epoch if the budget is leased.

A session can carry profile summaries for audit and display, but the summaries do not enforce quota. Enforcement lives where the resource is created or used.

Resource Classes

Kernel and Process Resources

Cap slots, process count, thread count, kernel stacks, ring scratch, outstanding calls, and endpoint queue entries are kernel or kernel-object resources. Ring submissions are bounded separately by the fixed SQ depth and the per-dispatch budget, so they do not have a profile quota. The remaining checks belong before spawn, thread creation, transfer, IPC, and ring dispatch side effects.

Memory

Current VirtualMemory mappings and held MemoryObject caps charge the process-owned frame-grant ledger of record. The address space records borrowed object-backed pages at the same tracking limit so unmap and teardown can distinguish them from anonymous pages, but that tracking is not a second enforcement counter. Future reserve/commit/decommit semantics split virtual reservation from committed physical memory: VirtualMemory.reserve charges a virtual-reservation ledger, while VirtualMemory.commit and compatibility VirtualMemory.map charge the committed-memory/frame ledger before pages become accessible. Decommit releases physical commit budget while preserving virtual reservation budget until unmap.

Storage

Storage services own byte, object, namespace-entry, and snapshot ledgers. home, config, cache, and tmp are separate sub-ledgers even when backed by the same Store. Temporary session storage expires on logout or session expiry. Cache quota may be reclaimed by policy. Home/config quota should not be reclaimed without explicit account/storage policy.

Logging and Audit

Log volume uses token buckets and retention limits. Audit entries required for security state transitions should have a protected emergency path; ordinary application logs must not starve audit. If audit storage is unavailable, the system enters a bounded emergency mode rather than silently dropping mandatory security events.

Network

Network profiles select listener authority, outbound connection classes, socket counts, byte windows, and remote scopes. A normal local account may receive client network caps; listener authority requires service policy, operator policy, or an application-specific grant. Anonymous remote sessions receive only protocol state needed to authenticate or create an account.

CPU and Scheduling

CPU share and runtime budget belong to the scheduler or future scheduling context. Until full scheduling-context donation exists, CPU limits can be coarse token buckets and supervisor policy. Later realtime, media, and driver work should use explicit period/budget/deadline records rather than ad hoc sleep or polling loops.

Devices and Providers

DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media frames, and external API calls are scarce resources too. The first proof may use service-level ledgers, but the rule is the same: one ledger of record, typed reservation, explicit release, audit-visible denial.

For the Security Verification Track S.11.2 userspace-driver transition, device ledgers must account at least DMA pool bytes, DMA buffer count, descriptor or ring depth, MMIO mappings, interrupt holds, and in-flight DMA submissions. A DMAPool reservation is not only memory allocation; it is also device-visible write authority and must be released through the same revoke/quiesce/reset path that makes future reuse safe.

Canonical device ledger concepts:

dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count

These fields are device-manager accounting concepts even if the first implementation uses different internal names. They must have one ledger of record. DMA pool bytes and buffer counts are not interchangeable with ordinary MemoryObject ownership, because device-visible memory also carries IOVA, descriptor, reset, and stale-completion obligations.

Failure Semantics

Quota failure is a normal result, not a crash:

ConditionResult
Malformed requestInvalid input / typed transport error
Caller exceeds hard limitQuota denied / overloaded
Service base budget exhaustedService overloaded
Donated budget exhaustedRequest denied or partial result
Stale profile versionDenied; refresh session/profile
Ledger mismatch or rollback failureEnter recovery/emergency mode

Retry policy belongs to the caller or supervisor. Kernel and service code must not spin, allocate unbounded retry queues, or emit unbounded diagnostics after quota failure.

Audit and Status

Auditable events:

  • profile-to-ledger translation,
  • reservation denial,
  • successful budget grant,
  • donation start/commit/release,
  • cap or resource revocation,
  • process-exit cleanup,
  • rollback or recovery-mode entry,
  • administrative profile change.

Status views should expose current usage, limits, denial counts, and suppressed diagnostic counts by resource class. They must redact sensitive account, network, provider, and object identifiers unless the viewer holds a suitable audit/status cap.

Verification Gates

Before treating resource profiles as complete for any caller class, add checks at the affected resource owner:

  • Host tests for limit parsing, stale profile rejection, reservation/rollback, and one-ledger-of-record invariants.
  • QEMU smokes proving quota denial for process/thread/cap, endpoint queue, timer waiter, memory, storage, log, and network resources as they exist.
  • Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or leave partial child processes.
  • Process-exit and revocation tests proving all charges release exactly once.
  • Audit/status tests showing denial and cleanup are visible without exposing secrets.
  • Kani or property tests for small pure ledger primitives when bounds are fixed enough to model.

Relationships

  • Authority Accounting: Security Verification Track S.9 defines the current authority graph and process-ledger transaction model. This proposal generalizes the quota vocabulary to services, storage, networking, sessions, and future devices.
  • User Identity and Policy: account and session resource profiles select templates. Brokers and supervisors translate them into ledgers and wrapper caps.
  • OOM Handling and Swap: memory commitment, reclaim, and swap policy are the memory-specific part of this model.
  • Storage and Naming: Store/Namespace services own storage ledgers for homes, config, cache, tmp, snapshots, and imports.
  • System Monitoring: status and metrics expose derived ledger views, not parallel enforcement counters.

Non-Goals

  • No Unix cgroups clone as the primary abstraction.
  • No identity-based quota enforcement in the kernel.
  • No global mutable quota database trusted by every subsystem.
  • No claim that existing code already enforces every resource class above.
  • No unbounded best-effort mode for guests, anonymous callers, tests, or service accounts.

Open Questions

  • Which ledger IDs and status schemas should become stable ABI first?
  • How much CPU-budget enforcement is useful before scheduling contexts exist?
  • Should quota donation be represented as a general capability type or as method-specific sideband on selected service calls?
  • Which storage quota primitive is first: bytes, object count, namespace entries, or snapshots?
  • Which proofs belong in capos-lib versus resource-service-specific tests?

Proposal: Memory Authority Model

capOS already has implemented memory management. This proposal defines the missing cross-cutting contract: which capability may create or hold memory, which memory may move or be reclaimed, when a mapping mutation is complete, and which proofs are required before shared-memory, DMA, swap, and language-runtime features build on that substrate.

This is deliberately not a CPU or language memory-ordering model. Atomic ordering, cache coherence, and Rust aliasing rules remain their own topics. This page covers OS memory authority, residency, consistency, and lifecycle.

  • Memory Management documents current physical frames, address spaces, user-buffer validation, VirtualMemory, and MemoryObject.
  • Go VirtualMemory Contract freezes the reserve/commit/decommit contract this proposal treats as the current anonymous-memory baseline.
  • OOM Handling and Swap owns memory-pressure, OOM, reclaim, and encrypted swap policy. That proposal consumes the authority/residency vocabulary defined here; this proposal does not re-specify reclaim or swap policy.
  • Capability-Based Service Architecture owns authority-at-spawn, service composition, and the per-service capability graph that selects which memory authorities (anonymous VirtualMemory, MemoryObject, future pin/DMA/swap caps) a service receives. The classes and state machines below are the contract that service-graph budgeting and spawn-time memory grants must respect.
  • Resource Accounting and Quotas owns the one-ledger-of-record and reserve/commit/rollback/release vocabulary used by the accounting rules below.
  • Design Risks Register R4 tracks the consolidated open gap in cross-service donation/fairness, log volume accounting, and the memory authority/residency proof obligations this proposal must close before downstream features may depend on them.
  • DMA Isolation owns device-visible memory, IOMMU, stale DMA, and scrub-before-reuse requirements.
  • Park Authority records why shared park words need mapping identity or object pins before they can be safe across processes.
  • Memory Authority Model Backlog decomposes the research, design, and proof work behind this proposal.

Problem

The current tree has strong local rules, but they are spread across several documents and implementations:

  • anonymous VirtualMemory ranges separate reservation from physical commit;
  • MemoryObject caps can be shared and mapped into multiple address spaces;
  • user-buffer copies validate and use pointers while holding the address-space lock;
  • TLB shootdown is routed through address-space residency masks;
  • private ParkSpace cleanup handles anonymous unmap/decommit and explicit MemoryObject.unmap;
  • DMA isolation requires resident unswappable pages, generations, quiesce, and scrub-before-reuse;
  • OOM policy rejects default overcommit and forbids an ambient OOM killer.

Those rules are individually useful, but future work needs a single answer to questions such as:

  • Is this page only reserved, physically committed, resident, pinned, swapped, or device-visible?
  • Which cap or ledger is the authority that made it that way?
  • Can this backing frame be reclaimed, unmapped, shared, donated, or exposed to a device?
  • When is it safe to reuse a frame after unmap, decommit, protect, process exit, revoke, or failed rollback?
  • Which proof must exist before a shared park word, NIC ring, block buffer, Store blob, GPU buffer, Go heap arena, or swap slot depends on the rule?

Without a consolidated contract, later features can accidentally treat MemoryObject, SharedBuffer, DMA pool pages, anonymous VM pages, and swapped pages as interchangeable. They are not interchangeable authority classes.

Design Grounding

The project grounding for this proposal is:

  • docs/architecture/memory.md
  • docs/backlog/go-virtual-memory-contract.md
  • docs/proposals/oom-and-swap-proposal.md
  • docs/proposals/resource-accounting-proposal.md
  • docs/proposals/service-architecture-proposal.md
  • docs/design-risks-register.md (entry R4 – Resource accounting is fragmented)
  • docs/dma-isolation-design.md
  • docs/architecture/park.md
  • docs/architecture/scheduling.md
  • docs/architecture/userspace-runtime.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/go-runtime-proposal.md
  • docs/security/verification-workflow.md
  • REVIEW.md

Relevant research grounding:

  • docs/research/zircon.md for VMO/VMAR separation, mapping authority, and VMO commit/decommit precedent.
  • docs/research/genode.md for parent-routed sessions and resource quota discipline.
  • docs/research/sel4.md for explicit authority, no ambient allocation authority, and the proof value of small stable object invariants.
  • docs/research/eros-capros-coyotos.md for single-level-store and keeper lessons; capOS should not make transparent persistence or implicit paging the baseline by accident.
  • docs/research/llvm-target.md for Go/runtime memory hooks that require reserve, commit, decommit, and fault behavior.

Goals

  1. Define the authoritative state vocabulary for user memory, shared memory, pinned memory, DMA memory, and future swapped memory.
  2. Make every memory-state transition name the capability and ledger that authorizes it.
  3. Specify when validation, mapping mutation, TLB shootdown, cleanup, and frame reuse are complete.
  4. Keep future SharedBuffer, file, network, GPU, and DMA paths from bypassing MemoryObject provenance or residency rules.
  5. Tie memory-pressure and OOM behavior to explicit budgets and process lifecycle, not allocator surprises.
  6. Convert this contract into host tests, QEMU smokes, Kani models, and review gates.

Non-Goals

  • No demand paging or swap implementation in this proposal.
  • No transparent global persistence or EROS-style single-level store.
  • No copy-on-write clone design.
  • No sub-VMAR or nested address-region capability in the first version.
  • No generic userspace pager in the page-fault path.
  • No change to CPU atomic ordering, language memory models, or Rust aliasing rules.

Memory Authority Classes

The model starts by naming memory by authority and residency, not by how it is convenient for one subsystem to access it.

ClassAuthorityBackingReclaim/swapNotes
Reserved anonymous VAVirtualMemory bound to one address spaceNo frameNot reclaimable; no physical stateCharges virtual-reservation quota only.
Committed anonymous pageVirtualMemory plus frame-grant budgetPrivate frameReclaim/swap only under future explicit policyVM_PROT_NONE is committed but inaccessible.
Borrowed MemoryObject mappingMemoryObject cap plus caller address spaceShared object backingNot phase-1 swapMapping pins backing lifetime and charges mapping budget.
Capability transport pageKernel bootstrap/ring setupKernel-owned frame mapped into user VANever swapRing and CapSet pages are reserved outside caller VM control.
Private kernel metadataKernel boot/process/scheduler/cap stateKernel heap or framesNever swapFailure is kernel capacity pressure unless later object funding exists.
Reclaimable clean cacheStore, boot package, file, or loader serviceReconstructable backingDrop/refetch, not swapMust have trusted backing identity before reclaim.
Pinned user pageExplicit pin authority or future residency optionResident frameNever swap while pinnedNeeded for shared park, DMA staging, and secret/locked mappings.
DMA-visible pageDMAPool/device-manager authorityResident frame or IOVA mappingNever swap; quiesce before freeDevice bytes remain untrusted until driver validation.
Secret pageFuture secret-memory authorityResident private frameNever swap; scrub aggressivelyStronger policy than ordinary pinned memory.
Swapped anonymous pageVirtualMemory plus swap budget and slot metadataEncrypted/authenticated slotNon-resident until faultFuture phase only; anonymous private pages first.

Two rules follow from this table:

  • A capability that maps CPU-visible bytes does not automatically grant device visibility, swap eligibility, pin authority, or persistence.
  • A residency promise is an authority and accounting event, not a flag that a helper may set opportunistically.

State Machines

Anonymous VirtualMemory

Anonymous memory follows this lifecycle:

  1. unreserved
  2. reserved
  3. committed inaccessible (VM_PROT_NONE) or committed accessible
  4. decommitted reserved
  5. unreserved

Required properties:

  • reserve charges virtual reservation and installs no present user PTE.
  • commit is all-or-nothing: frame allocation, ledger charge, page-table mutation, committed-page metadata, and deferred TLB completion reservation either all become visible or all roll back.
  • protect changes accessibility only for committed pages; it does not create backing.
  • decommit releases committed frames and physical charges while preserving virtual reservation.
  • unmap releases the reservation and any committed backing inside it.
  • ring and CapSet virtual ranges remain outside caller-controlled VirtualMemory operations.

MemoryObject

MemoryObject backing has a separate lifecycle from any one mapping:

  1. allocated by FrameAllocator
  2. held through one or more process cap-table entries
  3. borrowed into one or more address spaces by MemoryObject.map
  4. unmapped or torn down from those address spaces
  5. released after the final cap and final borrowed mapping drop their holds

Required properties:

  • held MemoryObject caps charge holder frame-grant pages;
  • each borrowed mapping charges the mapper while it remains mapped;
  • object-backed pages are tracked separately from anonymous reservations;
  • unmap/protect must prove the affected range is backed by the same object;
  • releasing a cap cannot leave uncharged live mapped backing behind;
  • future direct read/write, slice, clone, file-backed, or DMA-backed subclasses must define how they differ from the current physically backed object.

Pinned and Device-Visible Memory

Pinned memory is not just a mapped page with a refcount. It is a promise that reclaim, swap, and frame reuse will not move or free the page until the pin is released.

The first reusable state machine should be:

  1. resident
  2. pin_reserved
  3. pinned
  4. pin_draining
  5. resident

DMA-visible memory extends this with device ownership:

  1. allocated
  2. mapped_to_device
  3. submitted
  4. quiescing
  5. device_mapping_removed
  6. scrubbed
  7. released

The device state machine must advance generations or epochs before reuse so stale handles, stale interrupts, and stale completions fail closed.

Future Swap

Swap is a later extension of committed anonymous memory:

  1. committed resident
  2. eviction reserved
  3. encrypted slot written and authenticated
  4. committed swapped
  5. faulting restore
  6. committed resident

No phase-1 swap path may include MemoryObject backing, shared IPC pages, ring/CapSet pages, DMA pages, kernel metadata, or secret pages. If swap-in cannot restore data or authenticate the slot, the faulting process exits with a typed OOM or corruption reason; the kernel must not fabricate contents.

Mapping Consistency

The consistency rule is:

A frame may return to the allocator, be exposed to a different authority, or be made device-visible only after every stale CPU and device observer has been invalidated or made irrelevant by generation.

For current CPU mappings this means:

  • address-space mutation holds the address-space lock while checking and changing metadata;
  • local page-table changes flush locally before returning to user mode;
  • remote CPUs that have run the address space receive and acknowledge the required TLB shootdown generation before freed frames can be reused;
  • kernel upper-half mappings that share a top-level PML4 entry with existing address spaces mutate only shared lower-level tables and use kernel-wide TLB shootdown; a new kernel-half PML4 slot must either be pre-seeded before user address-space creation or fail closed until a synchronized live-root propagation path exists;
  • cleanup paths reserve deferred completion slots before mutation when they may need to free frames after shootdown;
  • private ParkSpace waiters are interrupted before an unmapped virtual address can be reused for unrelated state.

For future shared keys, DMA, and swap this means:

  • shared park keys must be derived from MemoryObject identity and offset, or the backing object must be explicitly pinned for the duration of key derivation and wait registration;
  • DMA page reuse requires descriptor quiesce, interrupt/completion generation checks, IOMMU or bounce-buffer invalidation, and scrub-before-release;
  • swapped pages carry slot generation/integrity metadata, and stale slots must not be accepted as current backing.

Accounting Rules

Every state transition must name exactly one ledger of record:

  • virtual reservation pages: process/address-space VM ledger;
  • anonymous committed pages: process frame-grant or future memory-commit ledger;
  • held MemoryObject backing: holder frame-grant ledger;
  • borrowed MemoryObject mappings: mapper frame-grant ledger plus address-space tracking;
  • pinned pages: future pin ledger owned by the pinning authority;
  • DMA pool bytes, DMA buffers, descriptors, MMIO mappings, interrupt holds, and in-flight DMA submissions: device-manager ledger;
  • swap commitment and swap slots: future swap/budget ledger;
  • kernel metadata: kernel capacity budget until explicit object funding exists.

Status views, metrics, and audit logs may mirror these values, but they are not allowed to grant or deny resources from stale copies.

Failure Semantics

Capability-call boundaries return controlled failures:

  • malformed ranges, overflow, unsupported protection, and authority mismatch return deterministic validation errors;
  • local budget exhaustion returns quota denial or typed OOM;
  • global pressure may attempt reclaim first if the memory class is eligible;
  • rollback failure enters a bounded recovery or emergency path rather than publishing half-mutated authority.

Execution faults are process lifecycle events:

  • access to unreserved memory is an ordinary fault;
  • access to reserved uncommitted memory is a reservation fault, not implicit demand commit;
  • access to committed VM_PROT_NONE is a protection fault;
  • future failed swap-in or failed zero-fill terminates the faulting process, not a random victim.

Boot-time core allocation failures remain boot-fatal until capOS moves kernel object memory under explicit authority.

Proof Obligations

Before a feature depends on this model, it must add or cite the relevant proof:

AreaRequired evidence
Frame allocatorHost/Kani proof that allocator metadata frames are never returned, allocation is unique, and free rejects invalid or double-free inputs.
Anonymous VMHost tests for overlap, split, rollback, quota, VM_PROT_NONE, decommit/recommit zero-fill, and process teardown.
TLB/frame reuseQEMU or targeted instrumentation proving unmap/decommit/protect wait for required shootdown before frame reuse on resident CPUs.
User-buffer copyReview and tests showing validation and copy/read occur under one address-space stability guarantee.
MemoryObject sharingQEMU proof that mapping, transfer, writes, unmap, release, and teardown preserve backing lifetime and accounting.
Shared park wordsProof that key derivation uses object identity and stale waiters cannot observe reused virtual addresses or object generations.
DMAQEMU/host proof that stale handles, interrupts, and completions cannot affect new owners or freed buffers.
OOM/quotaHostile exhaustion tests showing controlled errors, rollback, no leaked frames, and no partial cap publication.
SwapFuture proof for encrypted/authenticated slots, stale-slot rejection, pinned/secret/DMA exclusion, and faulting-process termination.

Kani should stay focused on bounded pure primitives in capos-lib. Hardware-dependent behavior such as TLB shootdown, page faults, and DMA requires QEMU or targeted integration evidence.

Phasing

Phase 0: Contract and Inventory

  • Land this proposal and the backlog.
  • Inventory existing memory-state transitions and match them to the classes above.
  • Identify code paths whose errors currently blur validation failure, quota denial, and global pressure.

Phase 1: Host-Testable Ownership Model

  • Extract or mirror sparse anonymous reservation behavior into host-testable logic where useful.
  • Add model tests for reservation split/merge, committed-page bookkeeping, borrowed mapping provenance, and one-ledger-of-record accounting.

Phase 2: Shared-Memory Provenance and Pins

  • Define MemoryObject mapping identity sufficient for SharedParkSpace, service-owned shared buffers, and zero-copy file/network paths.
  • Add explicit object pins or mapping pins only where a locked copy/read is not enough.

Phase 3: Runtime Budgets and OOM Normalization

  • Add spawn-time memory budget policy once the selected milestone sequence reaches resource-profile work.
  • Normalize allocation failure results at capability boundaries.
  • Add typed process exit reporting for memory faults and future OOM.

Phase 4: DMA and Swap Extensions

  • Treat userspace drivers as blocked until DMA owner states, generations, ledgers, and stale-completion proofs exist.
  • Treat swap as blocked until page classes, slot metadata, encryption, and fault lifecycle are explicit.

Open Questions

  1. Should capOS expose a first-class pinned-memory option on VirtualMemory, or only through narrower future caps such as SharedParkSpace, DMAPool, and SecretMemory?
  2. Should MemoryObject gain direct read/write and slice operations before file/network SharedBuffer APIs, or stay mapping-only until a concrete service needs the API?
  3. Which pure address-space interval logic should move into capos-lib for Kani/host testing without dragging in architecture-specific page-table details?
  4. What is the smallest status surface that exposes reservation, commit, resident, pinned, borrowed, DMA, and swapped counts without creating a second enforcement counter?
  5. How much kernel metadata should remain permanently reserved before capOS adds explicit object-funding authority for kernel allocations?

Bottom Line

capOS should treat memory state as capability authority plus ledgered residency. Anonymous reservations, committed frames, shared objects, pins, DMA pages, and swapped pages need distinct contracts. The near-term path is not to implement swap or a pager; it is to make the authority, accounting, TLB/device-observer, cleanup, and proof rules precise enough that future shared-memory and device work cannot accidentally bypass them.

Proposal: Stateful Task and Job Graphs

capOS should eventually have a small durable work-graph substrate: a way to describe, run, inspect, pause, retry, and resume stateful DAG-shaped work. It should serve four related needs without becoming a universal service manager:

  • init-owned service startup, restart, and shutdown orchestration;
  • IX-style package and build graph execution;
  • operator-visible task lists with optional assignee, budget, and run state;
  • notebook-like user stories where prose, commands, outputs, and rerun points are recorded as a narrative over real work.

The important design line is that the graph substrate is not the UI, not a shell, not a package manager, not a notebook runtime, not a service manager, and not a generic capability broker. It is the durable state machine beneath those tools.

Position

Adopt a WorkGraph model, but keep it narrow.

The core object is a versioned graph definition plus run instances:

  • Graph definition: immutable, schema-validated structure: nodes, typed edges, resource hints, authority requirements, retry/cancellation policy, and expected artifacts.
  • Graph run: one execution attempt of a graph definition, with node-run state, leases, logs, checkpoints, artifacts, and audit events.
  • Node run: one executable, manual, or descriptive unit of work inside a run.
  • Artifact: durable output, checkpoint, service export, log, report, or Store/Namespace reference produced by a node.
  • Assignment: optional workload metadata: assignee principal, role, queue, priority, resource profile, deadline, and budget.

The common substrate is a schema/library/event-log pattern, not one global coordinator. Each domain owns its coordinator, executor queue, domain-node schema, validation, and authority:

  • init owns init lifecycle state;
  • BuildCoordinator owns IX build graph execution and job state;
  • an agent runner owns agent task state and workspace leases;
  • a notebook/story service owns narrative projections;
  • an operator task service owns human assignment state.

They may share graph/run/event/artifact shapes, but they do not share one authority-holding scheduler.

Everything above that is a facade:

  • init sees service lifecycle and dependency state;
  • IX sees package inputs, build steps, outputs, and Store commits;
  • an operator sees a DAG-organized todo list with assigned work;
  • a notebook sees cells, prose, rich outputs, and rerun checkpoints;
  • an agent runner sees durable steps, memory/checkpoints, and review gates.

The same persisted run can have more than one projection. A failed package build can appear as an IX build failure, an operator task, a notebook section, and a graph node with logs. The core should not know which projection is being used. Cross-domain views should be read-only projections or explicit links to the owning run, not copied mutable event state.

Why This Belongs in capOS

capOS already has several graph-shaped systems:

  • initConfig.services is an init-owned service graph.
  • ProcessSpawner and ProcessHandle provide process lifecycle edges.
  • capos-service needs readiness, shutdown, drain, background work, resource reservations, and handoff hooks.
  • IX-on-capOS needs dependency-ordered fetch, extract, build, Store commit, and realm publish.
  • agent and shell workflows need durable state when work crosses sessions, reviews, restarts, or context compaction.

Without a shared state model, each subsystem will grow its own partial orchestrator: init will have a service table, IX will have a build executor, agents will have task memory, operators will have ad-hoc todo state, and notebook-like demos will have their own cell/run records. That is duplication in the wrong layer.

With too much sharing, the substrate becomes a god object. The right answer is a shared run-state and dependency model with domain-specific executors.

Prior Art Baseline

Sources checked for this proposal:

The useful lessons are separable.

Airflow: a workflow run has task instances, dependencies, scheduling, retries, timeouts, documentation, and operational state. Airflow’s DAG object intentionally does not care what happens inside a task; it cares about order, retry, timeout, and execution conditions. capOS should copy that separation, but not the Python-file import model, global scheduler database, or operator/plugin surface.

Dagster: asset-first thinking fits capOS better than task-first thinking when the output is durable state. A Store object, package output, Namespace snapshot, boot manifest, built binary, benchmark report, or service export is closer to a Dagster asset than to an Airflow task. Dagster’s ops/graphs remain useful when work is not naturally an asset. capOS should adopt the split: assets are durable products; ops are execution steps; jobs are selections of work to materialize or run. Dagster itself is data-platform-shaped, so it is inspiration, not the implementation target for init.

Jupyter: notebook structure is a user story, not the kernel or init abstraction. Cells, prose, outputs, and metadata are excellent for reviewing a run, explaining why it happened, and rerunning a chosen step. They should be a projection over graph state. Cell order must not become the source of truth for service lifecycle or package builds.

LangGraph: checkpointed graph execution, threads, super-step boundaries, interrupts, and time travel are useful for agent-like and human-in-the-loop work. capOS should borrow the checkpoint boundary idea for resumability, but avoid binding the substrate to LLM message state.

IX: the package graph research is the strongest local precedent. IX’s current executor traverses a dependency graph by node outputs, applies pools, creates output directories, runs shell commands, touches sentinel files, and kills the process group on failure. That proves IX already has a real build graph. It also shows where capOS must stop: graph scheduling must not be fused to subprocess, Unix process groups, filesystem sentinels, hardlinks, symlinks, fetchers, archive extraction, or Store mutation. Those belong behind typed capOS services.

Core Model

The minimal model is:

struct WorkGraph {
  graphId @0 :Text;
  version @1 :UInt64;
  nodes @2 :List(CommonNodeSpec);
  edges @3 :List(EdgeSpec);
  defaults @4 :GraphPolicy;
  domainSchema @5 :UInt64;
}

struct CommonNodeSpec {
  nodeId @0 :Text;
  title @1 :Text;
  inputs @2 :List(ArtifactSelector);
  outputs @3 :List(ArtifactSpec);
  requiredCaps @4 :List(CapRequirement);
  policy @5 :NodePolicy;
  assignmentDefault @6 :Assignment;
}

struct WorkRun {
  runId @0 :Text;
  graphId @1 :Text;
  graphVersion @2 :UInt64;
  state @3 :RunState;
  nodes @4 :List(NodeRun);
  events @5 :List(EventRef);
}

struct NodeRun {
  nodeId @0 :Text;
  state @1 :NodeState;
  attempt @2 :UInt32;
  assignment @3 :Assignment;
  artifacts @4 :List(ArtifactRef);
  checkpoint @5 :CheckpointRef;
}

This is a shape, not final schema. The stable part is the split between definition, run, node-run state, artifacts, and assignments.

Domain node meanings are not a shared NodeKind enum in the common schema. Init may define InitServiceNode; IX may define FetchNode, ExtractNode, BuildNode, StoreCommitNode, and PublishNode; a story projection may define NotebookCellNode or ManualNoteNode. Those domain structs live in domain-owned schemas or config sections and are validated by the domain coordinator that holds the relevant authority. The common graph library may hash, store, and index their association with nodeId, but it must not interpret every domain’s node kinds.

Node State

Node state should be explicit enough for init, package builds, and operators:

  • planned: validated but not yet eligible.
  • blocked: waiting on upstream nodes, an unavailable capability, resource budget, or manual input.
  • runnable: dependencies are satisfied and a worker may lease it.
  • leased: a worker or assignee owns the next attempt for a bounded time.
  • running: execution has begun.
  • waiting: running but blocked on a child process, readiness export, external event, manual approval, timer, or checkpoint resume.
  • succeeded: produced the declared outputs or accepted terminal result.
  • failed: terminal failure under current policy.
  • retryPending: failed attempt will be retried under policy.
  • skipped: intentionally not run because branch/condition policy selected a different path.
  • canceled: canceled by caller, shutdown, superseding run, or authority revocation.
  • paused: durable operator or policy pause.
  • stale: graph version, cap epoch, input artifact, or session binding no longer matches the run’s assumptions.

State transitions should be append-only events. Services may compact state into snapshots, but audit and replay need a durable event boundary.

Edges

A plain DAG edge is not enough. capOS needs typed edge reasons:

  • dependsOnSuccess: downstream may run after upstream succeeds.
  • dependsOnArtifact: downstream consumes a named artifact or Store ref.
  • dependsOnReady: downstream waits on a service readiness export.
  • dependsOnLease: downstream may run only while a lease/session is live.
  • cancelsWith: cancellation propagates across the edge.
  • shutdownBefore: shutdown order edge, usually reverse of startup.
  • approvalFor: manual approval gates a node or subgraph.
  • observes: node only observes another node’s state and does not block it.

The graph remains acyclic within one run. Loops are modeled by new runs, periodic schedules, sensors, retries, or explicit child graphs. This is a critical stop line: hidden cycles create service-manager behavior inside the graph engine.

Workload Assignment

Assignment is optional metadata, not authority:

struct Assignment {
  principal @0 :Text;
  role @1 :Text;
  queue @2 :Text;
  priority @3 :Int32;
  budget @4 :ResourceProfileRef;
  deadline @5 :TimeRef;
  lease @6 :LeaseRef;
}

An assigned operator or worker may receive a lease to attempt a node. The lease does not grant broad system authority. It only grants the ability to claim or update that node-run through the coordinator, and any executable work still needs domain caps supplied by init, a build coordinator, a package worker, an agent runner, or another supervisor.

This makes the same graph usable as:

  • a todo list where a human owns a manual node;
  • a build queue where a worker owns a build step;
  • an init run where PID 1 owns service lifecycle nodes;
  • an agent plan where a worker owns a bounded workspace task.

Init As A Consumer

The user direction is important: this may be used for workload orchestration by init.

The current init path validates initConfig.services, spawns children through ProcessSpawner, records exports, and waits. The first graph use should only observe and structure that existing behavior:

  1. Compile initConfig.services into a graph definition.
  2. Create a volatile boot WorkRun in init memory.
  3. Treat each service as a lifecycle node with the states current init can actually observe: planned, spawned, running/waiting, exited, or failed.
  4. Use typed edges for declared cap imports and manifest-order dependencies.
  5. Persist selected run events later through a Store-backed journal when storage is available.

Init does not need to become a general-purpose Airflow. It needs a durable or inspectable lifecycle table with graph semantics:

  • what services were planned;
  • what caps and exports they depend on;
  • which services are spawned, running, waiting, exited, failed, or blocked under the current primitives;
  • later, which services are restarting, draining, terminating, or ordered for shutdown once those lifecycle primitives exist;
  • what operator-visible work remains.

Restart, drain, termination, readiness-export waiting, and shutdown-order control are later phases. They require primitives that are still future in the service and broker proposals:

  • process termination or kill-tree semantics narrower than raw process-table authority;
  • an explicit readiness/export contract for services;
  • service drain or lifecycle caps for graceful shutdown;
  • restart policy state that is disabled or narrowed during shutdown mode;
  • stale export and stale process-handle behavior for restarted services;
  • audit events that distinguish crash, restart, operator stop, shutdown, timeout, and stale-authority denial.

The generic graph code can be an init-internal library at first. If a separate run-state service appears later, init should delegate only narrow read or update capabilities to it. The separate service must not receive ProcessSpawner, raw process handles, or service-owner caps merely because it stores graph state.

IX Package Graph Consumer

IX should use the same run-state model with a different executor:

  • package templates and descriptors produce graph definitions;
  • fetch/extract/build/store/publish become typed nodes;
  • inputs and outputs are Store or Namespace refs;
  • build logs and output hashes are artifacts;
  • package build workers lease executable nodes;
  • BuildCoordinator owns scheduling, cancellation, queues, and job state;
  • Fetcher, Archive, BuildSandbox, Store, and Namespace hold the real authority.

The graph substrate should not know how to fetch a URL, unpack a tarball, run sh, or commit a Store object. It records that those typed steps exist, which worker owns the attempt, what artifacts were produced, and whether the run can resume or retry.

This preserves the IX research recommendation: use IX’s package corpus and content-addressed model without importing a CPython/POSIX executor boundary. It does not move IX job ownership into a global graph coordinator.

Notebook User Story

Jupyter is best treated as a user story:

  • A notebook cell can map to a note, manualTask, notebookCell, agentStep, or build node.
  • Cell output is an artifact: text, table, image, log excerpt, benchmark summary, Store ref, or Namespace snapshot.
  • Markdown/prose explains why the graph exists and how to interpret its state.
  • Rerun means “create a new run or retry selected node(s) under policy”, not “mutate hidden cell global state”.
  • Checkpoints let a user resume from a durable boundary.

The notebook layer may be CLI text, mdBook, a future web shell, or a rich UI. The core model should not depend on any of those.

Dagster Fit

Dagster is closer than Airflow for durable capOS work when outputs matter. For capOS, a software-defined asset maps naturally to:

  • content-addressed package output;
  • boot image or manifest;
  • Namespace snapshot;
  • benchmark report;
  • generated code artifact;
  • service export that becomes available after readiness;
  • notebook output captured as a reproducible artifact.

Dagster’s ops and graphs map to executable steps. Its jobs map to selections of assets or ops to run. Its sensors and schedules map to run creation policies.

The mismatch is domain and authority. Dagster assumes a data-platform runtime, Python definitions, and external resources. capOS needs capability grants, typed service exports, process handles, sessions, Store/Namespace refs, resource ledgers, and boot-time constraints. The right move is not “run Dagster in init”; it is “use Dagster’s asset/ops/jobs distinction to keep the capOS graph model honest.”

Where To Stop

The main risk is building a god object. The graph substrate must not absorb every adjacent concept.

Stop at these boundaries:

  • No kernel WorkGraph capability. The kernel provides primitive caps: process, memory, IPC, timers, devices, and storage plumbing. Graph state is userspace.
  • No global service discovery. A graph may reference capabilities granted into its runner or produced by its own nodes. It must not look up arbitrary services by global name.
  • No ambient executor. Run-state code cannot execute arbitrary strings, scripts, Cap’n Proto calls, or binaries. A domain executor must hold the exact capabilities needed.
  • No universal plugin ABI. Domain node kinds are typed in domain schemas. Unsupported node kinds fail domain validation rather than becoming untyped byte blobs.
  • No authority laundering. Assignment, tags, labels, notebook cells, and graph edges do not grant authority. Only capabilities do.
  • No UI state in the core. Notebook cells, DAG visual positions, comments, and todo-list grouping are projections or metadata.
  • No package-manager logic in the core. Fetch, archive, build, Store, and Namespace operations stay in IX/build services.
  • No init-specific policy in the core. Restart policy, shutdown order, and process termination are init or supervisor policy. The graph can record and drive them only through explicit runner methods.
  • No hidden loops. Periodic work, sensors, retries, and agent iteration create new attempts or runs. One run’s execution graph stays acyclic.
  • No unbounded event retention by default. Retention and compaction are policy fields, not accidental database growth.

If a feature requires any graph coordinator to hold broad ProcessSpawner, DeviceManager, NetworkManager, Store, Namespace, Fetcher, shell, or session authority for all domains, the design has crossed the line.

Service Split

The target split is:

flowchart TD
    Lib[Shared graph schema and state library]
    Log[Optional Store-backed event log]

    Lib --> InitCoord[init-local lifecycle graph]
    Lib --> BuildCoord[IX BuildCoordinator graph]
    Lib --> TaskCoord[operator task graph]
    Lib --> StoryCoord[notebook/story projection]
    Lib --> AgentCoord[agent-run graph]

    InitCoord --> InitLog[volatile boot run first]
    BuildCoord --> Log
    TaskCoord --> Log
    StoryCoord --> Log
    AgentCoord --> Log

    InitCoord --> InitExec[init lifecycle executor]
    BuildCoord --> BuildExec[build workers]
    TaskCoord --> Human[operator/manual assignee]
    AgentCoord --> AgentExec[agent worker]

    InitExec --> Spawner[ProcessSpawner]
    BuildExec --> Sandbox[BuildSandbox]
    BuildExec --> Store[Store/Namespace]
    AgentExec --> Workspace[Task workspace caps]

Only domain coordinators and executors hold domain authority. The shared code owns no authority beyond manipulating in-memory or Store-backed graph records through whatever narrow capability its caller already holds.

Persistence

Persistence should be incremental:

  • Early init boot runs can be volatile.
  • Build runs should persist event logs, logs, artifacts, and Store refs as soon as Store exists.
  • Operator tasks and notebook stories should persist once user storage exists.
  • Agent runs should persist checkpoints and review state, not raw hidden prompt state.

Store integration should use content-addressed objects for immutable outputs and an append-only or generation-checked log for mutable run state. Namespace snapshots can publish human-facing names for completed runs, package realms, or notebook reports.

Boot must not depend on a separate Store-backed graph service being available. If durable graph logging is unavailable during boot, init falls back to its volatile lifecycle table and emits diagnostics through its existing console/log path. Durable replay and post-boot inspection are degraded in that mode, but service startup must not fail solely because the graph log is unavailable.

Security Rules

  • Node claims are lease-based and expire.
  • Every state update is authorized by the current lease, graph owner, or a delegated control cap.
  • Node output publication validates expected artifact type and size.
  • Retrying a node must not reuse stale capabilities, stale sessions, or stale object epochs.
  • Cancellation must release leases and ask domain executors to drain or kill work through typed lifecycle caps.
  • Audit logs distinguish failure, cancellation, stale authority, denied authority, timeout, manual rejection, and superseded run.
  • Resource budgets are reserved before execution and released on all terminal paths.

Staged Plan

Stage A: Init-Local Run Model

Add a pure capos-config or init-local graph/run-state library that can model the existing initConfig.services startup order, service exports, and child waits. Keep it volatile. Add host tests for graph validation and state transitions.

Stage B: Init Lifecycle Projection

Teach init to expose or print an inspectable service run summary: planned, spawned, running or waiting, exited, and failed. Later summaries can add readiness, restart, drain, termination, and shutdown ordering after those primitives exist. This can remain a text proof before adding any new capability interface.

Stage C: Store-Backed Run Log

Once Store/Namespace is credible, persist run events and compact snapshots. This unlocks post-boot inspection, operator task state, and notebook stories.

Stage D: IX BuildCoordinator

Represent IX package builds as graph runs. Keep execution in BuildCoordinator, BuildSandbox, Fetcher, Archive, Store, and Namespace services.

Stage E: Operator Task Surface

Expose a shell or structured command surface for graph runs: list, inspect, assign, pause, resume, retry, cancel, approve, and show artifacts. This is the DAG-organized todo-list layer.

Stage F: Notebook Story Projection

Generate notebook-like reports from graph runs: prose, cells, commands, logs, artifacts, and checkpoints. Treat notebooks as reproducible run narratives, not as the owner of execution semantics.

Stage G: Agent Workflows

Use graph runs for long-lived agent tasks, review gates, workspace leases, memory checkpoints, and human approval nodes.

Validation

Each stage should have focused checks:

  • pure host tests for state transitions and invalid graph rejection;
  • init QEMU proof that existing service startup still works;
  • later lifecycle-control proof that shutdown dependency order is obeyed, once terminate/drain/shutdown primitives exist;
  • stale lease and stale cap epoch tests;
  • IX differential tests against host-side IX planning where applicable;
  • docs build to refresh topics and catch Mermaid/front matter errors.

Open Questions

  • Should init embed the graph library permanently, or should it eventually delegate run-state persistence to a child service once storage is available?
  • What is the smallest schema for ArtifactRef that covers service exports, Store refs, logs, notebooks, and package outputs without becoming Any?
  • Does domainSchema identify only a domain schema version, or also the domain payload location and content hash for node-specific config?
  • How should schedules and sensors be represented without creating hidden cyclic runs?
  • Which graph events deserve permanent audit retention versus compacted operational state?
  • Should notebook projections use Jupyter nbformat directly, or a smaller capOS-native story format that can export to notebooks later?

Recommendation

Build a small stateful graph substrate, but make it a run-state service or library, not a universal orchestrator.

For init, use it to make service lifecycle visible and eventually durable. For IX, use it to track package build graphs while execution remains in build services. For operators, project it as an assigned DAG todo list. For Jupyter, project it as a notebook-style user story. For agents, project it as durable task state with checkpoints and review gates.

The stop line is authority: shared graph code records state, domain coordinators schedule work, and typed domain services execute it.

Proposal: OOM Handling and Swap

How capOS should behave under memory pressure, what “out of memory” means at different boundaries, and how optional swap support fits the capability model.

  • Memory Management documents the current implemented frame, page-table, VirtualMemory, and MemoryObject behavior.
  • Go VirtualMemory Contract defines the near-term distinction between virtual reservation and physical commit that this proposal builds on.
  • Resource Accounting and Quotas defines the ledger vocabulary used for memory-pressure policy.
  • Memory Authority Model defines who may create memory commitments and how authority composes with the budgets this proposal charges against.
  • Scheduler Evolution is the parallel design for CPU-time authority: SchedulingContext budget/period/donation is the CPU-side analogue of the spawn-time memory budgets and per-process commitment ledger this proposal requires, and reclaim/swap-in fault paths must respect bound dispatcher budgets and depletion notifications rather than busy-waiting under pressure.
  • Design Risks Register R4 – Resource accounting is fragmented tracks the cross-proposal gap this design closes for the memory axis and routes related fragmentation (per-service fairness beyond thread weights, unified resource bundles, scratch-bytes/outstanding-calls/in-flight-call quotas) to its owning trackers.

Problem

capOS already has several local out-of-memory paths:

  • boot-time allocation failures that are still fatal,
  • service-facing operations that return a controlled error,
  • rollback paths that free partially allocated state, and
  • hostile-path tests that prove some frame-exhaustion cases.

What the tree does not have yet is a coherent memory-pressure policy. There is no system-wide answer to these questions:

  • When should an allocation fail immediately vs. trigger reclaim?
  • Which memory is reclaimable, swappable, or permanently pinned?
  • What outcome should a process observe when a page fault cannot be satisfied?
  • Who is allowed to decide that another process should die under memory pressure?

Without that policy, the codebase will drift into a mix of local conventions: some paths return Overloaded, some return interface-specific failure text, some remain boot-fatal, and future swap support would have no clear ownership or threat model.

Design Goals

  1. No ambient OOM killer. The kernel must not scan the system for an arbitrary victim and kill it Linux-style.
  2. Explicit accounting. Memory exhaustion must be understood in terms of budgets, commitments, and reclaimability, not just “the allocator returned None.”
  3. Typed failure semantics. Callers must be able to distinguish invalid requests, local budget exhaustion, transient pressure, and fatal page-fault failure.
  4. Fail closed. Memory-pressure code must not corrupt capability state, silently drop dirty data, or leave half-constructed kernel objects behind.
  5. Swap is optional. capOS must work without swap. Swap is a policy and deployment choice, not a baseline requirement.
  6. Security first. Swap must not become a secret-leak side channel or an integrity hole.

Non-Goals

  • Transparent global persistence in the EROS sense.
  • General-purpose overcommit as the default memory model.
  • Swapping kernel metadata, capability rings, CapSet pages, or DMA-pinned memory.
  • A userspace pager dependency in the first swap implementation.

Design Grounding

This proposal deliberately borrows from three existing design directions in the research set:

  • Genode: strict memory accounting and quota donation are the right default because they avoid an ambient OOM killer and make responsibility obvious.
  • seL4: explicit memory authority is preferable to a kernel that can create new backing objects out of thin air when under pressure.
  • EROS / CapROS / Coyotos: do not make implicit persistent backing store the baseline. capOS already chose explicit persistence and should not back into a single-level-store design through swap.

The result is not a copy of any of those systems. capOS keeps explicit capability-granted memory objects and ordinary page tables, but adopts the accounting discipline that makes OOM behavior reviewable.

Core Policy

1. No Overcommit by Default

The default rule is simple: a process may only create anonymous memory if the system can charge that commitment to a real budget.

That means:

  • anonymous VirtualMemory.commit and compatibility VirtualMemory.map consume committed-page budget,
  • anonymous VirtualMemory.reserve consumes virtual address-space quota only and does not promise physical backing,
  • resident pages consume real frame availability when they are instantiated,
  • swap, when enabled, extends commitment capacity only for memory classes that explicitly allow it,
  • and no interface may assume that a later background OOM killer will clean up a bad admission decision.

This follows the same principle as capability authority in general: if a child needs more memory, some parent or broker must have chosen to give it that room.

2. The Kernel Never Picks a Random Victim

When memory is tight, the kernel may:

  • reclaim kernel-known clean caches,
  • free resources from already-dead processes,
  • swap out eligible anonymous pages,
  • reject a new allocation,
  • or terminate the faulting process when its own page cannot be restored.

What it must not do is kill an unrelated process just because it happens to be large. Cross-process eviction is a supervisor policy decision, not a kernel allocator side effect.

Supervisors remain free to implement their own policy. A shell/session broker or future service manager can decide to stop a child, reduce its budget, or restart it. That decision is explicit and auditable rather than hidden inside the low-level frame allocator.

3. Distinguish Four Memory Outcomes

capOS should treat these as different cases, not variants of one string:

SituationRequired behavior
Invalid request (size=0, misaligned range, quota metadata malformed)Deterministic failed / request validation error
Caller exhausted its allowed budgetDeterministic overloaded or typed outOfMemory result
Global pressure, but reclaim/swap may succeedReclaim first, then retry locally
Faulting page cannot be restored or committedTerminate the faulting process with an explicit OOM exit reason

The important distinction is between synchronous API failure and asynchronous execution failure. If a capability call asks for more memory, it should get an error back. If a process touches a swapped-out page and the system cannot bring it back, there is no capability return value to encode. That must be a process-lifecycle event.

Memory Classes

The reclaim policy depends on what kind of memory is being discussed.

ClassExamplesReclaim policy
Kernel-reserved, unswappablekernel heap, page tables, scheduler/process metadata, cap-table backing, ring scratchNever swap; pressure here is a kernel-capacity problem
User pinned, unswappablecapability ring page, CapSet page, DMA buffers, wired mappings, key material, future mlock-style regionsNever swap; allocation fails if unavailable
Reclaimable clean cacheboot-package cache, future filesystem cache, executable pages that can be reloaded, clean read-only object pagesDrop and refetch rather than swap
Anonymous private swappableordinary heap/stack/anonymous VM pages that opt into swapSwap-eligible if policy allows it
Shared/persistent object pagesMemoryObject, mapped content-addressed store pages, future file-backed shared memoryNot part of phase-1 swap; treat as reclaim/drop or keep resident based on object semantics

Two rules matter here:

  1. Clean cache is not swap. If a page can be reconstructed from a trusted backing object without preserving dirty state, reclaim it by dropping it.
  2. Pinned means pinned. If a page participates in DMA, capability transport, bootstrap identity, or secret handling, treat it as unswappable unless a later design proves otherwise.

DMA pages are a pinned residency class with additional lifecycle constraints: they must be committed before exposure to the device, resident for the entire device-visible lifetime, unswappable while mapped by a DMAPool or IOMMU domain, and scrubbed before release to another owner. Reclaim is not allowed to make progress on a DMA page; pressure must surface as admission failure or device-manager teardown.

Device-written DMA pages are untrusted input until validated by the owning driver or network/storage stack. Pinning and residency prevent reclaim races; they do not make device bytes trustworthy, nor do they grant ordinary MemoryObject authority over the backing frames.

Failure Semantics by Boundary

Capability Calls

For explicit allocation requests, return a structured failure rather than panicking:

  • VirtualMemory.map should return overloaded or a typed OOM result when the request cannot be satisfied.
  • ProcessSpawner.spawn should continue the current direction: bounded parsing, fallible allocation, Overloaded on resource exhaustion.
  • Future interfaces where OOM is a normal domain outcome should prefer a typed union result rather than an exception string.

This is consistent with the existing error-handling proposal: temporary resource exhaustion is not the same thing as malformed input.

Page Faults

Page faults are different. A faulting instruction does not have a natural request/response channel. The policy should therefore be:

  1. attempt reclaim,
  2. attempt swap-out of another eligible page if that creates room,
  3. attempt swap-in or zero-fill for the requested page,
  4. if that still fails, terminate the faulting process with a typed exit reason such as outOfMemory.

That is not an ambient OOM killer. It is the equivalent of delivering an unrecoverable execution fault to the process whose own memory access could not be satisfied.

Boot

Boot remains a special case. If the kernel cannot allocate its own core heap, page tables, or init process, the system cannot proceed. Those failures remain boot-fatal until the architecture moves more kernel object memory under explicit authority.

This proposal does not pretend otherwise. It narrows runtime behavior first and only then pushes on the deeper architectural question of who funds kernel objects.

Budget Model

The long-term model should separate commitment from residency.

  • Reserved virtual pages: address-space ranges the process owns but that do not yet promise physical backing. The Go allocator contract charges these to a separate virtual-reservation quota.
  • Committed pages: memory the system has promised can exist for a process. This is what VirtualMemory.commit, compatibility VirtualMemory.map, and future runtime heap growth should charge.
  • Resident pages: memory currently backed by a physical frame.
  • Pinned pages: resident pages that reclaim and swap may not touch.
  • Swapped pages: committed but non-resident anonymous pages with an encrypted slot on a swap area.

The detailed Go/runtime ABI for splitting virtual reservation from physical commitment is Go VirtualMemory Contract. This proposal’s no-overcommit rule applies at commit time, not at pure reservation time.

At spawn time, a parent or broker should be able to set a memory budget for the child. A minimal future shape is:

struct MemoryBudget {
    committedPages @0 :UInt32;
    pinnedPagesMax @1 :UInt32;
    allowSwap @2 :Bool;
    swapPagesMax @3 :UInt32;
    virtualReservationPagesMax @4 :UInt64;
}

This budget does not require capOS to expose Linux-style cgroups. It is a capability-native admission contract between parent and child.

Swap Support

Position

Swap is useful, but only as a constrained extension of the non-overcommit model.

Swap must not mean:

  • “pretend RAM is infinite,”
  • “the kernel can now kill random processes later,”
  • or “all memory classes are equivalent.”

Instead, swap means: some anonymous pages may be evicted to an encrypted backing area, subject to explicit budgets and page-class rules.

Phase-1 Swap Scope

The first swap implementation should be intentionally narrow:

  • only anonymous private pages created through VirtualMemory,
  • only for mappings that are explicitly swappable,
  • no swapfiles,
  • no filesystem dependency,
  • no userspace pager in the fault path,
  • no swapping of MemoryObject result caps, shared IPC pages, or device/DMA memory.

That scope is small on purpose. Once the first swap implementation exists, expanding eligibility is easy; debugging a too-clever pager in the page-fault path is not.

Backing Store

Phase 1 should use a dedicated swap extent, not a regular file.

Reasons:

  • a file-backed swap path drags in namespace, filesystem, metadata writeback, and deadlock questions too early,
  • a dedicated extent is easier to bound and reason about,
  • and encryption/integrity policy is cleaner when the medium is dedicated to swap slots.

Provisioning should happen through init or a future storage broker that discovers a block extent and passes it into a kernel configuration path.

Compression

Compressed swap caches are a reasonable later optimization, but not the first one to build.

Linux’s zswap design is a useful warning here: it keeps a dynamically sized compressed pool in RAM and evicts from that pool to a backing swap device when the pool reaches its limit. That can improve I/O behavior, but it also creates another reclaim tier with its own sizing, hysteresis, and writeback policy.

capOS should not start there. Phase 1 should write eligible pages directly to the encrypted swap extent. A compressed in-RAM layer can be added later only after the basic swap accounting, eviction, integrity, and observability rules are stable.

Encryption and Integrity

Swap must be encrypted by default.

The crypto policy should match the existing key-management and volume-encryption direction:

  • use a fresh per-boot ephemeral symmetric key that lives only in RAM,
  • never persist that key,
  • invalidate all prior swap contents on boot,
  • authenticate every swapped page so stale-slot replay and random corruption do not silently produce attacker-controlled plaintext.

This has one deliberate consequence: hibernation is out of scope for the first design. Per-boot keys make resume-across-reboot impossible, which is the correct tradeoff for an early capability OS that does not yet have a full trusted suspend/resume story.

Page Eligibility

A mapping should carry an explicit policy bit or enum rather than forcing all anonymous pages into one bucket.

A future VirtualMemory.map shape should move from bare protection flags to options that express residency policy:

enum MemoryResidency {
    normal @0;     # reclaimable, swap if allowed by budget
    pinned @1;     # must stay resident
    secret @2;     # resident only; zero aggressively; never swap
}

This is a better fit than inventing ad hoc “don’t swap this one page” special cases later for crypto heaps, broker secrets, or device buffers.

Fault Path Semantics

On a page fault to a swapped-out page:

  1. the kernel locates the slot metadata,
  2. allocates or frees a frame through reclaim,
  3. reads and authenticates the page,
  4. remaps the page,
  5. resumes the process.

If the slot cannot be restored because no frame can be made available, or the page fails integrity validation, the kernel terminates the faulting process with a distinct exit reason. It must not inject zeros, fabricate stale data, or retry indefinitely.

Why Not a Userspace Pager First

A pure userspace pager is attractive in theory but wrong as the initial step. The current kernel does not have the scheduler, storage, and fault-notification machinery needed to make page-fault RPC safe and bounded under memory pressure.

The first swap design should therefore keep the fault mechanism and slot metadata in kernel while keeping the provisioning and high-level policy outside the kernel where possible.

An external pager can remain a later phase once capOS has:

  • notifications,
  • richer process/thread lifecycle control,
  • deadlock-resistant fault upcalls,
  • and a storage stack that can be driven safely during memory pressure.

Interface and Lifecycle Changes

This proposal implies a few interface changes, even if the exact schema names change later.

Process Exit Reporting

Supervisors need to know whether a child:

  • exited normally,
  • hit a capability exception,
  • faulted on memory corruption,
  • or died because memory pressure could not be satisfied.

That argues for a typed exit record rather than flattening everything into one numeric code.

Spawn-Time Memory Budgets

ProcessSpawner should eventually accept resource limits, including a memory budget, rather than assuming every child competes in one shared frame pool.

Monitoring

A future monitoring/status surface should expose at least:

  • committed pages,
  • resident pages,
  • pinned pages,
  • swapped pages,
  • swap I/O failures,
  • reclaim counts,
  • and per-process OOM termination counts.

Without that, operators will not be able to distinguish “the child leaked heap” from “the kernel pinned too much unswappable state.”

Security Requirements

Memory-pressure code is security-sensitive, not just performance-sensitive.

Required properties:

  • reclaim and swap metadata operations are bounded and fail closed,
  • swap ciphertext is authenticated, not just encrypted,
  • freed swap slots cannot be read by another process,
  • secret/pinned mappings never spill to swap,
  • swap enable/disable transitions do not expose stale plaintext,
  • and pressure paths avoid allocation where possible.

The last point matters because allocating heap memory while handling OOM is how systems spiral into recursive failure and panic surfaces.

Relationship to Existing Proposals

  • Error Handling: resource exhaustion should map to overloaded or typed OOM results at explicit call boundaries, not generic panic text.
  • Service Architecture: parents and supervisors should own memory budgets just as they own capability grants.
  • Storage and Naming: swap should use explicit backing extents, not ambient filesystem paths.
  • Volume Encryption / Key Management: swap encryption uses a per-boot ephemeral symmetric key; persistent encryption keys are unnecessary for the first design.

Phases

Phase 0: Normalize Runtime OOM Semantics

  • Remove remaining runtime panic surfaces on untrusted allocation paths.
  • Distinguish boot-fatal OOM from service-facing overloaded.
  • Add typed process-exit reporting for OOM and faulted swap-in.

Phase 1: Budgeted Anonymous Memory

  • Add spawn-time memory budgets.
  • Charge anonymous VirtualMemory.commit and compatibility VirtualMemory.map against committed-page budget.
  • Charge anonymous VirtualMemory.reserve against virtual address-space quota.
  • Mark pinned vs. swappable vs. secret mappings explicitly.

Phase 2: Reclaim Without Swap

  • Add clean-cache reclaim and dead-process cleanup accounting.
  • Expose pressure metrics and events.
  • Keep allocation failure deterministic when reclaim cannot help.

Phase 3: Encrypted Kernel-Managed Swap

  • Add dedicated swap extent provisioning.
  • Add encrypted/authenticated page slots with per-boot ephemeral keying.
  • Support swap for anonymous private pages only.
  • Terminate the faulting process cleanly when swap-in cannot succeed.

Phase 4: Optional External Pager

  • Revisit pager upcalls only after notifications, richer lifecycle control, and storage-stack maturity exist.
  • Keep the kernel fault path bounded even if policy moves outward.

Open Questions

  1. Should capOS ever add demand commit on first access after the explicit reserve/commit contract, or should runtime allocators keep making commitment visible through capability calls?
  2. Should executable anonymous pages be swappable in phase 1, or should swap be limited to writable anonymous pages until code-loading semantics mature?
  3. When MemoryObject grows richer sharing semantics, should some subclasses be reclaimable-from-backing rather than unswappable?
  4. Does a future secret mapping need stronger guarantees than “never swap,” such as forced zero-on-fork, no-core-dump, and cache-flush hooks?
  5. How much kernel memory should remain permanently reserved before the system starts admitting user commitments?

Bottom Line

capOS should treat OOM as an authority and lifecycle problem, not as a last-gap allocator surprise. The default system should use explicit budgets and no overcommit, return typed exhaustion at API boundaries, reserve process death only for unsatisfied execution faults, and add encrypted swap later as a narrow extension for anonymous private pages.

Proposal: Capability-Native System Monitoring

How capOS should expose logs, metrics, health, traces, crash records, and service status without introducing global /proc, ambient log access, or a privileged monitoring daemon that bypasses the capability model.

Problem

The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.

Monitoring is also not harmless. A monitoring service can reveal capability topology, service names, scoped subject references, transport metadata, timing, crash context, request payloads, and security decisions. If capOS imports a Unix-style “read everything under /proc” or “global syslog” model, monitoring becomes an ambient authority escape hatch. If it imports a kernel-programmable tracing model too early, it adds a large privileged execution surface before the basic service graph is stable.

The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.

Current State

Implemented signal sources:

  • Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
  • Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
  • A Phase 1 capability log surface has landed: LogSink/LogReader over a bounded drop-oldest kernel ring (kernel/src/cap/log.rs), with SystemConfig.logLevel drop enforcement at the sink, serial forwarding of accepted records, and scoped sink/reader caps granted at spawn (proof: make run-monitoring-log-smoke). Metrics, status, health, traces, crash records, the narrow kernel stats caps, and persistent retention remain future phases.
  • Runtime panics can use an emergency console path, then exit with a fixed code.
  • Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
  • The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
  • ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
  • capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
  • The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
  • SystemConfig.logLevel exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.
  • An AuditLog capability exists in the schema and kernel (kernel/src/cap/audit_log.rs), used by AuthorityBroker to record auth, setup, session, broker, and shell-launch events. Currently writes to serial via kprintln!; no ring-buffer reader cap or persistent retention yet.
  • A HardwareAuditLog capability with a bounded volatile ring buffer and drain/snapshot readers exists for DMA/MMIO/Interrupt cap lifecycle events (kernel/src/cap/hardware_audit.rs), including sequence numbers and dropped-record counts. A userspace hardware-audit-service drains it into a Store/Namespace-backed hash-chained segment ring and exposes scoped HardwareAuditReader snapshots; the current backing StoreCap is RAM-backed, so post-reboot retention is still a storage-backend concern.
  • hardware_release_log module (kernel/src/cap/hardware_release_log.rs) emits DMA pool, DMA buffer, DeviceMmio, and Interrupt release outcomes to serial; no reader cap or retention yet.

That means the system has useful raw signals and partial audit infrastructure but lacks a unified capability-shaped monitoring architecture with log routing, metrics export, and reader caps for most signal classes.

Design Principles

  1. Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
  2. No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
  3. Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
  4. Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
  5. Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
  6. Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, and scoped transport identifiers only when authorized. Capturing method payloads needs a stronger cap because payloads may contain secrets.
  7. Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
  8. Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
  9. Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
  10. Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.

Signal Taxonomy

Logs

Human-oriented diagnostic records:

  • severity, component, service name, pid, optional subject/service reference, monotonic timestamp, message text;
  • rate-limited at producer and log service boundaries;
  • suitable for serial forwarding, ring-buffer retention, and later storage;
  • not a source of truth for security decisions.

Metrics

Low-cardinality numeric state:

  • per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
  • scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
  • resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
  • heap/frame allocator pressure;
  • later device, network, storage, and CPU-time counters.

Metric shape is fixed to three forms:

  • Counter — monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
  • Gaugei64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
  • Histogram — fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.

Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.

Events

Discrete lifecycle facts:

  • process spawned, started, exited, waited, killed, or failed to load;
  • service declared healthy, unhealthy, restarting, quiescing, or upgraded;
  • endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
  • resource quota rejection;
  • device reset, interrupt storm, link up/down, block I/O error once devices exist.

Events are useful for supervisors and status views. They may also feed logs.

Traces

Bounded high-detail capture for debugging:

  • SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
  • optional capnp payload capture only with explicit authority;
  • offline schema-aware viewer for reproducing and explaining a failure;
  • short retention by default.

This is the Ring as Black Box milestone from docs/tasks/README.md, not full replay.

Health

Declared service state:

  • ready, starting, degraded, draining, failed, stopped;
  • last successful health check and last failure reason;
  • dependency health summaries;
  • supervisor-owned restart intent and backoff state.

Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.

Crash Records

Panic, exception, and fatal userspace runtime records:

  • boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
  • bounded, redacted, and readable through a crash/debug capability;
  • serial fallback remains mandatory when no reader exists.

Audit

Security and policy records:

  • session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
  • no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
  • query access is scoped by session, service subtree, or operator role.

ITU-T X.700 Series Alignment

The ITU-T X.700 Systems Management framework (OSI management) predates modern observability stacks by two decades but still offers a cleaner decomposition than ad-hoc log/metric/trace categorization. capOS is not implementing CMIS/CMIP (X.710/X.711 assume ASN.1 BER over an OSI stack capOS will never speak); the value is the signal taxonomy and field model, not the transport.

capOS signal classClosest ITU-TWhat we take from it
LogsX.735 Log control functionLog record identity (moRef analog = component+pid+service_ref), severity mapping, scoped reader model.
MetricsX.739 Metric objects and attributesFixed metric shapes (counter / gauge / histogram) as opposed to open-ended label streams.
EventsX.734 Event report management functionDiscriminator-driven filtering, event-type taxonomy, producer/consumer separation.
Alarms (events)X.733 Alarm reporting functionPerceived severity (cleared/indeterminate/warning/minor/major/critical), probable cause, specific problem, trend indication, proposed repair action.
HealthX.731 State management functionOperational / administrative / usage state model (enabled/disabled, unlocked/locked, idle/active/busy) feeding HealthState.
AuditX.740 Security audit trail functionAudit record field model: event type, time, initiator, target, outcome, evidence chain.
Crash recordsX.733 + X.736 Security alarm reporting functionStructured cause + severity for fatal/integrity events; security-relevant crashes flow through both the crash cap and the audit cap.

FCAPS coverage. X.700/X.701 defines the five management functional areas: Fault, Configuration, Accounting, Performance, Security. This proposal covers Fault (crash records, alarms), Performance (metrics), and Security (audit). Configuration and Accounting are deliberately out of scope here:

  • Configuration management (X.700 “C”) — versioned, signed configuration deltas applied to running services. Partially covered by cloud-metadata-proposal.md (ManifestDelta) but capOS has no general configuration-management proposal yet. Candidate for a separate proposal once the manifest-executor and live-upgrade work stabilize.
  • Accounting management (X.700 “A”) — per-principal, per-session, per-service resource-usage ledgers with retention and export. The kernel’s ResourceLedger is the lowest layer; aggregation, persistence, and audit-grade usage records are undesigned. Candidate for a separate proposal; would compose with the audit cap and the user-identity session model.

Updated Field Mappings

LogRecord maps roughly onto X.735 logRecord:

X.735 logRecord                    capOS LogRecord
---------------                    ---------------
logRecordId                        (cursor + pid + tick)
managedObjectClass                 component + service name
managedObjectInstance              pid + service_ref
eventType                          Severity (lossy; add explicit
                                    eventType once alarm/security
                                    records share the pipe)
eventTime                          tick (monotonic; wall-clock when
                                    available)
notificationIdentifier             not modeled; add when events need
                                    correlation IDs

Audit records should adopt X.740 fields explicitly. Proposed schema extension once the audit service ships:

enum AuditEventType {
  # X.740 §6.1 event categories, pruned to what capOS actually records.
  authentication    @0;   # login, logout, auth failure
  accessControl     @1;   # grant, deny, revoke, transfer
  policyDecision    @2;   # broker decision with plan + constraints
  objectLifecycle   @3;   # capability create/destroy, object reap
  securityAlarm     @4;   # X.736-shaped: integrity/confidentiality violation
  serviceControl    @5;   # restart, upgrade, quiesce, resume
  administrative    @6;   # manifest update, role change
}

enum AuditOutcome {
  success           @0;
  failure           @1;
  denied            @2;
  pending           @3;   # multi-party approval outstanding
}

struct AuditRecord {
  tick        @0 :UInt64;
  eventType   @1 :AuditEventType;
  initiator   @2 :Data;        # opaque principal/session ID
  target      @3 :Text;        # interface + service identity
  outcome     @4 :AuditOutcome;
  reason      @5 :Text;
  evidence    @6 :Data;        # opaque, bounded; no secrets
}

Alarms (X.733) are a structured subset of Events, not a new signal class. The ServiceStatus / Health path emits alarms when degraded, failed, or security-relevant thresholds trip:

enum PerceivedSeverity {
  cleared        @0;
  indeterminate  @1;
  warning        @2;
  minor          @3;
  major          @4;
  critical       @5;
}

enum ProbableCause {
  # X.733 Annex A lists ~50 values; capOS starts with the handful that
  # match known failure modes and extends as needed.
  communicationsError    @0;
  integrityViolation     @1;
  operationalViolation   @2;
  softwareError          @3;
  underlyingResourceUnavailable @4;
  qualityOfServiceAlarm  @5;
  securityAlarmIntegrity @6;
  securityAlarmAccess    @7;
}

struct Alarm {
  tick            @0 :UInt64;
  managedObject   @1 :Text;           # service or cap identity
  severity        @2 :PerceivedSeverity;
  probableCause   @3 :ProbableCause;
  specificProblem @4 :Text;
  trend           @5 :AlarmTrend;
  proposedRepair  @6 :Text;
}

The taxonomy buys two things the Unix-style “syslog + Prometheus + Jaeger” tower does not: (1) alarms as a first-class signal with a defined severity lattice and probable-cause field, which is how operators actually triage, and (2) audit as a distinct record type with fixed fields rather than a convention-layer over free-form log messages.

ITU-T references

  • ITU-T Rec. X.700 (09/92) — Management framework
  • ITU-T Rec. X.701 (08/97) — Systems management overview
  • ITU-T Rec. X.733 (02/92) — Alarm reporting function
  • ITU-T Rec. X.734 (09/92) — Event report management function
  • ITU-T Rec. X.735 (09/92) — Log control function
  • ITU-T Rec. X.736 (01/92) — Security alarm reporting function
  • ITU-T Rec. X.740 (01/92) — Security audit trail function
  • ITU-T Rec. X.731 (01/92) — State management function
  • ITU-T Rec. X.739 (11/93) — Metric objects and attributes

Proposed Architecture

flowchart TD
    Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
    Kernel --> Serial[Emergency serial]

    Init[init / root supervisor] --> LogSvc[Log service]
    Init --> MetricsSvc[Metrics service]
    Init --> StatusSvc[Status service]
    Init --> AuditSvc[Audit log]
    Init --> TraceSvc[Trace capture service]

    KD --> MetricsSvc
    KD --> StatusSvc
    KD --> TraceSvc

    Services[Services and drivers] --> LogSink[Scoped LogSink caps]
    Services --> Health[Health caps]
    Services --> AuditWriter[Scoped AuditWriter caps]

    LogSink --> LogSvc
    Health --> StatusSvc
    AuditWriter --> AuditSvc

    Broker[AuthorityBroker] --> Readers[Scoped readers]
    Readers --> Shell[Shell / agent / operator tools]

    StatusSvc --> Readers
    LogSvc --> Readers
    MetricsSvc --> Readers
    TraceSvc --> Readers
    AuditSvc --> Readers

The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.

Core Interfaces

These are conceptual interfaces. They should not be added to schema/capos.capnp until the current manifest-executor work is complete and a specific implementation slice needs them.

enum Severity {
  debug @0;
  info @1;
  warn @2;
  error @3;
  critical @4;
}

struct LogRecord {
  tick @0 :UInt64;
  severity @1 :Severity;
  component @2 :Text;
  pid @3 :UInt32;
  subjectRef @4 :Data;   # privacy-preserving subject/session correlation
  sessionRef @5 :Data;   # optional scoped session correlation
  serviceRef @6 :Data;   # optional authorized service/component correlation
  transportId @7 :Data;  # debug-only ring/endpoint metadata, not identity
  message @8 :Text;
}

struct LogFilter {
  minSeverity @0 :Severity;
  componentPrefix @1 :Text;
  pid @2 :UInt32;
  includeDebug @3 :Bool;
}

interface LogSink {
  write @0 (record :LogRecord) -> ();
}

interface LogReader {
  read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
      -> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}

LogSink is what ordinary services receive. LogReader is what shells, operators, supervisors, and diagnostic tools receive. A scoped reader can filter to one service subtree or session before the caller ever sees the record.

Monitoring terminology should use snake-case names in prose and map them to schema-style fields only at the Cap’n Proto boundary:

subject_ref / session_ref:
  privacy-preserving identity or session correlation fields.

service_ref:
  service instance or component correlation where the reader is authorized.

transport_id:
  debug-only ring, endpoint, SQE/CQE, or waiter metadata; never subject
  identity.

Legacy endpoint badge terminology must not leak into user-facing monitoring identity. If a low-level transport path still stores a badge-shaped selector, monitoring may expose it only as debug transport_id under an appropriate diagnostic cap, not as subject_ref, session_ref, or service_ref.

struct ProcessStatus {
  pid @0 :UInt32;
  serviceName @1 :Text;
  state @2 :Text;
  capSlotsUsed @3 :UInt32;
  capSlotsMax @4 :UInt32;
  outstandingCalls @5 :UInt32;
  cqReady @6 :UInt32;
  cqOverflow @7 :UInt64;
  lastExitCode @8 :Int64;
}

struct ServiceStatus {
  name @0 :Text;
  health @1 :Text;
  pid @2 :UInt32;
  restartCount @3 :UInt32;
  lastError @4 :Text;
}

interface SystemStatus {
  listProcesses @0 () -> (processes :List(ProcessStatus));
  listServices @1 () -> (services :List(ServiceStatus));
  service @2 (name :Text) -> (status :ServiceStatus);
}

SystemStatus is read-only. A broad instance can see the system; wrappers can expose one service, one supervision subtree, or one session.

enum MetricKind {
  counter @0;
  gauge @1;
  histogram @2;
}

struct MetricSample {
  # Well-known fixed-name slot for counters and gauges the aggregator
  # understands without additional schema lookup. Use this for stable
  # kernel counters to keep the hot path allocation-free.
  name @0 :Text;
  kind @1 :MetricKind;
  value @2 :Int64;
  tick @3 :UInt64;

  # Producer-scoped typed envelope for richer samples (histograms,
  # top-k tables, per-subsystem structs). Payload is a capnp message;
  # the schema is identified by `schemaHash` (capnp node id) and keyed
  # per producer. Opaque to the generic reader; a schema-aware viewer
  # decodes it.
  producerId @4 :UInt64;
  schemaHash @5 :UInt64;
  payload    @6 :Data;
}

struct MetricFilter {
  prefix @0 :Text;
  service @1 :Text;
}

interface MetricsReader {
  snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
      -> (samples :List(MetricSample), truncated :Bool);
}

Early metrics should be fixed-name counters and gauges in the name/value slot. Avoid arbitrary labels until there is a concrete memory and cardinality policy. The producer-scoped envelope exists so richer samples do not force the generic reader to learn a string-key taxonomy — if a producer needs per-queue or per-device detail, it ships a typed capnp struct keyed by schemaHash rather than synthesizing name strings.

struct TraceSelector {
  pid @0 :UInt32;
  serviceName @1 :Text;
  errorCode @2 :Int32;
  includePayloadBytes @3 :Bool;
}

struct TraceRecord {
  tick @0 :UInt64;
  pid @1 :UInt32;
  opcode @2 :UInt16;
  capId @3 :UInt32;
  methodId @4 :UInt16;
  interfaceId @5 :UInt64;
  result @6 :Int32;
  flags @7 :UInt16;
  payload @8 :Data;
}

interface TraceCapture {
  arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
      -> (captureId :UInt64);
  drain @1 (captureId :UInt64, maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}

Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.

enum HealthState {
  starting @0;
  ready @1;
  degraded @2;
  draining @3;
  failed @4;
  stopped @5;
}

interface Health {
  check @0 () -> (state :HealthState, reason :Text);
}

interface ServiceSupervisor {
  status @0 () -> (status :ServiceStatus);
  restart @1 () -> ();
}

ServiceSupervisor is authority-changing. Normal monitoring readers should not receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one operator action.

Kernel Diagnostics Contract

The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:

  • process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
  • ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
  • resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
  • scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
  • crash record: last panic/fault metadata and early boot stage.

The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.

Implementation shape:

  • Maintain fixed-size counters in existing kernel structures where the source event already occurs.
  • Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
  • Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
    • SchedStats — tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
    • FrameStats — free/used frame counts, frame-grant pages, allocator pressure histogram.
    • RingStats — per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
    • CapTableStats — per-process slot occupancy, generation-rollover counts, insertion/remove rates.
    • EndpointStats — per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
    • CrashSnapshot — last panic/fault metadata, early boot stage, recent SQE context when safe.
  • Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
  • ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
  • Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
  • Keep panic/fault serial writes independent of any diagnostics service.

Promotion from the measure feature: the benchmark counters in kernel/src/measure.rs graduate to always-on in RingStats / SchedStats when the per-event cost is provably a single relaxed atomic add. Cycle-counter instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure") because it is serializing and benchmark-only. The promotion threshold keeps normal dispatch builds free of instrumentation cost without forcing monitoring into a second build configuration.

Logging Model

Early boot has only serial. After init starts the log service, ordinary services should receive LogSink rather than raw Console unless they need emergency console access.

Recommended path:

  1. Kernel serial remains for boot, panic, and fault records.
  2. Init starts a userspace log service and passes scoped LogSink caps to children.
  3. The log service forwards selected records to Console until persistent storage exists.
  4. SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
  5. Session and operator tools receive scoped LogReader caps from a broker.

Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.

Metrics and Status

Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.

Initial status fields should cover:

  • pid, service name, binary name, process state, exit code;
  • process handle wait state;
  • supervisor health and restart policy once supervision exists;
  • cap table occupancy and outstanding call count;
  • ring CQ availability and overflow;
  • endpoint queue occupancy where authorized.

Initial metrics should cover:

  • ring dispatches, SQEs processed, per-op counts, transport error counts;
  • cap-enter wait count, timeout count, wake count;
  • scheduler context switches and direct IPC handoffs;
  • frame free/used counts, frame grant pages, VM mapped pages;
  • log records accepted, suppressed, dropped, and forwarded;
  • trace records captured and dropped.

Timer/nohz/realtime metrics should be owned by monitoring rather than left as one-off debug prints once those features exist:

  • scheduler_tick_count{cpu};
  • ticks_suppressed{cpu,mode};
  • nohz_enter_count{cpu,kind};
  • nohz_exit_count{cpu,reason};
  • oneshot_deadline_miss_count;
  • sqpoll_busy_ns;
  • sqpoll_sleep_count;
  • deadline_expired_count;
  • budget_exhausted_count;
  • realtime_overrun_count;
  • donation_depth_max;
  • housekeeping_offload_count.

These are correctness signals for nohz/realtime admission, not only performance counters. A scoped monitoring reader may observe them only under the same authority rules as other scheduler and service telemetry.

Current state alignment. Scheduler Phase D WFQ and Phase E SchedulingContext have landed per docs/changelog.md (Phase D closed 2026-05-10), and Phase F is delivering one-SQ-consumer, nohz telemetry counters, and housekeeping/deferred-work placement; automatic nohz activation’s first increment is now closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md (per the scheduler bullet in docs/tasks/README.md), and SQPOLL-driven auto-nohz activation is also closed via docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, with the SQPOLL ring-state re-check as the decisive rollback gate; the CpuIsolationLease preflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window with fail-closed rollback; timeout-based auto-revoke and generic full-nohz for ordinary budgeted compute leases are also landed. The nohz/realtime counter families above describe the target monitoring surface for those signals — the kernel may already maintain some counters internally as Phase F lands them, but until the narrow read-only stats caps (SchedStats / RingStats and friends) and a userspace metrics service ship, those counters are scheduler-internal facts and not yet exported through a monitoring cap. The metrics service is not authority to trigger nohz mode changes; it observes counters under the authority rules in this proposal.

Metric labels such as mode, kind, and reason must be fixed enums, not free-form strings:

#![allow(unused)]
fn main() {
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

enum TickSuppressionMode {
    Idle,
    SqpollNoHz,
    AutoNoHz,
    RealtimeIsland,
}

enum NoHzExitReason {
    TimerDeadline,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}
}

Future metric schemas should add enum variants through reviewed ABI changes rather than accepting arbitrary labels.

Avoid per-method, per-cap-id, per-transport-id, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.

Benchmark outputs follow the same cardinality rule. A completed, validated benchmark run may import a small summary such as latest median, p95, sample count, and pass/fail status for a named benchmark profile. Raw samples, transcripts, host/QEMU configuration, correctness evidence, and comparison tables are benchmark artifacts, not always-on monitoring metrics. Running a profile that needs measure, debug taps, broad status readers, or other diagnostic authority should emit an audit record because the act of measuring can expose timing and topology data that ordinary services should not see.

Ring as Black Box

The first concrete monitoring milestone is the completed docs/tasks/README.md Ring-as-Black-Box item. The visible milestone was achieved by commit da5f5e9 at 2026-04-24 03:13 UTC:

  • define a bounded capture format for SQE/CQE records;
  • export capture through a QEMU-only debug path;
  • build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
  • add one failing-call smoke whose captured log can be inspected offline.

This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.

This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.

Capture path cost. The capture cap (working name RingTap) is feature-gated (cfg(feature = "debug_tap") analogous to measure). Every armed tap imposes a serializing fan-out on dispatch; keeping it out of the default kernel feature set prevents always-on cost. Arming a tap is itself an auditable event — the tapped process and the audit log observe it — and tap grants respect move-semantics so a tap cannot be silently cloned past its intended holder. Payload-capturing taps require a separately leased cap distinct from metadata-only capture because payloads may contain secrets.

Health and Supervision

Health and restart policy should live with supervisors, not in a central kernel daemon.

Each supervisor owns:

  • a narrowed ProcessSpawner;
  • child ProcessHandle caps;
  • the cap bundle needed to restart its subtree;
  • optional Health caps exported by children;
  • a LogSink and AuditWriter for its own decisions.

Status services aggregate supervisor-reported health. They should distinguish:

  • no process exists;
  • process exists but never reported ready;
  • process is alive and ready;
  • process is alive but degraded;
  • process exited normally;
  • process failed and supervisor is backing off;
  • process was intentionally stopped or draining.

Restart authority should be a separate ServiceSupervisor cap. A read-only SystemStatus cap must not be able to restart anything.

Audit Integration

Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.

Audit producers:

  • AuthorityBroker for policy decisions and leased grants;
  • supervisors for restarts and service lifecycle actions;
  • session manager for session creation and logout;
  • kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
  • recovery tools for repair actions.

Audit readers are scoped:

  • a user can read records for its own session;
  • an operator can read a service subtree;
  • a recovery or security role can read broader streams after policy approval.

Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.

Security and Backpressure

Monitoring must not become the easiest denial-of-service path.

Required controls:

  • Per-process log token buckets, matching the Security Verification Track S.9 diagnostic aggregation design.
  • Suppression summaries for repeated invalid submissions.
  • Fixed-size ring buffers with explicit dropped counts.
  • Maximum record size for logs, events, crash records, and traces.
  • Bounded formatting outside interrupt context.
  • No heap allocation in timer or panic paths.
  • No unbounded metric label creation from user-controlled strings.
  • Payload tracing disabled by default.
  • Redaction rules at producer boundaries and at reader wrappers.
  • Capability-scoped readers; no unauthenticated “debug all” endpoint.

When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.

Relationship to Existing Proposals

  • Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
  • Shell: the native and agent shell should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
  • User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
  • Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
  • Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
  • Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths. Each new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports) must be carried into the docs/proposals/security-and-verification-proposal.md Track S.7 trust-boundary inventory before downstream services rely on it; the inventory is the canonical record that a boundary has been reviewed, not this proposal.
  • Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
  • System Performance Benchmarks: benchmark runners may read scoped status and metrics before and after a run, but benchmark artifacts and OS-comparison reports live outside the always-on metrics service. Only low-cardinality, validated summaries should be imported into monitoring.

Implementation Plan

  1. Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.

  2. Ring as Black Box. Completed by commit da5f5e9 at 2026-04-24 03:13 UTC: bounded SQE/CQE capture, host-side decoding, and one failing-call smoke form the first useful monitoring artifact.

  3. Userspace log service. (Phase 1 landed.) LogSink/LogReader schemas plus LogRecord/LogFilter exist (additive ordinals, reusing LogLevel as the severity type). A bounded drop-oldest kernel ring (kernel/src/cap/log.rs) backs both caps: the sink stamps the monotonic tick, drops records below the boot-seeded SystemConfig.logLevel threshold (accepted = false), bounds record size, and forwards accepted records to serial; the reader returns cursor/filtered records with nextCursor and a dropped overflow count. Scoped LogSink/LogReader caps are granted to children at spawn; make run-monitoring-log-smoke proves the drop, the read-back, and the reader-side minLevel filter. Remaining: the wider Severity (with critical), the correlation fields (subjectRef/sessionRef/serviceRef/transportId), per-process token buckets / suppression summaries, and persistent retention.

  4. Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step — it belongs with process-management authority, not monitoring.

  5. Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.

  6. Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.

  7. Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.

  8. Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.

  9. Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.

Non-Goals

  • No global /proc or /sys equivalent with ambient read access.
  • No kernel-resident dashboard, alert manager, text search, or policy engine.
  • No programmable kernel tracing language in the first monitoring design.
  • No promise of durable log retention before storage exists.
  • No default payload tracing.
  • No service restart authority bundled into ordinary read-only status caps.
  • No network export path until networking and policy can constrain it.

Open Questions

  • Should KernelDiagnostics expose snapshots only, or also a bounded event cursor?
  • What is the minimum timestamp model before wall-clock time exists?
  • Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
  • How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
  • Which crash fields are safe to expose to non-recovery sessions?
  • What retention policy is acceptable before persistent storage?
  • Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
  • Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?

Cross-References

This proposal is reader-facing target design. The canonical trackers for the observability-adjacent risks and verification obligations it depends on live elsewhere:

  • docs/proposals/security-and-verification-proposal.md Track S.7 – Stage-6-aware refresh owns the trust-boundary inventory that any new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports, payload-capturing taps) must be carried into before downstream services rely on it. Track S.7 already lists the active scheduler-evolution surfaces (Phase D WFQ, Phase E SchedulingContext, Phase F one-SQ-consumer and nohz telemetry) plus the WASI host-adapter Phase W.4 entropy/argv boundary as inventory items to carry forward.
  • docs/design-risks-register.md R12 – Verification coverage is partial, not full proof is the canonical caveat for any monitoring claim that could be read as a verified property. Bounded Kani/Loom/Miri/proptest coverage plus the panic-surface inventory are not whole-system functional refinement; monitoring records and audit entries describing security- relevant decisions must respect that distinction in their wording.
  • docs/design-risks-register.md Q9 – CPU accounting and scheduling contexts is the canonical answer for the CPU-time, weighted-vruntime, and SchedulingContext budget/donation/depletion semantics that monitoring metrics should observe rather than redefine. The nohz/realtime counter families in this proposal target the same surfaces; cross-service donation policy, full nohz activation, isolation leases, and fairness across principals remain proposal-shaped per Q9 and are tracked in docs/proposals/scheduler-evolution-proposal.md and docs/backlog/scheduler-evolution.md.

Adjacent risk-register entries observed by monitoring but owned elsewhere include R4 (Resource accounting fragmentation, source of the ResourceLedger metrics substrate), R8 (Networking lives inside the kernel TCB, gating exporter-service placement), and R11 (Pre-auth and post-auth share a shell process, gating who may receive scoped LogReader / SystemStatus / AuditLog readers).

Proposal: Time and Clock Capability Authority

How capOS should expose wall-clock time, clock discipline, and trusted timestamps without introducing ambient real time, allowing a service to forge timestamps, or creating a covert timing channel between processes.

Problem

Today capOS has one time-related capability: Timer, which exposes now() -> (monotonicNs, tick) and sleep(). The monotonic counter is useful for scheduling and rate limiting, but it carries no provenance, has no relationship to wall-clock time, and is not a trusted source for security decisions.

Several upcoming capability surfaces implicitly need trustworthy wall-clock time:

  • TLS certificate validation (certificates-and-tls-proposal.md) must compare notBefore/notAfter fields against a wall-clock source whose provenance the validator trusts.
  • OIDC token expiry (oidc-and-oauth2-proposal.md) must compare exp and iat claims against wall-clock time.
  • Audit records must carry a timestamp that a security reviewer can trust. A service must not be able to backdate its own audit entries.
  • WASI clock_time_get(CLOCKID_REALTIME) currently returns NOSYS. Any WASM payload that needs the current time, including TLS libraries compiled to WASM, hits this gap.
  • Cloud metadata bootstrap (cloud-metadata-proposal.md) supplies instance-launch time; any cloud image verification that checks a timestamp needs a root-of-trust for time.

None of these can be satisfied by handing callers a monotonic tick offset and asking them to add a boot-time offset they supply themselves: the capability model requires that time provenance be part of the granted interface, not an ambient convention.

User Stories

  • A TLS handshake service holds a WallClock cap labeled ntp-synced. It calls wallTime() to get the current UTC time and validates a certificate’s validity window. If the provenance were untrusted, it would refuse validation or surface a warning.
  • An audit service receives timestamped records from AuthorityBroker and session services. It does not trust the caller-supplied timestamp; it reads its own granted WallClock and stamps records at ingestion time.
  • A WASM payload loaded by capos-wasm calls clock_time_get(CLOCKID_REALTIME). The WASI host adapter reads the WallClock cap that was granted to the wasm-host process at launch, returns the wall-clock seconds, and sets the provenance flag in the host’s internal WASI state so that WASM callers cannot assume sync quality beyond what was granted.
  • An init operator grants clockDiscipline to a userspace NTP service. The NTP service calls step() or slew() to advance or discipline the system clock. No other process may call these methods.
  • A process running in an environment with no NTP synchronization receives a WallClock labeled measured-boot-monotonic. It can compute elapsed time accurately but knows that absolute wall time is only as accurate as the firmware real-time clock at boot.

Design

Existing Timer Interface

interface Timer {
    now @0 () -> (monotonicNs :UInt64, tick :UInt64);
    sleep @1 (durationNs :UInt64) -> ();
}

Timer remains the canonical interface for deadlines, sleep, and monotonic elapsed time. It does not change. WallClock is a separate, orthogonal capability whose provenance tracks the quality of the absolute time signal.

WallClock Interface

enum ClockProvenance {
    # Zero-value is fail-closed: an unset, default, or unrecognized provenance
    # decodes as untrusted, so a caller that skips the check never treats an
    # unknown source as trusted. No reliable source known; callers must fail
    # closed on sensitive decisions.
    untrusted        @0;
    # Synchronized to a trusted NTP source within the last sync window.
    ntpSynced        @1;
    # PTP hardware clock; higher precision, same trust level as ntpSynced.
    ptpSynced        @2;
    # Firmware RTC at boot; advanced monotonically since; no network sync.
    measuredBootMonotonic @3;
    # Manual set by an operator with clockDiscipline authority.
    manualSet        @4;
}

interface WallClock {
    # Returns UTC seconds since Unix epoch, nanoseconds within the second,
    # the current monotonic offset from the same Timer.now() base, and
    # the provenance label for this clock source.
    wallTime @0 () -> (
        utcSeconds  :Int64,
        utcNanos    :UInt32,
        monotonicNs :UInt64,
        provenance  :ClockProvenance
    );
}

Key properties:

  • No ambient access. A process must hold a granted WallClock cap to read wall time. Init-owned processes receive it via the manifest bundle; ordinary services receive it only if their supervisor grants it.
  • Provenance is part of the response, not a separate call. A validator that requires ntpSynced can check the provenance field on every read without a separate round-trip.
  • Monotonic offset is included. The returned monotonicNs ties the wall-clock sample to the Timer.now() timeline so callers can compute elapsed time without a second Timer call. The kernel ensures both fields are read from a consistent snapshot within the same tick.
  • Single method. WallClock is read-only and has no state. Its simplicity makes attenuation straightforward: a wrapper that downgrades provenance to untrusted or truncates resolution is trivially composable.

ClockDiscipline Interface

Clock setting and NTP/PTP synchronization require a separate, stronger capability. No userspace process can discipline the clock without holding it.

interface ClockDiscipline {
    # Atomically step the wall-clock by the given signed delta in nanoseconds.
    # Used for large corrections (initial set from RTC, NTP step).
    step @0 (deltaNs :Int64) -> ();

    # Gradually slew the clock toward the target offset, bounded to
    # `maxRateNsPerS` nanoseconds per second.  Used for NTP drift correction.
    slew @1 (offsetNs :Int64, maxRateNsPerS :UInt32) -> ();

    # Declare the current source and its estimated error bound.
    setProvenance @2 (
        provenance    :ClockProvenance,
        errorBoundNs  :UInt64
    ) -> ();

    # Read the current sync state.
    syncState @3 () -> (
        provenance    :ClockProvenance,
        lastSyncMonotonicNs :UInt64,
        lastStepMonotonicNs :UInt64,
        errorBoundNs  :UInt64,
        slewRateNsPerS :Int32
    );
}

ClockDiscipline is init-owned at boot. The manifest may grant it to a dedicated NTP service process. No service other than the designated NTP/PTP daemon should hold this cap.

step() adjusts only the UTC offset, never the monotonic base. Per the prior-art note’s clock-step/leap-second lesson (a monotonic timeline must never jump backwards), a step retargets the wall-clock offset layered on Timer.now(); it does not rewind the monotonic timeline that scheduler deadlines, ring timeouts, and slew() rate-limiting depend on. Large discontinuities use step() (initial set / NTP step), small drift uses slew(), and leap seconds are absorbed by slewing (smear) rather than a backwards step so ordered timestamps never regress. The lastStepMonotonicNs field lets a WallClock consumer detect that a step happened since a cached observation and re-read.

Timezone and Locale Data

Timezone and locale data are not ambient. They are delivered as named entries in a Directory-backed data store (per storage-and-naming-proposal.md). A process that needs timezone conversion receives a scoped read-only Directory cap pointing at the relevant tzdata namespace entry, not an environment variable or a path under a global filesystem.

Rationale: environment variables are not capability-scoped, and a process should not observe the host’s timezone as a side channel. Explicit directory delivery makes timezone data just another granted resource.

Manifest Seeding

The boot manifest may include a seedUtcSeconds field in SystemConfig (or an extension struct). At first kernel tick, the kernel initializes the wall-clock state from this seed with measuredBootMonotonic provenance. If no seed is present, the firmware RTC is read during early boot; if no RTC is available, provenance is untrusted.

After init starts the NTP service and that service disciplines the clock, it calls ClockDiscipline.setProvenance(ntpSynced, ...) to upgrade the provenance label. From that point, all WallClock.wallTime() calls return ntpSynced.

Audit Timestamps

Audit records must carry a server-stamped timestamp, not a caller-supplied one.

The audit service holds a WallClock cap. When it ingests a record from AuthorityBroker, SessionManager, or any other producer, it stamps the record with the time returned by its own WallClock call at ingestion. The producer may supply a monotonic offset for correlation, but the wall-clock stamp is always the audit service’s own read.

Audit record timestamps carry the same ClockProvenance enum value that was returned by WallClock.wallTime() at ingestion time. A security reviewer can verify that audit entries were timestamped with a synchronized source and reject or flag entries timestamped under untrusted.

WASI Integration

capos-wasm Phase W.3+ adds WallClock as a grantable cap in the per-instance CapSet launched by wasm-host. The WASI Preview 1 host function clock_time_get(CLOCKID_REALTIME, ...) reads from the granted WallClock, returns the UTC second/nanosecond pair, and records the provenance in the host state so that the wasm-host audit trail can assert what time quality the WASM instance saw. If no WallClock cap was granted, clock_time_get(REALTIME) returns NOSYS as it does today.

No Cross-Process Skew Side Channel

WallClock exposes only the current time from the kernel’s single wall-clock state. It does not expose skew history, NTP offset measurements, or raw clock-adjustment rates. ClockDiscipline.syncState() is the only path to sync state and is held by at most one NTP service.

A process cannot learn another process’s read pattern from WallClock because there is no shared counter or read-cursor that leaks observer timing. The monotonic offset in the wallTime() response is derived from the same TSC baseline as Timer.now() and does not introduce new covert-channel surface.

Fail-Closed Policy

Services that receive a WallClock cap and make security decisions on its output must treat untrusted provenance as a failure condition, not a degraded-but-functional mode. The recommended pattern:

let (utc, _, _, prov) = wall_clock.wallTime()?;
if prov == ClockProvenance::Untrusted {
    return Err(CapError::ClockProvenanceInsufficient);
}
validate_cert_notafter(utc, cert)?;

Callers that accept measuredBootMonotonic for non-security uses (e.g., log timestamps, cache TTLs) should document the provenance they accept. Callers that accept only ntpSynced or ptpSynced for security decisions should reject all other values.

Phasing

Phase 1 — WallClock Read and Provenance

Status: landed (2026-05-24 09:31 UTC), fixed-boot-base variant. The WallClock read cap and ClockProvenance enum exist end-to-end: schema + generated bindings, kernel/src/cap/wall_clock.rs, the capos-config wall_clock kernel source, the capos-rt WallClockClient, and a shell date command proven by make run-shell. The follow-up bullets below (manifest seed, stateful WallClockState, init audit/TLS grants, WASM realtime clock) remain Phase 1.x / Phase 2.

  • Add WallClock interface and ClockProvenance enum to schema/capos.capnp. Landed.
  • Landed (fixed-boot-base variant): the kernel cap derives UTC from a fixed compile-time base over the existing monotonic timebase and reports the fail-closed untrusted provenance (the ClockProvenance zero value). It is not read from firmware RTC and is not network-synchronized, so untrusted is the honest label; this also proves the zero-value fail-closed enum semantics end-to-end. A stateful WallClockState (UTC offset, provenance, last-sync tick, error bound) and a manifest seedUtcSeconds seed with measuredBootMonotonic provenance are deferred to Phase 1.x / Phase 2 where ClockDiscipline can upgrade the label.
  • cap/wall_clock.rs implements the cap; capos-rt adds a typed client. Landed (WallClockClient, with a fail-closed ClockProvenance::from_schema unknown-variant decode).
  • Init grants WallClock to audit service and TLS service in the manifest bundle. (Deferred; the landed proof grants wall_clock directly to the shell-as-init in system-shell.cue.)
  • WASM host adapter: clock_time_get(CLOCKID_REALTIME) reads the instance’s granted WallClock; if absent, returns NOSYS as before. (Deferred.)
  • Smoke: a shell date command in make run-shell boots, reads WallClock, prints UTC seconds/nanos/monotonic plus the provenance label, and exits cleanly. Landed (asserted in tools/qemu-shell-smoke.sh).

Phase 2 — Clock Discipline and NTP Service

  • Add ClockDiscipline interface to schema.
  • Kernel implements step(), slew(), setProvenance(), and syncState().
  • A userspace NTP client process receives ClockDiscipline from init and synchronizes to a configured NTP server (requires UdpSocket from the networking capability).
  • After first successful sync, calls setProvenance(ntpSynced, errorBoundNs). All subsequent WallClock.wallTime() calls return ntpSynced.
  • Audit entries timestamped post-sync carry ntpSynced provenance.

Phase 3 — PTP, Leap Second, and Suspend Recovery

  • PTP hardware clock support for environments that have it.
  • Leap-second policy: step vs. smear, configurable per ClockDiscipline.
  • Suspend/resume: WallClock provenance downgrades to measuredBootMonotonic after a suspend event until NTP re-syncs. (Cross-links to the future power/suspend proposal; no dependency today.)
  • Timezone delivery: a Directory namespace entry backed by tzdata is seeded from the manifest and delivered as a cap to timezone-aware services.

Hazards and Invariants

Monotonic vs. wall-clock relationship. The wall-clock state is an offset applied to the Timer monotonic base. step() changes the offset; the underlying monotonic timeline never goes backward. Callers that need monotonic guarantees must use Timer.now(); callers that need calendar time use WallClock.wallTime(). This separation prevents a clock step from violating monotonicity promises made to schedulers or ring timeouts.

ABI stability. ClockProvenance enum variants must only be added, never removed or reordered. Binaries compiled against an older schema that see an unrecognized provenance value should treat it as untrusted (fail-closed). This requires the capnp generated decode to default unknown enum values to zero, which is ntpSynced — so the schema field ordering above must put untrusted at zero or the generated bindings must use an explicit unknown-variant path. Ordering note: when adding to schema, put untrusted @0 first so that the zero default is fail-closed, not the most-trusted value.

DMA and IRQ neutrality. WallClock and ClockDiscipline do not touch device memory, DMA pools, or interrupt grant paths. They are pure kernel-state caps. No DMA/MMIO/IRQ hazard applies.

No capability-transfer amplification. WallClock is a read-only snapshot surface. Transferring it to another process does not grant clock-setting authority. ClockDiscipline must not be transferable through normal cap-grant paths; it should be restricted to init-owned grant at boot and explicit manifest-operator grants.

Relevant Research and Prior Art

In-Tree Grounding

  • NO_HZ, SQPOLL, and Realtime Scheduling records the Linux timer-stack split between clock sources (monotonic timeline counters) and clock events (hardware devices that interrupt at selected future times), and concludes capOS should “introduce a monotonic now_ns clocksource layer” distinct from the scheduler tick. This proposal builds directly on that separation: Timer.now()/WallClock.wallTime() expose the clocksource timeline, while clock-event programming stays a scheduler concern. The wall-clock offset rides on the same monotonic base so a clock step never rewinds the timeline the scheduler and ring timeouts depend on — the monotonicity invariant called out in that note.
  • Future Scheduler Architecture reinforces the same clocksource/clockevent boundary and the lesson that absolute-deadline waiters should be stored by expiry time, not periodic tick count. That confirms WallClock must not become the deadline substrate: deadlines remain monotonic, and wall-clock time is a separate, disciplinable view layered on top.

External Precedent and Lessons

  • Linux clock_gettime / adjtimex. Linux exposes distinct clock IDs (CLOCK_MONOTONIC vs CLOCK_REALTIME) and gates clock discipline behind a privileged interface: adjtimex/clock_adjtime and stepping the realtime clock require CAP_SYS_TIME. Lesson: reading time and disciplining time are different authorities. capOS encodes this as a read-only WallClock cap held by ordinary services and a separate, stronger ClockDiscipline cap held only by a designated sync service — the capability-native analog of the read/CAP_SYS_TIME split.
  • Linux time namespaces. CLOCK_MONOTONIC/CLOCK_BOOTTIME offsets can be virtualized per namespace so a container observes a different boot/monotonic origin. Lesson: time can be a per-context value rather than a single global ambient fact, which supports delivering wall-clock as a granted, attenuable cap (and timezone data as a scoped Directory) instead of a process-wide environment.
  • Fuchsia/Zircon UTC clock objects. Fuchsia models UTC as a kernel clock object distributed to processes as read-only handles, with a separate privileged maintainer service holding the write handle that disciplines the clock; clock reads carry an error bound and a “started/synced” signal so a reader can tell whether the clock is yet trustworthy. Lesson: this is the closest capability-native precedent for the design here. capOS’s read-only WallClock with a ClockProvenance label maps to Fuchsia’s read-only UTC handle plus its synced/error-bound signal, and ClockDiscipline maps to the single write-handle maintainer. (The in-tree Zircon report covers handles, rights, and VMOs but not the UTC clock object specifically; the UTC-clock mapping is external precedent, not yet captured as an in-tree research note.)
  • NTP step vs. slew. NTP daemons step the clock for large offsets and slew (bounded rate adjustment) for small drift, precisely because abruptly rewinding wall time breaks timestamp ordering and timeouts. Lesson: capOS exposes step() and slew() as distinct ClockDiscipline methods rather than a single “set time”, so the discipline policy is explicit at the cap boundary.
  • IEEE-1588 PTP. Precision Time Protocol provides sub-microsecond hardware timestamping via a dedicated hardware clock, distinct from software NTP. Lesson: provenance is not binary. The ptpSynced vs ntpSynced distinction in ClockProvenance lets a validator that needs high-precision time distinguish the two without conflating accuracy with mere network sync.

Dedicated Research Note

  • Time and Clock Authority is the focused prior-art survey for this proposal: verified Linux CAP_SYS_TIME read/discipline split, time namespaces as per-context clock offsets, chrony/NTP step/slew/smear discipline, PTP/IEEE-1588 hardware timestamping, Fuchsia’s ZX_RIGHT_READ/ZX_RIGHT_WRITE UTC clock object, and leap-second smearing vs stepping, each with its capOS lesson and real sources. It is the primary external grounding for WallClock, ClockDiscipline, and ClockProvenance.

Residual research still owed before Phase 2/3 implementation: the servo / loop-filter behavior, holdover and error-bound estimation, and suspend/resume clock recovery are the highest-risk underspecified areas and should be deepened in that note (or a follow-on) rather than fixed by this proposal’s sketch.

Relevant Proposals

  • Certificates and TLS — TLS validation delegates certificate validity-window checks to a granted WallClock.
  • OIDC and OAuth2 — Token expiry checks (exp, iat, nbf) use a granted WallClock with at least measuredBootMonotonic provenance.
  • WASI Host Adapter — Phase W.3+ clock_time_get(CLOCKID_REALTIME) backed by a per-instance WallClock cap.
  • Cloud Metadata — Cloud instance launch time delivered through the metadata capability; the WallClock seed path integrates with this bootstrap.
  • System Monitoring — Audit records carry ClockProvenance-labeled timestamps from the audit service’s own WallClock read at ingestion.
  • Storage and Naming — Timezone and locale data delivered as a read-only Directory cap, not an ambient environment.

Proposal: Crash Recovery and Supervision

How capOS handles unplanned process failure: propagating the death to capability holders, recording a structured crash event, and restarting the service within a bounded policy — all without resurrecting stale authority.

Problem

Live upgrade covers the planned case: a supervisor quiesces a running service, transfers state, retargets caps, and exits the old process in a controlled sequence. Unplanned failure is different. A process that panics, faults, or is killed by the kernel OOM path leaves no quiesce call, no state handoff, and no ordered exit. The kernel marks the process dead and epoch-bumps its caps, but nothing in the current model tells callers what happened or gives the supervisor a policy-bounded path to respawn it.

The gaps are:

  1. Stale-cap observability. Callers holding a cap to the dead process receive disconnected errors at the transport level (the epoch-revocation path from Stage 6 is in place), but there is no structured CQE event that carries crash context or lets the caller distinguish a crash from a planned termination.
  2. Crash metadata capture. Panic location, fault address, and last SQE opcode are useful for operators but must not leak raw cap-table contents, local cap IDs, or buffer bytes, which would break the no-ambient-authority invariant.
  3. Bounded restart policy. Re-spawning a crashing service without a budget produces crash-loop amplification; re-spawning must use the same broker and manifest authority that the original spawn used, not an escalated path.
  4. Watchdog liveness. A process that hangs without crashing is not detected by crash handling alone.
  5. Degraded boot. If a critical service fails to start, the system needs a safe fallback rather than a silent hang.

This proposal fills these gaps without touching the live-upgrade protocol and without adding a god-object supervisor.

User Stories

  • An operator running make run-smoke sees a structured crash record in the audit log when a demo service panics, not a silent stale-cap error.
  • A client process calling a crashed server receives a disconnected-class CapException promptly; the process does not block indefinitely.
  • Init restarts a failed service up to the configured failure budget, then stops and declares the service permanently failed rather than looping forever.
  • A watchdog-registered service that hangs (no panic, no exit) is detected within its timeout and restarted under the same policy.
  • If the network stack fails before a shell connects, the manifest-declared emergency shell starts instead of leaving the system unresponsive.

Design

Stale-Cap Propagation

When the kernel marks a process dead (panic, fault, or explicit terminate without a prior clean exit), it performs the same epoch-bump it already does for released caps. The existing disconnected value in ExceptionType covers the transport error. The new addition is a death CQE: a CapException { type: disconnected, message: "server-death" } delivered to any process with an outstanding CALL SQE whose target belongs to the dead process.

From the caller’s perspective an unplanned crash looks identical to a force-mode live upgrade that did not reattach: the in-flight CALL returns disconnected, epoch is bumped, and any subsequent CALL on that cap also returns disconnected until the supervisor retargets the cap to a fresh instance. No new CQE opcode is needed; the existing two-level error model from Error Handling is sufficient.

Invariants:

  • A disconnected CQE on an outstanding CALL must be delivered before the kernel recycles any frame that belonged to the dead process. Frame reuse ordering is the same constraint that applies to the force-mode live-upgrade path.
  • A cap whose epoch has been bumped must never route a new CALL to the dead process’s address space, even transiently. The epoch check is a load fence on the per-cap generation counter before any ring dispatch.
  • Endpoint client facets held by the dead process are revoked at the same epoch bump. Other processes’ client facets to the same endpoint are not affected — they route to the endpoint owner, not to the crashed client.

Crash Record Capture

When a process dies unplanned, the kernel appends a crash record to the AuditLog cap held by the supervisor that spawned the process (not to a global log visible to all processes). The record is structured to support operator debugging without leaking internal kernel state:

# Proposed addition to schema/capos.capnp (Phase 1)

enum CrashKind {
    panic @0;         # Rust panic! path
    pageFault @1;     # unmapped or protection fault
    generalProtection @2;
    stackOverflow @3;
    illegalInstruction @4;
    kernelKill @5;    # explicit ProcessHandle.terminate
}

struct CrashRecord {
    processName @0 :Text;
    kind @1 :CrashKind;
    # Instruction pointer at death, relative to ELF load base.
    # Absolute virtual address is NOT included to avoid leaking
    # kernel-side layout or userspace ASLR seeds.
    faultOffsetInBinary @2 :UInt64;
    # Last SQE opcode dispatched for this process (0 = none in flight).
    lastSqeOpcode @3 :UInt8;
    # Session context ID of the process (opaque; matches AuditLog sessionId
    # for attribution without carrying cap-table or buffer contents).
    sessionContextId @4 :Data;
    # Monotonic kernel timestamp at death.
    timestampNs @5 :UInt64;
}

Fields explicitly not included: raw cap IDs, cap-table slot contents, userspace buffer bytes, kernel heap pointers, or any data from the process’s address space beyond the fault offset. The crash record is attributed to the process’s session context ID so it can be correlated with prior AuditLog records without exposing the full cap graph.

The crash record is delivered through the same AuditLog.record path the hardware-audit service already uses: the supervisor holds the AuditLog cap; the kernel invokes it on the supervisor’s ring (via a kernel-initiated RECV) rather than on a shared global ring.

Bounded Restart Policy

The supervisor that spawned a failed process owns the restart decision. The restart budget is declared in the manifest’s initConfig.services entry and interpreted by init (or a delegated supervisor):

# CUE representation (illustrative)
restart: {
    policy:       "on-failure"   # never | on-failure | always
    maxRestarts:  5              # total budget over the window
    windowSecs:   60             # sliding window for the budget
    backoffBase:  "1s"           # initial delay before first restart
    backoffMax:   "30s"          # ceiling on exponential backoff
    emergencyFallback: "shell"   # service name to promote if budget exhausted
}

Backoff is bounded and service-class aware. The exponential backoffBasebackoffMax schedule suits user-facing services that should self-heal without spinning (the Kubernetes CrashLoopBackOff lesson). For always-available system services, the prior-art note’s systemd lesson favors a short flat delay so a transient fault recovers fast; such services set backoffBase == backoffMax for flat RestartSec-style behavior. In both cases maxRestarts/windowSecs is the hard give-up budget (the OTP max-restart-intensity lesson), so neither model spins forever.

Crash-loop detection. If maxRestarts attempts exhaust within windowSecs, the supervisor stops restarting and records a budget-exhausted event. The service is marked permanently failed until an operator issues an explicit override through the ProcessHandle or re-spawns via a fresh manifest reload.

Authority preservation. Each restart uses the original ProcessSpawner call with the same CapGrant list that was used at initial spawn. The supervisor does not invent new grants or escalate authority. If a grant source was a SpawnGrantSource::Kernel DDF handle that is now invalidated (for example, a DMA buffer whose owner quiesce failed), the restart fails closed with a spawn-grant-invalid error rather than falling back to an ambient grant.

No resurrection of stale caps. The restarted process receives a fresh cap table. The supervisor must call CapRetarget (from Live Upgrade) to re-point existing client caps to the new process. If CapRetarget is not yet implemented, clients observing disconnected must reconnect through the supervisor’s exported endpoint, which the supervisor re-registers after restart.

Watchdog Capability

A service that can hang without crashing (blocked ring, infinite loop, deadlock on a kernel-held lock it does not own) is not detected by exit-path crash handling. The watchdog provides periodic liveness proof:

# Proposed future addition to schema/capos.capnp (Phase 3)

interface Watchdog {
    # Service calls this on every iteration of its main loop to
    # reset the deadline.  If not called within `timeoutNs` of the
    # last kick (or of registration), the supervisor is notified.
    kick @0 () -> ();
    # Unregister.  Safe to call during planned shutdown.
    cancel @1 () -> ();
}

interface WatchdogSource {
    # Register this process with the given timeout.
    # Returns a Watchdog the service holds and kicks.
    register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}

The supervisor grants a Watchdog cap (minted from a WatchdogSource it holds) to each service it considers watchdog-registered. If the kernel timer fires without a kick, the supervisor receives a liveness-failure notification and treats it identically to an unplanned crash: crash record, restart budget check, backoff.

The watchdog is an opt-in service-level contract, not a mandatory kernel mechanism. Services that are inherently event-driven (blocked on cap_enter waiting for an SQE) do not need a watchdog; they will return disconnected to callers if they stop processing. Watchdog is primarily useful for services with internal polling loops or external I/O not driven by the capOS ring.

Degraded Boot

The manifest may declare an emergency fallback service that is promoted when a critical service exhausts its restart budget before the system reaches a usable state:

# CUE (illustrative)
degradedBoot: {
    trigger:  "net-stack"    # if this service fails permanently...
    fallback: "shell"        # ...promote this service to interactive
    timeoutSecs: 30          # deadline from kernel handoff to readiness
}

The init process monitors service readiness. If a declared critical service fails to reach readiness within the timeout and has exhausted its restart budget, init spawns the fallback service with a console cap and an audit cap so an operator can inspect what failed. The fallback service is not granted the failed service’s caps; it is a scoped interactive shell, not a repair agent with escalated authority.

Relevant Research and Prior Art

In-Tree Research Notes

  • Crash Recovery and Supervision is the dedicated prior-art survey for this proposal: supervision trees, restart budgets (OTP intensity/period, systemd StartLimit, Kubernetes CrashLoopBackOff), dead-server notification (Fuchsia ZX_CHANNEL_PEER_CLOSED, seL4 silence, Genode Ipc_error), and coredump-redaction concerns, each verified against primary sources.
  • OS Error Handling grounds the stale-cap surface. It records what callers observe when a server dies in comparable systems: Zircon channels close and the peer observes ZX_ERR_PEER_CLOSED; Genode capabilities to a dead server become invalid and subsequent invocations produce Ipc_error; seL4 routes faults to a per-thread fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared lesson is that a dead server must surface as a typed transport-level signal, not a hung invocation — which is exactly the disconnected death CQE this proposal specifies.
  • Cap’n Proto Error Handling fixes the meaning of disconnected in the four-kind capnp model: “connection to a necessary capability was lost,” with the client response being re-establish-and-retry. This proposal reuses that existing classification rather than minting a new exception kind; the only addition is when the kernel emits it (unplanned death) and the paired CrashRecord.
  • EROS, CapROS, Coyotos documents the EROS/KeyKOS keeper mechanism: a capability to a separate domain that the kernel invokes on fault, which can inspect, terminate, or restart the faulting domain (process supervision is an explicit listed use). capOS’s supervisor-owns-ProcessHandle model is the same shape with capnp typed methods instead of a keeper key, and the kernel never initiates the restart itself.
  • Genode and seL4 ground the no-resurrected-authority invariant: Genode’s parent-supervised component tree with revocable capabilities, and seL4’s hierarchical delegation plus Revoke over the capability derivation tree, both establish that a restarted child gets fresh authority and that revocation of the dead instance’s caps is the supervisor’s (parent’s) responsibility, not an ambient lookup.

External Precedent and Lessons

  • Erlang/OTP supervision trees. The “let it crash” philosophy plus supervisor restart strategies (one_for_one, one_for_all, rest_for_one) and max-restart-intensity (MaxR restarts within MaxT seconds, after which the supervisor itself terminates) are the direct precedent for this proposal’s per-service failure budget and crash-loop detection. The lesson: bound restarts over a sliding window and escalate (here: stop and mark permanently failed, optionally promote degraded boot) rather than loop forever.
  • systemd unit restart policy. Restart=on-failure|always, RestartSec backoff, and StartLimitIntervalSec/StartLimitBurst are the precedent for the policy/backoffBase/maxRestarts/windowSecs fields. The lesson: separate the whether (policy) from the pacing (backoff) from the give-up threshold (burst limit).
  • Kubernetes liveness/readiness probes and CrashLoopBackoff. Liveness probes (kubelet restarts a container that fails its probe) are the precedent for the Watchdog.kick/timeout design; readiness gating before promotion is the precedent for the degraded-boot readiness deadline; CrashLoopBackOff with exponential backoff is the precedent for capped exponential restart delay. The lesson: liveness is opt-in and orthogonal to crash detection — a hung-but-not-dead process needs an explicit liveness signal.
  • Fuchsia component lifecycle. Component-manager-driven start/stop and rebinding in the routing graph parallel capOS’s supervisor + CapRetarget reconnection. The dedicated research note above grounds Fuchsia’s death-observation behavior (ZX_CHANNEL_PEER_CLOSED, no implicit reconnect); a deeper write-up of Fuchsia component-manager restart and escrow semantics remains research-needed (per the docs/backlog/research-design-gaps.md convention) before this proposal cites specific escrow behavior as grounding.

Phasing

Phase 1 — Stale-cap DISCONNECTED propagation and crash record (most model-critical). The death CQE for in-flight CALLs is the highest-priority item because it closes the model gap: callers can observe server death as a typed transport error rather than a hung ring. Crash record delivery to the supervisor’s AuditLog is paired here because it uses the same kernel death path. Requires: epoch-revocation from Stage 6 (done), AuditLog cap (done), CrashRecord schema addition.

Phase 2 — Bounded restart policy and crash-loop detection. Init reads restart budget fields from initConfig.services, applies exponential backoff, and stops at budget exhaustion. Requires: Phase 1 crash record so init knows whether a death was planned or unplanned; CapRetarget from live-upgrade Phase 1 to reconnect client caps after restart.

Phase 3 — Watchdog capability. WatchdogSource and Watchdog schema, kernel timer integration, supervisor-side timeout detection, and liveness-failure events fed into the same restart budget path as Phase 2.

Phase 4 — Degraded boot. Manifest parser reads degradedBoot fields; init promotes the fallback service on budget exhaustion during the boot window. Requires: Phase 2 budget tracking.

Hazards and Invariants

Frame reuse ordering. The kernel must not return frames from a dead process’s address space to the frame allocator until all outstanding disconnected CQEs for that process’s caps have been delivered. Violating this could allow a concurrent FrameAlloc to map recycled memory into a new process before the old process’s CQEs complete, creating a window where a stale disconnected CQE arrives after the frame holds new data. The existing DMA quiesce/scrub ordering in the DMA pool grant path is the model for this constraint.

No stale authority after restart. A restarted process receives only the grants declared in the original ProcessSpawner.spawn call. The supervisor must not silently re-grant caps that were revoked as part of the death epoch bump. In particular, any DMAPool-derived handle that was in active use at crash time must be explicitly re-acquired through the grant-source path, not recycled from the dead process’s cap table.

Restart does not bypass the authority broker. If the original spawn was gated on an AuthorityBroker-selected session context, the restart uses the same broker path. The supervisor cannot substitute a broader session context or an anonymous context to make the restart succeed.

Capability revocation precedes any dump. The death epoch bump that invalidates the crashed process’s caps must complete before any crash record or future coredump is produced. A record produced post-revocation sees only dead cap indices, never live authority; a pre-revocation memory snapshot could otherwise capture live cap indices or ring-buffer contents (the race class behind recent coredump CVEs). Any future coredump extension must run only after revocation and must not be readable by unprivileged dump readers.

Crash record isolation. The crash record must not carry raw cap IDs, cap table slot numbers, or any data read from the process’s address space (stack contents, heap contents, message buffers). The fault offset is relative to the binary load base, not an absolute virtual address, to avoid leaking kernel layout or userspace address randomization.

Watchdog authority is narrow. A Watchdog cap proves liveness for exactly one registered process. It does not grant the holder any access to the supervisor, the process’s caps, or any other service. It is a pure liveness signal, not an authority surface.

Relationship to Adjacent Proposals

  • Live Upgrade — covers the planned case. The CapRetarget primitive defined there is consumed by Phase 2 of this proposal to reconnect client caps after an unplanned restart. The force-mode disconnected delivery and epoch-revocation paths are shared; this proposal adds the death CQE and crash record on top.
  • Service Architecture — defines the supervisor tree and the RestartPolicy type currently parsed by init. This proposal extends that policy with the budget, backoff, and budget-exhaustion fields, and binds crash handling to the supervisor that owns the ProcessHandle, not to a global daemon.
  • capos-service — defines the userspace service framework above capos-rt. The watchdog kick call and readiness notification in Phase 3 are natural additions to the service lifecycle hooks that capos-service abstracts.
  • Error Handling — the disconnected class in the two-level error model is the transport surface for stale-cap delivery. This proposal does not add new error types; it specifies when and how disconnected is delivered for an unplanned death.
  • System Monitoring — crash records, restart events, budget-exhaustion notifications, and watchdog timeouts are all audit-worthy. The monitoring proposal owns the operator visibility surface; this proposal defines the structured events that feed it.
  • Resource Accounting and Quotas — the failure budget is a quota: a count consumed by crash events and refilled by the sliding window. The accounting model for this quota follows the same ledger-of-record pattern as memory and scheduling quotas.

Proposal: Debug and Trace Authority

How capOS should expose process-attach, capability-table inspection, ring-trace capture, and sampler/profiler authority to debuggers and maintenance tools without granting kernel privilege, ambient inspection rights, or a covert channel for authority transfer.

Problem

A capability OS whose security claim is “you can only access what you were explicitly granted” breaks silently if a debugger can attach to any process without authority. Unix ptrace is the canonical example: any process with sufficient Unix privilege can stop, inspect, and modify another process’s address space and register state, bypassing all higher-level access controls. capOS must not import that model.

At the same time, debugging real failures requires more than serial output. The existing debug_tap facility (kernel/src/debug_tap.rs) emits bounded SQE/CQE records to the emergency serial path at QEMU-only build time, but it has no userspace-facing capability, no consent protocol, no audit trail, and no scoping to a specific target. The measure feature adds benchmark-only TSC counters, also build-gated and operator-facing only. There is currently no capability-shaped debug/trace/profile surface at all.

This is a capability-model gap. Until it is filled, the only debugging tool is serial output and offline log inspection — useful for early kernel work, but insufficient once real service decomposition and cross-process interactions exist.

User Stories

  • An operator maintenance session needs to inspect which capabilities a stuck service holds, without being able to invoke any of them.
  • A developer investigating a failing smoke test wants a bounded record of the SQEs and CQEs the target process issued around the failure, decoded against the current schema.
  • A profiler tool needs sampled PC/stack snapshots of a running service at a configured frequency without stopping the service or holding a live breakpoint.
  • An agent-shell maintenance workflow needs to attach to a service granted to it by the authority broker, with that attach action recorded in the audit log.
  • A supervisor needs to assert that a debugged process cannot escalate its authority into other processes by virtue of being debugged.

Design Principles

  1. Attach is authority. Connecting a debug session to a process requires an explicit DebugSession capability. No ambient ptrace analog. The kernel does not hand out debug access on the basis of Unix UID or any implicit privilege.
  2. Consent is required. A DebugSession for a live process is obtained either by explicit owner consent (the process or its supervisor grants one), or through a broker-mediated maintenance session policy decision. Neither path is self-minted.
  3. Attach is audited. Every DebugSession creation and every inspection operation through it is an auditable event. The target process and the audit log both observe it.
  4. Snapshots are read-only. Cap-table and VM inspection through a debug session produce read-only snapshots. No capability in the snapshot is transferable to or activatable by the inspector. A debug session must not become a covert authority-transfer channel. The GDB-RSP prior art is a reminder that a full debugger is read/write authority over its target; in this design the read-only snapshot/trace surface (Phases 1-3) and any future read/write control (breakpoints, register writes, Phase 4) are distinct authorities. Write authority is a separately leased, stronger cap and never rides implicitly on the read-only DebugSession.
  5. Secrets and payload bytes are redacted by default. Cap-table snapshots expose names, interface IDs, and slot indices — not raw capability payloads, bearer tokens, or memory-mapped buffer contents. Payload capture requires a separately leased and stronger cap.
  6. A debugged process cannot escalate. A process being debugged must not thereby gain the ability to inspect or affect other processes. The debug session is scoped to one target; no cross-process read or call is admitted through it.
  7. Symbol resolution is bounded. Resolving a PC address to a symbol name requires access to a symbol table file or binary, not filesystem authority. Symbol resolution is a separate, explicitly scoped cap — not bundled into the basic debug session.
  8. Build gates are graduated. The debug_tap kernel facility stays behind cfg(feature = "debug_tap") for its current always-emit emergency-serial behavior. The userspace-facing DebugSession and RingTrace caps are not build-gated but are absent from production bootstrap CapSets; a broker may mint them only under an explicitly authorized maintenance session policy.

A DebugSession is created through one of two paths:

Owner consent. The target process’s supervisor or owner holds a ProcessHandle and can call a createDebugSession method on it to mint a DebugSession for the target. This is the normal developer workflow: the supervisor that spawned a service grants a debug session to a maintenance tool.

Broker-mediated maintenance session. The authority broker holds a restricted ability to mint DebugSession caps for processes within a maintenance session scope — for example, for an operator who has authenticated and whose session policy permits debugging named services. The broker records the grant as an audit event. Normal shells and user sessions do not receive this authority.

Neither path is self-minted. A process cannot mint a DebugSession for itself or for peers from ambient state. The kernel does not expose a DebugAll cap at bootstrap.

Attaching a DebugSession produces an audit record covering: the initiator session, the target pid and service name, the authority source (owner consent or broker grant), and the timestamp. The target process receives a notification at attach time if it has an active ring — not as a blocking gate, but as an observable event.

Proposed Interfaces

These are conceptual interfaces. They should not be added to schema/capos.capnp until a Phase 1 implementation slice needs them.

# Read-only snapshot of one capability slot in the target's cap table.
# Does not transfer or activate any authority.
struct CapSlotSnapshot {
  slotIndex   @0 :UInt32;
  interfaceId @1 :UInt64;   # capnp type ID; 0 if untyped or unknown
  methodCount @2 :UInt16;
  label       @3 :Text;     # kernel-assigned or schema-derived name
  state       @4 :Text;     # e.g. "live", "released", "pending-return"
}

# Read-only snapshot of the target's capability table.
# None of these slots are transferable to or callable by the inspector.
struct CapTableSnapshot {
  targetPid    @0 :UInt32;
  tick         @1 :UInt64;
  slots        @2 :List(CapSlotSnapshot);
  slotTotal    @3 :UInt32;
  slotUsed     @4 :UInt32;
  snapshotDrop @5 :UInt32;  # slots omitted due to budget/redaction
}

# A scoped debug session attached to one process.
interface DebugSession {
  # Read-only snapshot of the target's current capability table.
  capTableSnapshot @0 () -> (snapshot :CapTableSnapshot);

  # Arm a bounded ring-trace capture on the target.
  # Returns a RingTrace cap scoped to this session and target.
  armRingTrace @1 (maxRecords :UInt32, maxBytes :UInt32)
      -> (trace :RingTrace);

  # Read a bounded sampler record set for the target.
  # Returns PC/stack samples at the configured frequency without
  # stopping the target.
  armSampler @2 (intervalNs :UInt32, maxSamples :UInt32)
      -> (sampler :Sampler);

  # Detach. Further calls on this session are rejected.
  detach @3 () -> ();
}

# Bounded ring-trace cap, scoped to one DebugSession target.
interface RingTrace {
  # Drain buffered SQE/CQE records for the attached target.
  drain @0 (maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);

  # Disarm and release the capture buffer.
  release @1 () -> ();
}

# Sampler cap for sampled PC/stack snapshots.
interface Sampler {
  # Read the next available sample batch.
  read @0 (maxSamples :UInt32)
      -> (samples :List(SamplerRecord), dropped :UInt64);

  # Stop sampling and release the reservation.
  stop @1 () -> ();
}

struct SamplerRecord {
  tick         @0 :UInt64;
  pid          @1 :UInt32;
  pc           @2 :UInt64;
  # Shallow inline frames; bounded to avoid variable-length allocation
  # on the capture hot path.
  frames       @3 :List(UInt64);
  framesDrop   @4 :UInt8;  # frames omitted due to depth cap
}

TraceRecord is the same shape defined in docs/proposals/system-monitoring-proposal.md: tick, pid, opcode, cap_id, method_id, interface_id, result, flags, and an optional payload blob gated by a separately leased stronger cap.

Symbol and Source Boundary

Resolving a sampled PC address or a ring-trace cap_id to a human-readable symbol requires access to symbol tables and debug info, not filesystem authority. The design uses an explicit, scoped symbol-resolver cap:

  • A SymbolTable cap holds a read-only ELF DWARF/symbol section for one binary, loaded from a trusted source (boot package or signed artifact store).
  • The inspector passes a SymbolTable cap and a list of addresses; the resolver returns bounded name strings.
  • No arbitrary filesystem path traversal is admitted through this path.
  • SymbolTable is separately minted from DebugSession; holding a debug session does not imply symbol resolution authority, and holding a symbol table does not imply attach authority.

Symbol resolution is Phase 3+ work. Phase 1 produces raw addresses; offline host-side tools (e.g., addr2line on the kernel ELF) handle symbol lookup during the research phase.

Phasing

Phase 1 — DebugSession Attach and Cap-Table Snapshot (model-critical)

  • Define DebugSession, CapSlotSnapshot, and CapTableSnapshot in schema/capos.capnp.
  • Implement ProcessHandle.createDebugSession in the kernel, guarded by the existing ProcessHandle authority boundary. capOS uses process-level debug authority here because most current services are single-threaded; the seL4 per-TCB-cap prior art argues for deriving per-thread sessions from ThreadControl, the intended finer-grained follow-up once multi-threaded targets need it.
  • capTableSnapshot returns a bounded, redacted read-only snapshot of the target’s current cap table. No cap in the snapshot is transferable or callable.
  • Audit record emitted to AuditLog at attach and at each snapshot call.
  • No payload capture, no ring trace, no sampler in this phase.
  • Proof: a smoke test where a supervisor attaches a debug session to a child, calls capTableSnapshot, and verifies the snapshot fields against what the child was granted at spawn time. The audit log must contain the attach record.

Phase 2 — Ring Trace via DebugSession

  • Add armRingTrace and RingTrace to the schema and kernel.
  • Build on the existing debug_tap ring-capture record format (RingCaptureRecord in capos_config::ring), but route capture through the DebugSession authority rather than the always-emit emergency-serial path.
  • The RingTrace cap is scoped to the attached target; it cannot observe other processes.
  • Payload capture (includePayloadBytes) requires a separately presented stronger cap (not yet defined in Phase 2).
  • Disarming the RingTrace releases the capture buffer and emits an audit record.
  • Proof: extend the failing-call smoke from the Ring-as-Black-Box milestone (commit da5f5e9) to route capture through a DebugSession instead of the emergency serial path, and verify the drained records match the expected SQE/CQE sequence.

Phase 3 — Sampler Authority

  • Add armSampler and Sampler to the schema and kernel.
  • The sampler fires at a configured interval, captures PC and a bounded inline call frame, and buffers records for drain.
  • The target process is not stopped; sampler overhead is bounded by sample interval and buffer depth.
  • Relates to the System Performance Benchmarks proposal: a benchmark runner may arm a sampler before a workload and drain it after to produce a flamegraph, subject to the same audit and consent rules.
  • Symbol resolution is offline in this phase (host-side addr2line).

Phase 4 — Breakpoint, Single-Step, and Payload Capture (deferred)

Breakpoint and single-step authority has a much larger kernel surface than read-only snapshot and sampling. Payload capture risks exposing secrets. Both are deferred until the Phase 1–3 model is stable and the audit/consent infrastructure is proven.

When payload capture is added, it must:

  • require a separately leased PayloadCapture cap distinct from the base RingTrace cap;
  • be a separately audited grant;
  • carry a per-call byte budget enforced by the kernel.

Hazard Preflight

paging/MMIO: cap-table snapshots and ring-trace records read kernel state under existing locks. No new user-mapping or MMIO surface is introduced in Phase 1–3.

ABI: DebugSession, CapTableSnapshot, and RingTrace are new schema interfaces. Generated bindings must be refreshed via make generated-code-check before merging any Phase 1 branch.

authority transfer via snapshot: the critical invariant is that no CapSlotSnapshot entry can be used by the inspector to call or transfer a capability. The kernel must enforce that the snapshot data path does not return live cap references — only metadata fields (interface ID, label, state). This must be verified in the Phase 1 implementation review.

audit bypass: an inspector must not be able to suppress or delay audit records for its own actions. Audit writes must occur synchronously within the debug session dispatch path, not deferred.

covert timing channel: a sampler that returns precise timestamps could be used to extract timing side-channel information about a target service. The sampler tick field is clamped to PIT-resolution granularity in Phase 3 to reduce precision; finer clock access for profiling remains deferred.

Security Boundaries

  • A DebugSession holder can read snapshots of one target. It cannot call, transfer, or activate any capability belonging to the target.
  • A RingTrace holder can read ring metadata for one target. Payload bytes require a separate stronger cap.
  • A Sampler holder receives PC and bounded stack frames for one target. No memory-mapped content, no register state beyond PC.
  • None of these caps admit cross-process inspection. A DebugSession for process A cannot observe process B.
  • A debugged process remains subject to normal scheduler and capability enforcement. Being debugged does not grant the target any additional capability slots or authority.
  • Redaction applies at snapshot construction time, not at read time. The kernel constructs the redacted view; the inspector never sees the raw kernel state.

Non-Goals

  • No ambient ptrace-style process attach without authority.
  • No kernel debugger (GDB stub, JTAG) exposed as a userspace capability surface — those are operator boot-time tools, not capability-model components.
  • No replay semantics. Ring trace is inspection, not record/replay. Replay requires payload retention, timer modeling, and capability checkpoints; that is out of scope.
  • No cross-process or system-wide trace aggregation in this proposal. Aggregate trace is a monitoring concern covered by docs/proposals/system-monitoring-proposal.md.
  • No memory read/write through a debug session. Address-space inspection is a separate and stronger authority not proposed here.
  • No DebugSession self-grant. A process cannot debug itself through this interface.
  • No crash/exception observation here. A read-only ExceptionObserver cap (the Zircon task_create_exception_channel analog) for receiving crash notifications without debug-write authority is a separate, weaker authority owned by Crash Recovery and Supervision, not bundled into DebugSession.

Relevant Research and Prior Art

In-Tree Notes

  • Debug, Trace, and Profiling Authority is the dedicated prior-art survey for this proposal: GDB remote serial protocol, Linux ptrace/Yama, perf/CAP_PERFMON, Fuchsia handle-scoped debug_agent/zxdb, seL4 TCB-cap hardware debug, and Genode CPU-session GDB monitor, grounding the DebugSession/Sampler/exception-observer authority split against real sources.
  • docs/research/zircon.md documents Fuchsia’s handle model: handles are process-local references with a rights bitmask, there is no ambient authority, and a process can only interact with kernel objects through handles it holds. capOS draws the directly applicable lesson here — a DebugSession is a held capability, not an ambient privilege, and inspection of a target’s cap table is itself a distinct grantable authority rather than a side effect of holding a generic “debug” right. The note covers handle rights and transfer but not Fuchsia’s debug_agent/zxdb debugging service specifically; that service is now surveyed in the dedicated research note above (and summarized below).
  • docs/research/sel4.md records that seL4 has no in-kernel debug traps or thread-introspection mechanism in the verified configuration; debugging is pushed to userspace and the design constraints (typed authority, no ambient inspection) matter more than any debugger feature. capOS follows the same posture: keep the kernel surface to read-only snapshot and bounded capture, and route policy (who may attach, to what) to userspace consent and the broker.
  • docs/research/genode.md documents Genode’s session-and-label model, where every cross-component request carries a label and is mediated by a parent component. The applicable lesson is that attach authority should flow through the same parent/supervisor relationship that already governs spawning — a supervisor that holds a child’s ProcessHandle is the natural minter of a DebugSession for that child, mirroring Genode’s parent-mediated session routing rather than a global debugger service.
  • docs/research/completion-ring-threading.md grounds the io_uring-style SQ/CQ ring transport that the Phase 2 ring trace observes. The trace records the same SQE/CQE structures already captured by the kernel debug_tap facility (RingCaptureRecord in capos_config::ring); this proposal adds the authority and consent layer that the existing build-gated emergency-serial capture lacks.

External Precedent

  • GDB remote serial protocol (gdbserver). GDB separates the debugger front-end from a target-side stub that exposes register, memory, and breakpoint operations over a serial/TCP channel. The lesson for capOS is that the inspection surface can be a narrow, well-defined protocol object rather than ambient access — but full register/memory read-write is exactly the strong authority capOS defers to Phase 4 and keeps out of the read-only DebugSession.
  • Linux ptrace(2). ptrace is the canonical ambient-authority footgun: attach authority derives from Unix UID and the Yama ptrace_scope sysctl rather than from a held, transferable capability, and a successful attach grants register and full address-space read/write at once. This conflates “may observe” with “may control” and bypasses higher-level access controls. capOS rejects this directly — DebugSession attach is owner-consented or broker-granted, audited, and read-only; observation and control are separate authorities.
  • Linux perf and eBPF tracing. Sampled profiling and tracing on Linux sit behind privilege boundaries (perf_event_paranoid, CAP_PERFMON/CAP_BPF) precisely because PC/stack sampling and kernel-wide tracing leak timing and topology information across trust boundaries. capOS treats the same risk as a capability and an audit event: the Sampler cap is scoped to one consented target, its timestamp resolution is clamped, and arming it is recorded.
  • Fuchsia debug_agent / zxdb. Fuchsia’s debugger is a userspace service (debug_agent) that the zxdb front-end drives; it operates on process and thread handles rather than ambient privilege, consistent with Zircon’s object-capability model. This is the closest external precedent for capOS’s intended shape — debugging as a handle/capability-mediated service, not a kernel-ambient right. A dedicated in-tree note on the debug_agent design is research-needed per the docs/backlog/research-design-gaps.md convention before the Phase 4 breakpoint/single-step surface is designed.
  • Object-capability systems generally. Capability systems avoid an ambient ptrace analog because there is no global principal that implicitly dominates other processes; the authority to inspect must be granted like any other capability. This is the structural reason capOS can offer debugging without reintroducing ambient authority, and why the consent and audit requirements in this proposal are load-bearing rather than optional hardening.

Relevant Proposals

  • System Monitoring (system-monitoring-proposal.md): owns aggregate ring traces (TraceCapture), log/metric/audit signal taxonomy, and the RingTap move-semantics note for payload-capturing taps. This proposal owns the per-process debug attach authority and consent model that monitoring’s trace surfaces do not cover. TraceRecord schema is shared; authority and consent model is separate.
  • Security and Verification (security-and-verification-proposal.md): the trust-boundary inventory (Track S.7) must be updated to include DebugSession, RingTrace, Sampler, and CapTableSnapshot as new boundaries before downstream services rely on them.
  • System Performance Benchmarks (system-performance-benchmarks-proposal.md): benchmark runners may arm a Sampler before a workload run; this proposal defines the authority and consent model for that use.
  • Task State and Agent Telemetry (task-state-and-agent-telemetry-proposal.md): agent maintenance sessions may use DebugSession to inspect service state; telemetry records that fact.

Proposal: Durable Hardware Audit Log Persistence

How the HardwareAuditLog capability moves from a bounded volatile in-kernel ring to durable, tamper-evident audit storage without claiming authority it does not have.

Problem

HardwareAuditLog is the read-only observer over the four hardware authority caps (DeviceMmio, Interrupt, DMAPool, DMABuffer). The kernel still emits one cap-audit: line per lifecycle event and appends a copy into a fixed-size volatile ring (capacity 64, drop-oldest). The userspace hardware-audit-service now drains that ring into a Store-backed, hash-chained segment ring recoverable through Store.list inventory, and serves scoped HardwareAuditReader snapshots with self-describing persistence, retention, subscriber-admission, keyed-seal, key-lifecycle, physical-persistence, and runtime-admission metadata. The regular DDF audit service smoke uses the RAM-backed StoreCap and keeps the IOMMU abort-held DMAPool/DMABuffer evidence strict. The physical persistence proof manifest grants persistent_store to the service and reuses one disk image across two QEMU boots; pass 2 must recover and verify pass-1 audit segment blobs before draining current-boot records. The smoke also stores and reads a separate content-addressed marker as an independent Store-disk sanity check.

The current keyed mode uses a RAM-local RamSymmetricKey minted through the development-only local DevelopmentSoftwareKeySource and seals each segment header with HMAC-SHA256. The audit service never exports raw key material. Snapshot metadata reports the signing key identifier, generation, single-local-key rotation status, and RAM-local revocation caveat so a verifier can distinguish this local proof from external KeyVault custody.

The remaining gaps before a full production durability and audit-verifier claim are:

  1. External verifier key custody. The shipped keyed seal is local HMAC evidence from a development-only deterministic key source. It is not yet a production KeyVault/KeySource-managed key with durable rotation and revocation enforcement.
  2. Production media and rollback policy. The QEMU persistent_store reboot proof demonstrates Store-backed survival across boot using the CAPOSST1 disk format. Volume rollback resistance and cloud/hardware media assumptions remain the storage track’s responsibility.
  3. Runtime subscribers are refused until a broker path exists. Manifest scoped reader grants work. HardwareAuditReader runtime admission now fails closed with an explicit no-authority-broker status instead of silently implying support.

The local proof was implemented by docs/tasks/done/2026-06-07/hardware-audit-physical-persistence-signing-local-proof.md.

This proposal selects the target design for those production extensions and records the boundaries of the Store-backed service that has landed.

Scope and Non-Claims

This proposal is deliberately narrow. It is observer-evidence design only.

  • Audit persistence records authority events. It does not grant, gate, or imply authority. The authority checks stay in the device-manager and cap-object paths exactly where they are now.
  • Durable audit is not IOMMU isolation. It does not bound DMA, validate MMIO ranges, or constrain interrupt routes. It records that those events happened.
  • Durable audit is not provider-driver readiness. A persisted audit trail does not make a userspace driver production-ready; it makes the driver’s hardware-cap lifecycle reviewable.
  • Tamper-evidence is detection, not prevention. A signed, hash-chained log proves history was not edited if verification passes; it cannot stop a privileged writer from refusing to append. Availability of the audit path is a separate concern.
  • The durable path must not depend on volatile QEMU-only state, the qemu cargo feature proof rings, or local run telemetry. Those remain harness scaffolding.

Design Grounding

  • docs/tasks/done/2026-05-22/ddf-audit-cap-durable-persistence.md — acceptance criteria and hazard preflight this proposal answers.
  • docs/proposals/cryptography-and-key-management-proposal.mdSymmetricKey (mac/verify), PrivateKey (sign), KeySource, and KeyVault primitives consumed for tamper-evidence and key lifecycle.
  • docs/proposals/storage-and-naming-proposal.md — capability-native Store, append-only File/ledger semantics, content hashing, previous-record hash chaining, and stale-write rules consumed for the durable ring.
  • docs/proposals/system-monitoring-proposal.md — audit as a distinct append-only record type with its own readers and retention, X.740 audit field model, and “observation is authority” principle.
  • docs/dma-isolation-design.md and docs/backlog/hardware-boot-storage.md — the device-driver foundation context the hardware authority caps live in.
  • kernel/src/cap/hardware_audit.rs — the current volatile-ring behavior this design preserves and extends.

Design

1. Durable Audit-Record Ring

The durable audit path is a two-tier structure: the existing bounded in-kernel volatile ring stays as a fast-path staging buffer, and a userspace audit log service owns durable persistence behind the capability-native Store interface.

flowchart LR
    DM[Device manager and<br/>hardware cap objects] -->|emit_cap_audit| KR[Kernel volatile ring<br/>capacity 64, drop-oldest]
    KR -->|drain cursor poll| ALS[Audit log service<br/>userspace]
    ALS -->|append-only records| ST[(Store / append-only<br/>ledger segment)]
    ALS -->|sealed segment digest| KV[KeyVault / KeySource]
    ALS -->|scoped read window| SUB[Admitted subscribers]

Why a userspace service, not kernel-side disk I/O. Durable storage means a block device, a filesystem-like layout, segment rotation, and signing. None of that belongs in the kernel: the kernel’s job is dispatch and isolation. The kernel keeps doing exactly what it does today — bounded, alloc-free, lock-light ring emission — and a userspace audit log service drains it through HardwareAuditLog.drain with a per-cap cursor. This also keeps the durable path off QEMU-only telemetry: the service persists through the Store interface. The current bootstrap StoreCap is RAM-backed and therefore demonstrates the contract; a real BlockDevice or cloud bridge adapter per the storage proposal is required before this path claims post-reboot retention.

Drain protocol. The audit log service polls HardwareAuditLog.drain with a monotonic expected_sequence cursor. Each successful drain returns the window since the last durably-committed sequence. The service:

  1. Reads the drained window and the dropped_records counter.
  2. Appends each record to the current segment (see rotation below).
  3. Advances its cursor to next_sequence only after the segment write is durably committed (Store sync).

If the kernel ring drops records between polls (dropped_records advanced by more than the records the service consumed), the service writes a gap marker record into the durable log: { kind: gap, lost_count, observed_at }. A gap is itself audit evidence — it is recorded, not hidden. The drop-oldest behavior of the kernel ring is therefore preserved and made visible in the durable log rather than silently lost.

Retention and rotation. The durable log is a sequence of fixed-size segments (proposed 1 MiB each; an implementation tuning parameter, not an ABI). When a segment fills:

  1. The service computes the segment digest (see tamper-evidence below).
  2. It seals the segment (digest + chain link recorded).
  3. It opens the next segment, whose first record carries the previous segment’s digest as prev_segment_digest.

Retention is count-bounded and age-bounded: keep at most N sealed segments (proposed default 64) or segments newer than T (proposed default 30 days), whichever is smaller. The bound is a manifest-configurable policy on the audit log service, not a kernel constant.

Overflow policy. Two distinct overflow points, two distinct policies:

  • Kernel ring → service drain lag. Drop-oldest, as today, with a recorded gap marker. Rationale: the kernel ring must never block a hardware cap lifecycle path on a slow or absent consumer. Audit emission is best-effort by construction; the gap marker makes the loss auditable.
  • Durable segment retention limit. Drop-oldest sealed segment, with a retention-eviction record appended to the active segment naming the evicted segment’s digest and sequence range. Rationale: an operator querying “what did we lose to retention” gets a definite answer, and the hash chain stays intact across the eviction (the eviction record links forward; the evicted segment’s digest is permanently recorded before deletion).

Backpressure is explicitly rejected for both points. Backpressuring a hardware authority cap on audit-storage latency would let a stalled disk wedge device lifecycle — an availability and correctness hazard far worse than a recorded gap. Audit is evidence over authority, never a gate on it.

Crash-recovery semantics. On audit log service restart:

  1. The service scans sealed segments oldest-to-newest, verifying each segment digest and the prev_segment_digest chain link.
  2. It finds the last segment. If the last segment is unsealed, it replays its records, recomputing the running digest; a torn final record (incomplete write) is truncated at the last valid record boundary and a recovery_truncation marker is appended.
  3. It re-derives the drain cursor from the highest durably-committed sequence and resumes polling the kernel ring from there.

Records lost in the window between the last durable commit and the crash are not recoverable — the kernel ring is volatile and a crash loses it. This is an explicit, accepted limitation: see Assumptions. The recovery markers make the boundary of trustworthy history explicit to any consumer.

2. Tamper-Evidence and Segment Seals

Tamper-evidence is a hash chain plus segment signing, consuming the cryptography/key-management proposal’s primitives. No new crypto is invented here.

Per-record chaining. Each durable audit record carries prev_record_hash — a hash over the previous record’s canonical bytes. This is exactly the append-only-ledger pattern the storage proposal already prescribes (“append new records with previous-record hashes rather than rewriting history”). Editing or reordering any record breaks every subsequent prev_record_hash, so a verifier walking the chain detects the first divergence.

Per-segment signing. The shipped service records per-segment digests and a running chain head so retained-window tampering is detectable. The local keyed proof seals each segment header with HMAC-SHA256 using a RAM-local symmetric key cap minted by the development-only local key source. When a segment is sealed, the audit log service computes the segment digest (a hash over the sealed record range, anchored on the running chain hash) and produces a keyed seal over { segment_index, sequence_range, record_count, segment_digest, prev_segment_digest }. Production deployment should select one of these key custody modes by manifest policy:

  • MAC mode (default). A SymmetricKey with KeyPurpose.integrity produces an HMAC tag over the segment header via SymmetricKey.mac. Cheaper, no asymmetric key handling, sufficient when the verifier is trusted to hold the same key. Verification is SymmetricKey.verify.
  • Asymmetric mode. A sign-only PrivateKey produces a signature via PrivateKey.sign. Used when audit evidence must be verifiable by a consumer that should not be able to forge records (e.g. an external reviewer holding only the public key). Verification uses the corresponding PublicKey.verify.

The audit log service receives a signing-capable key cap (a SymmetricKey restricted to mac, or a PrivateKey restricted to sign) at manifest grant time. It never holds raw key material — the key is a capability object per the key-management design. The current local proof follows the same no-raw-key custody rule with a RamSymmetricKey minted by the development-only software key source. That source deterministically remints the same non-extractable local HMAC key from stable source metadata and an audit label for the reboot proof, but it is still not production custody: there is no external root, rollback resistance, rotation, or persistent revocation state.

What signs what. The chain hash protects record order and content within and across segments. The segment signature protects the segment header, binding the digest, sequence range, and previous-segment digest under a key. Together: a verifier with the verification key can confirm that the sealed segments form an unbroken, unedited chain back to the first segment, and that each seal was produced by the holder of the signing key.

Key lifecycle.

  • Current local proof. signing_key_id = "local-audit-hmac-v1" and signing_key_generation = 1 identify the development-key-source RAM-local HMAC key generation. key_rotation_status = "single-local-key-no-rotation" and key_revocation_status = "ram-local-key-revocation-not-persistent" are explicit caveats, not production lifecycle controls.
  • Provenance. The signing key is produced by a KeySource and stored sealed in a KeyVault (per the key-management proposal). The manifest grants the audit log service a use capability for the key, not the vault.
  • Rotation. Keys rotate on a policy interval (proposed default 90 days) or on demand. Rotation is segment-aligned: a segment is always signed by exactly one key. The first segment after rotation records a key_rotation marker carrying the new key’s identifier (KeySource.info identifier — a label, not a secret) and the previous key’s identifier. A verifier follows the identifier sequence to know which key verifies which segment range.
  • Revocation. If a signing key is suspected compromised, it is revoked in the KeyVault. Revocation does not invalidate already-sealed segments — those remain verifiable against the (now-revoked) key, and the revocation itself is recorded as a key_revocation marker. What revocation prevents is future seals with that key. A consumer treats segments signed by a revoked key as “authentic at seal time, key later revoked” — still evidence, with a documented caveat.
  • What is NOT protected. Tamper-evidence cannot protect records the kernel ring dropped before the service drained them, cannot protect the crash-window records, and cannot prevent an attacker who holds the live signing key from forging new well-formed history going forward. It detects edits to already-sealed history. These limits are stated in Assumptions.

3. Production Subscriber Admission Policy

Today exactly one manifest-granted reader gets a volatile snapshot. The production model keeps “observation is authority” but adds structure.

Reader caps are typed and scoped. The audit log service exposes readers as distinct capability objects, not a single shared snapshot method:

  • HardwareAuditReader — a read-only cap over a scoped window: a subscriber may be granted the full history, a single hardware-cap-tag slice (e.g. DMAPool events only), or a bounded recent window. Narrowing is structural — a narrower reader is a wrapper cap exposing less, per the capOS capability-model principle, not a rights bitmask.
  • The cap exposes snapshot (cursor-based, preserving the existing field model) and verify (returns segment-chain verification status so a subscriber can confirm tamper-evidence without holding the signing key, when the deployment uses asymmetric mode and grants the public verification key).

Admission is manifest-declared, with a runtime broker path. Two tiers:

  • Manifest-declared subscribers. The boot manifest declares which services receive which scoped reader caps, exactly like every other capability grant. This is the baseline and covers the monitoring/audit service itself.
  • Runtime-admitted subscribers. A later phase may route audit-reader requests through the userspace authority broker (docs/proposals/userspace-authority-broker-proposal.md), so an operator session can be granted a scoped, time-bounded reader without a reboot. This is explicitly future work, gated on the broker. The shipped reader endpoint exposes a runtime-admission method that refuses with InvalidArgument and reports runtime_admission_policy = "runtime-reader-admission-refused-no-authority-broker", so callers get a fail-closed status instead of an implied grant.

Revocation. Reader caps are ordinary caps and are revoked the ordinary way (cap-table teardown). Revoking a reader does not touch the durable log.

4. Preservation of Existing Volatile-Snapshot Behavior

The kernel-side volatile ring and its snapshot ABI are preserved unchanged as the staging tier:

  • The bounded ring (capacity 64), head/len/next_sequence/dropped_records bookkeeping, and drop-oldest admission stay exactly as in kernel/src/cap/hardware_audit.rs.
  • The snapshot cursor (start_sequence), truncation labels (no-records-requested, request-limited, snapshot-limit-limited, available-records-exhausted), and the dropped_records counter stay available to direct HardwareAuditLog.snapshot observers.
  • The durable service path uses HardwareAuditLog.drain(expected_sequence, max_records) as its per-cap cursor protocol. A cursor mismatch still fails closed; a cursor-verified overflow reanchors at the retained window and reports the advanced dropped_records counter so the service can record a visible gap.
  • The QEMU-only proof rings and prove_qemu_snapshot_truncation_contract remain harness scaffolding and are not on the durable path.
  • The HardwareAuditReader.snapshot result’s self-describing status fields stay, and their values advance as the durable path lands. The Store-backed service reports persistence_status = "store-backed-segment-ring", signature_status = "hash-chain-plus-local-hmac-segment-seals", keyed_seal_count greater than or equal to the retained sealed segment count, signing_key_id = "local-audit-hmac-v1", key_rotation_status = "single-local-key-no-rotation", key_revocation_status = "ram-local-key-revocation-not-persistent", physical_persistence_status = "store-cap-backing-manifest-selected", subscriber_admission_status = "manifest-admission-active-runtime-broker-refused", and runtime_admission_policy = "runtime-reader-admission-refused-no-authority-broker". Changing those field values is an ABI-adjacent change and must land with schema, generated bindings, runtime decode, demos, and smoke assertions in one branch, per the task hazard preflight.

No focused hardware-audit smoke is invalidated by this design: the kernel-side behavior they assert is unchanged. New durable-path behavior gets new smokes (see Evidence Expectations in the task file).

5. Assumptions

The durable evidence is trustworthy only under stated assumptions. A consumer must know these before trusting the log.

  • Crash window is lossy. Records in the kernel volatile ring that were not yet durably committed by the audit log service are lost on a crash or power loss. The durable log’s recovery markers bound trustworthy history; they do not recover the lost window. Audit is best-effort at the volatile staging tier by design — it must never block hardware cap lifecycle.
  • Rollback below the audit log is out of scope. This design assumes the Store/BlockDevice beneath the audit log service does not silently roll back committed segments. If the underlying storage can roll back (e.g. a snapshot-restore of the whole volume), the hash chain detects the resulting gap on next verification, but the design does not prevent it. Volume-level rollback protection is the volume-encryption/storage proposals’ concern.
  • Rotation is segment-aligned and monotonic. A production segment is signed by exactly one key. Key identifiers in key_rotation markers are assumed monotonic and unique so a verifier can deterministically map segment ranges to keys.
  • Key lifecycle is delegated. Key generation, sealing, rotation scheduling, and revocation are the KeySource/KeyVault services’ responsibility. This proposal assumes those primitives behave as the key-management proposal specifies; it does not re-implement them. The landed local HMAC proof uses a development-only deterministic source and states its lack of production rotation/revocation in reader-visible metadata.
  • Signing key compromise forges the future, not the past. An attacker holding the live signing key can produce well-formed new records. The hash chain plus revocation marker make the compromise boundary detectable once revocation is recorded, but records sealed during the compromise window are only as trustworthy as the key was. Asymmetric mode narrows this: a verifier holding only the public key cannot itself forge, but a compromised private key still can until revoked.
  • The audit log service is trusted to append. Tamper-evidence detects edits to sealed history. It does not prevent the audit log service from refusing to append, stalling, or being killed. Availability of the audit path — restart policy, health checks — is the service-architecture and monitoring proposals’ concern, not this one.

Relationship to Other Proposals

  • Cryptography and Key Management — this proposal consumes SymmetricKey.mac/verify, PrivateKey.sign, KeySource, and KeyVault. It adds no cryptographic primitive.
  • Storage and Naming — the durable ring is an append-only ledger on the capability-native Store, using the previous-record-hash chaining the storage proposal already prescribes.
  • System Monitoring — the audit log service is the hardware-cap-specific producer feeding the broader audit-record model in the monitoring proposal; scoped HardwareAuditReader caps follow the monitoring proposal’s “observation is authority” and per-record-type retention principles.
  • Device Driver Foundation — this design records hardware authority cap lifecycle events. It does not change where authority is checked, and does not claim provider-driver readiness or IOMMU isolation.

Open Questions

  • Segment size, retention counts, and rotation interval are proposed defaults, not ABI. The focused smoke currently retains eight sealed segments so boot-time abort-held DMA records remain inside the proof window; production defaults still need a tuning pass once a real BlockDevice backend exists.
  • Whether the verify method on HardwareAuditReader should return a full chain proof or a bounded status summary depends on the first real consumer’s needs and is deferred to implementation.
  • Cloud-bridge-backed Store for the durable log inherits the storage proposal’s stale-write and size-bound rules; whether audit segments should also be content-addressed objects in that backend is left to the storage track.

Proposal: System Performance Benchmarks

How capOS should benchmark system performance against other operating systems without producing misleading numbers, rewarding special-case optimizations, or treating speed as a substitute for correct capability behavior.

Problem

capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a measure feature for focused cycle measurements. Those are necessary, but they do not answer the product-level question: can capOS remain effective on common workloads while preserving its capability model?

Generic OS benchmark suites are useful but dangerous in this project. Most assume POSIX process, file, pipe, socket, and shell semantics. capOS should not fake broad ambient Unix authority just to run a familiar benchmark. It also should not compare a capability-native path against Linux, FreeBSD, or a microkernel by publishing a single blended score that hides unsupported semantics, incorrect outputs, or different isolation boundaries.

The benchmark system needs to produce three kinds of evidence:

  • Primitive cost: capability calls, IPC, scheduling, park waits, VM changes, process creation, memory copy, and later device I/O.
  • Common workload adequacy: database, compression, build, network, storage, shell/session, service graph, and runtime workloads that users recognize.
  • Correctness under load: workload outputs, service boundaries, capability denial paths, and data integrity must remain correct while performance is measured.

Current State

Implemented measurement and comparison hooks:

  • make run-measure builds a separate measurement kernel feature and boots system-measure.cue.
  • kernel/src/measure.rs records benchmark-only dispatch counters and cycle segments for ring processing, SQE validation, cap lookup, Cap’n Proto encode/decode, method body dispatch, CQE posting, and waiter wake checks.
  • The measurement manifest grants ring-nop a measurement-only NullCap and ParkBench capability through ProcessSpawner.
  • demos/ring-nop measures CAP_OP_NOP, empty and small NullCap calls, and compact-versus-generic park-shaped operations.
  • demos/thread-lifecycle measures private ParkSpace failed wait, empty wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.
  • make run-smp-process-scale boots a focused SMP proof manifest under QEMU/KVM, runs 1/2/4 independent prime-counting worker-process cases, verifies aggregate prime count and checksum, records raw serial logs and CSV rows under target/smp-process-scale/, and enforces the completed milestone’s default five-run 1.6x median 1-to-2 speedup threshold when KVM evidence is available.
  • tools/linux-smp-process-scale-baseline.sh builds a tiny Linux initramfs and runs the same forked prime-counting workload under the same QEMU/KVM CPU and memory envelope for reference-OS comparison. Its split defaults must stay in sync with the capOS SMP process-scale workload before a comparison table is published.
  • make run-linux-thread-scale-baseline runs the in-process thread-scale fixed-size checksum workload as a native Linux pthread baseline, recording worker-window and total pthread timings plus compact-versus-padded result slot diagnostics under target/linux-thread-scale/.
  • make run-smoke, make run-spawn, make run-net, and focused service smokes provide correctness and user-visible behavior proofs, but they do not yet emit structured performance results.

Planned CPU-scaling profiles should prefer uniform fixed-size chunk work, such as parallel hashing/checksum over disjoint buffers, when the claim is near-linear scheduler/runtime scaling. Prime counting is retained as historical multi-process evidence, but its trial-division cost requires tuned partitioning and is a weaker default for same-process thread scaling.

The next full-SMP CPU profile should not use nested QEMU as its primary performance source. QEMU/KVM remains useful for boot, CI, and virtualization comparisons, but a 16/32-core scheduler result needs direct capOS execution on a dedicated perf runner or bare-metal/cloud-bare-metal machine, with native Linux baselines on the same hardware. The report should include 1, 2, 4, 8, 16, and 32-worker rows where hardware exists, and separate SMT rows from physical-core rows.

That is enough for local dispatch decisions. It is not enough for comparing capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS baselines on common workloads.

Design Principles

  1. Correctness gates first. A benchmark result is publishable only when the workload’s output verifier passes and capOS-specific authority checks still hold.
  2. No semantic laundering. Unsupported POSIX features are reported as unsupported or not applicable, not silently emulated through broad authority.
  3. Benchmark artifacts are not normal metrics. Always-on monitoring may expose low-cost counters. Benchmark logs, raw samples, host configuration, and per-run outputs are retained as explicit benchmark artifacts.
  4. Compare like mechanisms where possible. Compare capOS capability IPC to Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic differences are declared in the result.
  5. Use common suites as references, not design masters. lmbench, UnixBench, fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC CPU are valuable precedent. capOS should adopt their methodology where it fits and reject assumptions that would distort capOS.
  6. Publish raw context. Results include kernel commit, manifest, QEMU command, CPU model, host OS, compiler, build flags, feature flags, warmup, run count, and raw logs.
  7. Separate hosted and native comparisons. Early capOS runs in QEMU. Compare against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately against native host OS runs when the question is absolute hardware performance.
  8. Regression gates are narrower than claims. CI gates should catch local regressions in stable paths. Public OS comparisons need controlled machines, repeated runs, and manual review.
  9. Security posture is part of the result. A fast result that requires a broader cap bundle, disabled validation, payload tracing, or a special kernel build must be labeled as such.
  10. No single score. capOS should publish a matrix of workload results and ratios, not an aggregate score that implies all workloads matter equally.

Benchmark Tiers

Tier 0: Existing Correctness Smokes

Tier 0 is not a performance suite. It is the mandatory correctness floor:

  • default boot/login/shell smoke;
  • focused spawn, shell, terminal, credential, login, chat, adventure, revocable-read, memory-object, ringtap, networking, and measurement smokes;
  • host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and runtime surface checks.

No performance result should be retained when the relevant Tier 0 proof fails.

Tier 1: capOS-Native Primitive Benchmarks

These benchmarks measure the cost of capOS mechanisms directly:

AreaInitial measurementsCorrectness condition
Ring transportCAP_OP_NOP, empty NullCap, small payload NullCap, CQE postexpected CQE result, no overflow, bounded dropped count
Cap dispatchcap lookup, generation rejection, revoked cap rejection, invalid methodcorrect CAP_ERR_* or CapException
IPCendpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/movereply payload and transferred-cap identity match oracle
Park/threadingfailed wait, timeout, wake-one, wake-many, wake-to-resumewaiter count and join status match oracle
Schedulercontext switch latency, timer wake latency, direct IPC handoff latencyno runnable-thread loss or unexpected starvation
Process lifecyclespawn, ELF load, wait, failed spawn rejectionchild output and exit code match manifest oracle
VM/memorymap/protect/unmap, MemoryObject map, frame allocation/freedata visibility, W^X, quota, and cleanup checks pass
Terminal/sessionreadLine/write latency and throughput under foreground ownershipecho/cancellation/stale-input checks pass

These are capOS results first. Linux or FreeBSD baselines can use matching native mechanisms, but the report must describe the mapping. For example, a capOS endpoint IPC round trip can be compared with Linux pipe, Unix-domain socket, eventfd, or futex ping-pong results, but none is a perfect semantic match.

Tier 2: Translated OS Microbenchmarks

lmbench and UnixBench are useful because they isolate OS primitives such as system-call overhead, process creation, context switching, pipes, networking, and filesystem reads. They are also Unix-shaped.

capOS should implement a capos-osbench harness that translates the benchmark intent into capability-native operations:

  • fork/exec/wait intent becomes ProcessSpawner.spawn plus ProcessHandle.wait.
  • pipe throughput/context switching becomes Endpoint or a future byte-stream or socket capability round trip, labeled by transport.
  • getpid syscall overhead becomes a minimal kernel fact cap or CAP_OP_NOP, labeled as “capOS ring entry” rather than “POSIX syscall”.
  • file reread and mmap benchmarks remain unsupported until Store/Namespace and file-backed mappings exist.
  • networking tests map to TcpSocket/TcpListener once the Telnet and socket capability work lands.

The translated suite must emit not_applicable for missing capability subsystems instead of adding compatibility shims that change the OS being measured.

Tier 3: Portable Common Workloads

These benchmarks answer whether capOS is useful on recognizable work:

WorkloadCandidate benchmarkcapOS prerequisiteResult verifier
SQLite databaseSQLite speedtest1, optionally via a Phoronix profile on reference OSesC runtime or native port, Store/Namespace or RAM-backed DBSQLite exit status, optional SQL result checksum
OLTP databaseTPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are metdurable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver modelcommitted transaction counts, invariant checks, ACID/error-injection proof
Decision-support databaseTPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are metSQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifierquery answer hashes, load status, scale factor, refresh/query stream status
Key-value servingYCSB-style read/update/scan/insert mixesStore/Namespace, KV service, stable client driveroperation counts, latency distribution, value/hash verifier
Storage engineRocksDB/LevelDB db_bench-style fill/read/overwrite/seek profilesfile/store semantics, fsync/sync policy, storage engine portkey/value integrity, database reopen, configured write durability
Compressionxz, zstd, or small native compressor corpusC/Rust userspace runtime and file/store accesscompressed output hash and decompression hash
Build/developer workloadsmall Rust/C package build, later IX package buildprocess spawning, Store/Namespace, toolchain supportoutput artifact hash and build log status
Network throughputiperf3-equivalent TCP stream and request/response latencyTcpSocket, network harnessbyte count, JSON/structured summary, peer checksum
Storage I/Ofio-equivalent sequential/random read/write, verify modeblock device, Store/Namespace, direct I/O policyfio-style verify/checksum result
File serviceSPECstorage-inspired workload profilenetwork filesystem or capOS file-service equivalent, durable storage, client load generationthroughput, response time, data integrity
Java/server runtimeSPECjbb 2015 or Renaissance-inspired profilesJVM or Java compatibility profile, timers, threads, networking/storage as neededbenchmark verifier and SLA/throughput summary
HTTP servicewrk-style request load against a capOS HTTP serviceTCP, HTTP service, stable response corpusresponse checksum/status mix, latency distribution, error rate
Cloud servicesCloudSuite-inspired data caching/serving/search/web profilesmulti-service graph, storage/network/runtime supportworkload-specific answer checks and service SLOs
MicroservicesDeathStarBench/TailBench-inspired tail-latency profilesservice graph, network or local RPC, load generator, tracing/status capsrequest correctness, p95/p99 latency, no unauthorized cap exposure
ML storageMLPerf Storage-inspired data feeding profilehigh-throughput storage path, dataset loader, accelerator or simulated training readerrecords/images delivered, latency/throughput, data checksum
ML inference/trainingMLPerf-inspired inference/training profilemodel runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harnessaccuracy/quality target plus throughput or time-to-train
Shell/sessionboot-to-shell, Telnet shell, command launch latencycurrent shell plus terminal/socket pathtranscript oracle and authority denial checks
Service graphchat/adventure/resident service loadshared-service demosscripted transcript and service identity checks
Runtime/libraryGo/Lua/Wasm micro and app kernelsrelevant runtime proposal milestoneslanguage-level test suite or checksum oracle

Early capOS should start with RAM-backed variants where storage is not ready, but those results must be labeled as memory-backed. A RAM-backed database result does not compare to a Linux disk-backed SQLite result.

Industry benchmark families belong later than SQLite speedtest and simple compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system references with strict workload, disclosure, pricing, and correctness expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring similar setup and disclosure obligations in their domains. capOS can use inspired profiles to exercise the same workload classes before it can make official or directly comparable claims, but reports must label them as such and state which upstream rules are not yet satisfied.

Tier 4: User-Story Benchmarks

User-story benchmarks measure complete workflows that a person, operator, or service owner would recognize. They are intentionally broader than a single primitive or portable benchmark profile, and they should be described by the user outcome they prove rather than by the current demo implementation.

Initial user stories:

StoryExample capOS proofResult verifier
Start a local sessionboot to an interactive shell or terminal prompttranscript reaches ready prompt with expected cap bundle
Authenticate and receive authorityanonymous session upgrades to an operator/session profilewrong credential denied, right credential grants exact profile
Run a delegated tasklaunch a child process with a narrow cap bundlechild output, exit code, and denied extra authority match oracle
Use a remote terminalhost-local TCP terminal reaches the same shell/session modelconnect, authenticate, run command, clean disconnect
Use a resident serviceclient talks to a long-running service through scoped authorityrequest/reply transcript and service-visible identity match oracle
Serve a network requestnetwork-facing service handles requests while local work continuesresponse checksum, latency, and no unauthorized cap exposure
Complete a developer workflowbuild or transform an artifact from declared inputsoutput hash, logs, and resource profile match declared policy
Recover from expected failureservice fault, rejected grant, timeout, or restart pathfailure is bounded, audited, and visible through status

User-story results report latency distribution, success rate, resource usage, and authority outcome. They are the closest evidence for “effective on common workloads,” but they are not substitutes for primitive measurements when a regression appears.

Reference Operating Systems

Initial comparisons should use these environments:

ReferenceWhy include itCaveat
Linux guest under same QEMU/KVM flagsStable baseline with broad benchmark supportLinux has mature drivers, filesystems, VM, scheduler, and libc
FreeBSD guest under same QEMU/KVM flagsSecond mature Unix-like baseline, useful for POSIX-independent signalNot every benchmark profile has equal FreeBSD support
Linux native hostShows absolute host hardware ceilingNot directly comparable to capOS-in-QEMU latency
seL4 or Genode reports/scenariosPrior art for capability/microkernel IPC and service decompositionOften not the same hardware, workload, or application stack

The default published table should show capOS versus Linux guest first. Native host and external microkernel data belong in separate context columns, not the primary ratio.

Correctness Model

Every benchmark definition carries:

  • expected input corpus hash;
  • command or manifest used to run the workload;
  • output verifier;
  • allowed nondeterminism, such as timestamps or generated IDs;
  • capOS authority profile;
  • unsupported-feature policy;
  • result parser version.

A result is invalid when:

  • the output verifier fails;
  • QEMU exits abnormally;
  • the kernel panics or reports an unexpected fault;
  • the benchmark had to grant broader authority than its declared profile;
  • host logs show dropped records that invalidate the measurement;
  • the run used a special fast path not available in the declared configuration;
  • the reference OS result used a materially different workload size or dataset.

Correctness should be stored alongside the performance value. A fast failed run is not a slow successful run; it is no result.

Measurement Method

Controlled runs should use:

  • fixed capOS commit, reference OS image hash, benchmark source hash, compiler version, and toolchain flags;
  • fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG mode, disk image type, and network backend;
  • for direct-hardware SMP runs, fixed machine identity, firmware version, APIC mode, CPU topology, SMT state, frequency governor or fixed-frequency policy, isolation policy, memory size, storage/network devices when relevant, and bare-metal versus cloud-bare-metal provider details;
  • warmup runs for workloads with caches, JITs, connection setup, or first-use allocation;
  • at least 5 measured runs for primitive and user-story benchmarks, more when coefficient of variation is high;
  • median, min, max, standard deviation, and p95/p99 for latency where sample count supports it;
  • raw logs retained for the benchmark artifact;
  • no performance claim from one isolated run unless explicitly labeled as a smoke measurement.

Kernel-internal cycle-counter measurements remain inside cfg(feature = "measure") and are used for relative path decisions. Focused benchmark demos may use user-mode cycle counters when the result is explicitly labeled and the workload remains correctness-gated; run-smp-process-scale uses a scaled worker-side cycle count because the 100 Hz timer tick is too coarse for the selected speedup gate. Wall-clock user-story and workload comparisons use host-side timestamps around QEMU transcripts or in-guest monotonic timers when the timer contract is adequate.

Result Schema

The benchmark harness should emit a structured artifact, not a free-form log:

enum BenchmarkStatus {
  passed       @0;
  failed       @1;
  unsupported  @2;
  invalid      @3;
}

struct BenchmarkResult {
  runId          @0 :Text;
  benchmarkName  @1 :Text;
  tier           @2 :UInt16;
  status         @3 :BenchmarkStatus;
  correctnessId  @4 :Text;
  configHash     @5 :Data;
  artifactHash   @6 :Data;
  notes          @7 :Text;

  result :union {
    measurement @8 :MeasurementSummary;
    failure     @9 :RunFailure;
    unsupported @10 :RunFailure;
    invalid     @11 :RunFailure;
  }
}

struct MeasurementSummary {
  unit           @0 :Text;
  lowerIsBetter  @1 :Bool;
  median         @2 :Float64;
  p95            @3 :Float64;
  samples        @4 :List(Float64);
}

struct RunFailure {
  reason  @0 :Text;
  detail  @1 :Text;
}

This schema is conceptual. It should not be added to schema/capos.capnp until a concrete benchmark-runner service exists. The important property is that measurement values exist only in the passed/publishable branch; failed, unsupported, and invalid runs carry reasons instead of zero-valued scalar defaults. Before that, host scripts can emit JSON with the same shape.

Integration With System Monitoring

System Monitoring should expose operational state; the benchmark system should store explicit run artifacts. The overlap is narrow:

  • benchmark runs may read scoped MetricsReader, SystemStatus, RingStats, SchedStats, and later device stats before and after a run;
  • benchmark summaries may be imported into a metrics service as low-cardinality gauges such as benchmark.last_median_ms, keyed by benchmark name and profile, after validation;
  • raw samples, transcripts, QEMU logs, host environment, and correctness evidence belong in a BenchmarkStore or CI artifact store, not in always-on metrics;
  • starting a privileged benchmark profile is an auditable event because it may require measurement-only caps, debug taps, or broad status readers;
  • benchmark readers should receive scoped read-only caps, not global monitoring roots.

The existing system-monitoring-proposal.md boundary remains correct: cycle-counter instrumentation stays behind measure, while cheap counters can later graduate into narrow stats caps.

External Grounding

Relevant local design grounding:

  • docs/build-run-test.md
  • docs/status.md
  • docs/proposals/system-monitoring-proposal.md
  • docs/architecture/capability-ring.md
  • docs/architecture/park.md
  • docs/architecture/scheduling.md
  • docs/research/sel4.md
  • docs/research/zircon.md
  • docs/research/genode.md
  • docs/research/out-of-kernel-scheduling.md

External sources checked:

  • USENIX lmbench paper page: https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis
  • fio documentation: https://fio.readthedocs.io/en/master/fio_doc.html
  • iperf3 documentation: https://software.es.net/iperf/
  • SPEC CPU 2017 overview and run rules: https://www.spec.org/osg/cpu2017/ and https://www.spec.org/cpu2017/Docs/runrules.html
  • Byte UnixBench repository: https://github.com/kdlucas/byte-unixbench
  • SQLite testing documentation and OpenBenchmarking SQLite speedtest profile: https://www.sqlite.org/testing.html and https://openbenchmarking.org/test/pts/sqlite-speedtest
  • TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions: https://www.tpc.org/information/benchmarks5.asp, https://www.tpc.org/tpcc/default5.asp, https://www.tpc.org/tpch/default5.asp, and https://www.tpc.org/tpcds/
  • YCSB and storage-engine benchmark references: https://hse-project.github.io/apps/ycsb/, https://github.com/facebook/rocksdb/wiki/Benchmarking-tools, and https://github.com/google/leveldb
  • SPECjbb 2015, Renaissance, and HTTP service benchmark references: https://www.spec.org/jbb2015/, https://renaissance.dev/, and https://github.com/wg/wrk
  • Cloud/service benchmark references: https://github.com/parsa-epfl/cloudsuite, https://github.com/delimitrou/DeathStarBench, and https://tailbench.csail.mit.edu/
  • Storage and ML benchmark references: https://www.spec.org/storage2020/, https://mlcommons.org/working-groups/benchmarks/storage/, https://mlcommons.org/benchmarks/training/, and https://docs.mlcommons.org/inference/index_gh/
  • OpenBenchmarking test-suite/profile descriptions: https://openbenchmarking.org/suites/ and https://openbenchmarking.org/tests

The relevant lessons are straightforward:

  • lmbench isolates OS primitives from larger application behavior and was explicitly used to compare system implementations.
  • fio and iperf3 provide flexible, parameterized I/O and network workload models with machine-readable output and verification options.
  • SPEC CPU’s run rules show why disclosure, correct output, and configuration control matter when publishing comparative results.
  • UnixBench is useful as a historical system benchmark, but its own workload descriptions reveal Unix assumptions that capOS must translate carefully.
  • SQLite speedtest is a recognizable application workload with broad public baseline data, but database benchmarking must distinguish RAM-backed and storage-backed results.
  • TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later OLTP and decision-support database claims, but capOS should treat early runs as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure requirements.
  • YCSB and db_bench are useful earlier data-system pressure tests because they can exercise key-value, read/write mix, and storage-engine behavior before capOS has a full SQL system.
  • SPECjbb and Renaissance become relevant only when a Java profile exists; until then they are runtime targets, not near-term OS benchmarks.
  • CloudSuite, DeathStarBench, and TailBench are good references for cloud, microservice, and tail-latency user stories, but they require a mature service graph, load generation, and workload-specific correctness checks.
  • SPECstorage and MLPerf Storage are later storage references once capOS has durable storage and enough client/load infrastructure to avoid misleading fio-only claims.
  • MLPerf inference/training is relevant only after model runtimes and accelerator or CPU-baseline execution are credible, and any result must carry the benchmark’s accuracy or quality target rather than only throughput.
  • OpenBenchmarking/Phoronix-style test profiles are useful precedent for packaging benchmark definitions separately from result storage.

Implementation Plan

  1. Structured parser for current run-measure. Add a host parser that converts existing measure: and demo output lines into JSON artifacts with config hash, raw log path, and verifier status.

  2. Primitive benchmark manifest set. Split ring, park, IPC, process, VM, and scheduler benchmarks into focused manifests so each can be repeated independently without running unrelated demos.

  3. Reference guest harness. Add Linux guest scripts that run equivalent primitive tests under the same QEMU/KVM settings. Keep these scripts outside the capOS boot image. Partially done for the SMP process-scale proof through tools/linux-smp-process-scale-baseline.sh; future benchmark profiles need their own reference guest harnesses or explicit unsupported status.

  4. Translated OS microbench suite. Implement capos-osbench for the subset of lmbench/UnixBench intents that capOS can represent honestly. Emit unsupported results for missing Store, file, mmap, and socket primitives until those subsystems exist.

  5. Common workload pilots. Start with workloads that can be made deterministic early: compression, SQLite speedtest against RAM-backed storage once Store exists, shell/session latency, and remote-terminal user-story latency after the current milestone.

  6. Network and storage workloads. Add iperf3/fio-equivalent profiles only after socket and block/storage capabilities exist. Use verification modes for write workloads.

  7. Benchmark store and monitoring bridge. Add a BenchmarkStore service or CI artifact convention. Import only validated summary values into monitoring metrics, and audit privileged benchmark starts.

  8. Regression gates. Add narrow CI thresholds for stable primitive paths. Use review-only warnings for noisy or hardware-dependent workloads until enough history exists.

  9. Cloud-VM rerun profile. After the first real cloud-VM boot path exists, rerun the benchmark profiles that are valid for the booted hardware surface. At minimum, retain separate cloud evidence for boot/session smokes and CPU-only profiles such as run-smp-process-scale and later run-thread-scale, recording provider, region, instance type, cloud image id, firmware/device model, CPU topology, SMT state, QEMU pinning/isolation policy, nested-KVM availability, and serial-console collection method. Cloud results are separate environments; they do not replace local QEMU/KVM proof gates unless a milestone explicitly changes that gate. A shape like n2-highcpu-8 is credible for 1/2/4-vCPU CPU-only profiles if /dev/kvm is available to the benchmark user and the run records the exact CPU platform and topology.

  10. Full-SMP hardware profile. Add a profile for direct 16/32-core scheduler evidence. It should reuse the parallel-pattern plan rather than inventing one checksum-only workload: static map/reduce, dynamic task pool, barrier phase loop, independent processes, same-process threads, and one service/capability-call workload. The artifact should report work-window and total-time medians, variance, verifier output, speedup, efficiency, scheduler counters, and matching native Linux rows on the same hardware. QEMU rows may accompany the report only as separate virtualization or regression context.

Reporting Format

Published reports should include:

  • executive table with benchmark, status, unit, capOS median, Linux guest median, ratio, and notes;
  • separate sections for primitive, common workload, and user-story results;
  • correctness summary with failed/unsupported/invalid runs;
  • configuration appendix with hashes and QEMU commands;
  • raw artifact links;
  • explicit warning for benchmark-only builds, debug tap runs, or special caps.

Do not publish a capOS “system score.” The useful output is a workload matrix with enough context to explain the result.

Non-Goals

  • No POSIX compatibility layer purely to run Unix benchmarks.
  • No public comparison that treats unsupported workloads as zero performance.
  • No single aggregate score.
  • No benchmark-only fast paths in normal dispatch builds.
  • No always-on cycle-counter tracing.
  • No network result publication before the network path has correctness and authority proofs.
  • No storage result publication before write verification and crash/error semantics are defined.

Open Questions

  • Which Linux primitive baselines should be first-class: pipe, Unix socket, futex, eventfd, io_uring, or all of them?
  • Should the benchmark store be a capOS service, a host CI artifact convention, or both?
  • What variance threshold should turn a benchmark from a CI gate into a review-only signal?
  • How should reference OS images be pinned and distributed without bloating the repository?
  • Which cloud provider and instance shape should be the first benchmark rerun target after capOS boots outside local QEMU/KVM? A GCE n2-highcpu-8 host is a plausible first nested-KVM target for CPU-only profiles, but the final choice should follow the first cloud boot path that can expose /dev/kvm and usable serial-console artifacts.
  • What is the earliest honest SQLite storage profile: RAM-only, MemoryObject backed, Store-backed, or block-backed?
  • Should benchmark definitions be modeled as manifest fragments, host-side YAML/JSON, or capOS service objects?

Proposal: HPC Parallel Processing Patterns

capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.

This proposal extends System Performance Benchmarks. It also defines the workload matrix that the future full-SMP scalability milestone tracked in Scheduler Evolution Phase F.5 should use when capOS is ready for 16/32-core evidence on top of the SMP substrate and the in-process threading contract. The old single checksum workload remains useful as one static map/reduce row, but it is too narrow to stand in for broad multicore behavior.

Design Grounding

Local grounding:

External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.

Current capOS Benchmark Analysis

Current CPU-scaling evidence is useful but narrow:

  • make run-smp-process-scale exercises independent worker processes under QEMU/KVM. Its prime-counting workload is static partition plus final verification. Current rows reach 1, 2, and 4 vCPUs, plus one 8-logical-CPU SMT row on a 4-core/8-thread host.
  • make run-thread-scale uses a fixed-size checksum workload with per-thread rings and guest phase counters. The strongest current row records capOS 1-to-4 work/total speedups 3.088x / 2.700x under QEMU/KVM, while the matching Linux pthread baseline on the same host and pin set records 3.974x / 3.850x.
  • Native Linux pthread baselines show the checksum shape can scale on the benchmark host, but also expose coordinator and oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead.
  • Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.

So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication. It also does not yet produce direct-hardware 16/32-core rows, which is the bar that Scheduler Evolution Phase F.5 sets for full-SMP scalability evidence on top of the SMP bring-up substrate.

Goals

  • Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
  • Keep correctness and authority gates ahead of speed claims.
  • Provide single-node pattern kernels before multi-node transport exists.
  • Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
  • Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.

Non-Goals

  • Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
  • Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
  • Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
  • Do not count SMT diagnostics as core-count scaling evidence.
  • Do not add benchmark-only fast paths to normal kernel dispatch.

Pattern Coverage Matrix

Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants stay future until network transport and distributed-capability authority are explicit.

PatternSingle-node kernelMulti-node shapeVerification
Static map/reducesplit fixed-size byte/block ranges across threads or processesscatter chunks, local compute, reduce rootdeterministic root hash or numeric reduction
Dynamic task poolvariable-cost tasks in a bounded deque or queuework requests between nodes or delegated task shardsall task ids completed once, result hash, cancellation proof
Barrier phase looprepeated phase computation with a barrier between phasesbarrier across ranks or servicesphase count, no early phase observation
Prefix/scanper-thread prefix over numeric blocksdistributed scan over rank partitionsprefix checksum and boundary carry checks
Stencil/halo1D/2D/3D grid update with neighbor halo buffershalo exchange between rank partitionsfinal grid checksum and boundary oracle
Dense tiled computetiled matrix multiply or small LU-like update2D block-cyclic tile distributionmatrix checksum/residual bound
Sparse iterative computeCSR-like sparse matrix-vector plus dot productspartitioned sparse rows with global reductionsresidual/checksum and iteration count
FFT/transposestaged local FFT-like butterflies plus matrix transposeall-to-all transpose between ranksoutput checksum against reference
Sort/partitioninteger bucket partition plus local sortsample/splitter exchange and all-to-all bucketssortedness, permutation checksum
Graph frontierBFS-like frontier over synthetic graphdistributed frontier exchangeparent tree/level validation
Pipeline/streambounded producer/stage/consumer service graphservice pipeline across nodesordered records, backpressure, no dropped records
Collective-onlybarrier, broadcast, gather, scatter, reduce, allreducesame operations over networked rankscollective-specific oracle and timeout behavior

Proposed Stages

Stage 0: Keep Current CPU Rows Explainable

Keep make run-thread-scale as the fixed-size checksum workload so historical rows remain comparable. Add new pattern targets alongside it rather than changing the meaning of the existing table. The benchmark page should describe the workload, timed region, verifier, environment, and limitations directly.

Stage 1: Single-Node Pattern Kernels

Add a parallel-patterns demo crate and host harness that can run small single-node kernels under one process and multiple worker processes. Workers should be expressible both as same-process threads via the in-process threading contract and as independent processes over the SMP substrate, so the rows distinguish thread-local pick/wake costs from process/IPC boundaries. These are the first rows needed for the future full-SMP hardware profile that Scheduler Evolution Phase F.5 treats as its 16/32-core success bar:

  • static_reduce: successor to the checksum workload, reusable as the sanity baseline.
  • dynamic_pool: uneven task sizes to force runtime scheduling and fairness.
  • barrier_loop: repeated phases to expose barrier and wakeup overhead.
  • scan: prefix computation to exercise ordered fan-in/fan-out.
  • stencil_2d: shared-buffer or private-buffer halo copies inside one node.

Each kernel prints compact structured lines with pattern, workers, cpus, input_class, verified, work, total, and relevant counters. Host harness summaries must keep raw logs under target/parallel-patterns/. The hardware profile should run these kernels at 1, 2, 4, 8, 16, and 32 workers when the machine has enough physical cores, with SMT rows separated.

Stage 2: Memory And IPC Intensive Kernels

After MemoryObject/shared-buffer and IPC paths mature, add:

  • sparse_spmv: CSR-style row partition with deterministic matrix generator;
  • graph_bfs: synthetic graph frontier with visited-set validation;
  • sort_bucket: bucket partition, prefix counts, local sort, and merge verification;
  • pipeline_stream: bounded service stages with backpressure telemetry.

These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs. The thread form follows the in-process threading contract; the process and service forms exercise the cross-CPU wake, migration, and stale-context paths that Scheduler Evolution Phase F.5 expects to harden on top of the SMP substrate.

Stage 3: Capability-Native Collectives

Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:

  • barrier;
  • broadcast;
  • scatter/gather;
  • reduce/allreduce;
  • scan;
  • all-to-all for fixed-size blocks.

Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.

Stage 4: Multi-Node Harness

After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:

  • rank membership is represented by explicit capabilities;
  • transport authority is scoped to the benchmark group;
  • result collection includes per-node raw logs and topology metadata;
  • failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.

Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.

Authority And Safety Rules

  • A benchmark group cap grants participation, not ambient network or process authority.
  • Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
  • Every kernel must have bounded allocation, queue, and message sizes.
  • Timeouts and cancellation are correctness paths, not harness afterthoughts.
  • Result verification must fail closed before speed summaries are accepted.
  • Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.

Reporting Format

Pattern results should extend the existing benchmark artifact conventions:

  • source commit, manifest, input class, worker/rank count, CPU count, run count;
  • host, QEMU/KVM, pinning, SMT/core topology, and network topology;
  • capOS authority profile and any benchmark-only feature flags;
  • per-run raw logs and results.csv;
  • median work and total windows, plus variance;
  • verifier status and reason for any not_applicable or diagnostic result;
  • comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.

Near-Term Recommendation

Do not start with HPL, HPCG, or Graph500 ports. Start with a small capOS-native parallel-patterns harness after the current thread-scale milestone closes. The first five kernels should be static_reduce, dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens coverage from “static independent chunks” to synchronization, irregular scheduling, ordered reductions, and neighbor exchange while staying within single-node capOS mechanisms.

When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite. Pattern adoption stays paced by the substrate it exercises: Scheduler Evolution Phase F.5 gates the 16/32-core single-node rows, the SMP proposal gates the per-CPU substrate those rows depend on, and the in-process threading contract gates the same-process worker forms each pattern kernel needs.

Proposal: Scientific Standard Package And Agent Lab Capabilities

capOS should eventually ship a curated scientific standard package: a capability-scoped service graph that gives agents and users high-level access to computer algebra, numerical computing, solvers, formal proof systems, notebooks, reproducible package environments, and experiment records.

This is not a request to turn the kernel into a scientific runtime. The kernel still provides capability tables, address spaces, scheduling, IPC, memory, device, and storage primitives. The scientific package lives in userspace, above package, workspace, job-graph, model, and broker services.

Design Grounding

Local grounding:

External grounding is summarized in the research notes and covers PARI/GP, SageMath, GAP, Singular, OSCAR, SymPy, SciPy, R, Octave, JupyterLab, Z3, cvc5, HiGHS, SCIP, OR-Tools, JuMP, CVXPY, Lean/mathlib, Rocq, Isabelle, Agda, Spack, Guix-HPC, Nix, Apptainer, Linux namespaces/cgroups/seccomp/Landlock, User-Mode Linux, gVisor, QEMU/KVM, Firecracker, Kata Containers, and Linux CPU isolation/housekeeping.

Goals

  • Give users, agent runners, and batch services high-level scientific capabilities without granting an unrestricted shell.
  • Make exact computation, numerical computation, optimization, SMT solving, and proof checking ordinary capOS services with explicit authority.
  • Preserve reproducibility: package closure, input data, seed, backend, version, timeout, quota, output, and audit metadata travel with every result.
  • Reuse mature upstream tools wherever possible.
  • Support both interactive research and unattended agent jobs.
  • Keep tool authority separate from model inference. Models propose; trusted capOS runners execute through broker policy.

Non-Goals

  • Do not invent a replacement for SageMath, PARI, GAP, Singular, OSCAR, SciPy, Jupyter, Lean, Rocq, Isabelle, or established solvers.
  • Do not add POSIX, Docker, Conda, Nix, Guix, or Spack as ambient system authority.
  • Do not make notebook execution equivalent to shell access.
  • Do not treat SMT or CAS answers as formal proof unless a proof checker validates an artifact.
  • Do not make this package part of the active in-process threading milestone.

Package Profiles

The standard package should be split into explicit profiles so capOS can ship or grant only what a session needs.

ProfileContentsPrimary use
scientific-basePARI/GP or PARI C service, SymPy, Z3, cvc5, HiGHS, Lean checker, artifact storeLow-risk exact math, solver, and proof assistance
scientific-researchSageMath, GAP, Singular, OSCAR/Julia, R, Octave, SciPy, JuMP, CVXPY, SCIP, OR-ToolsFull interactive research workflows
scientific-notebookJupyter-compatible notebook/session service and language kernelsLiterate experiments with replayable artifacts
scientific-labExperiment registry, workspaces, job graphs, retrieval, review gates, GPU/model integrationLong-running research labs with users, agents, and review workflows
scientific-commercialOptional proprietary/commercial connectors such as Wolfram Engine or commercial solversExplicitly licensed site-local extensions

Profiles grant service roots, not every concrete backend cap. A user, agent runner, or batch service normally receives a ScientificSession facade that advertises only the tools and methods permitted for the current session.

Capability Surface

Catalog And Environment

  • ScientificCatalog: lists installed profiles, backend identities, supported interfaces, licenses, package closures, and known reproducibility caveats.
  • PackageCatalog: resolves named package environments to content-addressed closures.
  • PackageClosure: immutable description of packages, build inputs, toolchain versions, hashes, license metadata, vulnerability metadata, and supported CPU/GPU features.
  • Environment: starts a bounded interpreter, solver, proof, notebook, or job process with exactly the selected closure and granted caps.

Workspaces And Artifacts

  • ResearchWorkspace: branchable namespace for source, notebooks, data, generated files, proofs, and run records.
  • ArtifactStore: immutable objects for solver inputs, proof logs, notebooks, datasets, plots, tables, binaries, and transcripts.
  • ProvenanceLog: append-only record of who or which agent produced an artifact, with model/tool/package/session metadata.
  • ExperimentRegistry: immutable run specifications plus mutable review status, labels, and publication decisions.

CAS And Mathematical Services

  • ComputerAlgebra: general symbolic manipulation facade for factorization, simplification, integration, exact linear algebra, polynomial operations, and expression normalization.
  • NumberTheory: PARI-backed exact number theory, elliptic curves, modular forms, algebraic number fields, L-functions, and related computations.
  • DiscreteAlgebra: GAP-backed group, representation, finite algebra, and combinatorics workflows.
  • PolynomialAlgebra: Singular-backed ideals, modules, Groebner bases, quotient rings, and algebraic geometry computations.
  • JuliaAlgebraKernel: OSCAR/Nemo/Hecke/AbstractAlgebra workflows for cases where a general Julia session is the correct backend.

Each method returns structured values when practical and always records the backend, package closure, input, output, elapsed time, and resource envelope.

Solvers

  • SmtSolver: typed SMT-LIB import/export, assertions, check-sat, model, unsat core, timeout, random seed, proof/certificate metadata, backend selection among Z3/cvc5 or future solvers.
  • OptimizationSolver: LP, MIP, QP, conic, CP-SAT, routing, scheduling, and nonlinear solve jobs with declared model format, backend, objective, constraints, time limit, memory limit, gap/tolerance policy, and solution status.
  • ModelingSession: JuMP/CVXPY/OR-Tools-style language session for models that need high-level construction rather than direct serialized input.

Solver calls must distinguish optimal, feasible, infeasible, unbounded, unknown, timeout, resource_exceeded, and backend_error. User-facing tools should not collapse these into a single textual answer.

Formal Proof

  • ProofCatalog: installed proof assistants, libraries, theorem indexes, and package closures.
  • ProofSession: checkout, edit, build, query goals, run tactics, run tests, and produce checked proof artifacts.
  • ProofChecker: batch verification of a named theorem or project under a pinned closure.
  • LemmaSearch: retrieval over local proof libraries, declarations, docs, and prior accepted project artifacts.

The first implementation target should be Lean plus mathlib because it is the most useful default for current agent-assisted mathematics. Rocq, Isabelle, and Agda should remain first-class future backends with separate kernels and project layouts.

Notebook And Interactive Kernel Sessions

  • NotebookDocument: immutable or branchable notebook object with cells, outputs, attachments, environment id, and execution provenance.
  • NotebookSession: starts kernels, executes cells, captures outputs, renders rich media, and gates side effects.
  • KernelSession: Python, Sage, Julia, R, Octave, Lean, GAP, PARI, or other REPL-like process with explicit workspace and package environment caps.

Notebook execution is authority-bearing. Opening a notebook for reading should not execute it. Running a notebook should prompt or use session policy for network access, package installation, writes outside the workspace, long jobs, credential access, GPU use, and publication.

Agent Lab Architecture

An LLM agent research lab on capOS should be a service graph:

flowchart LR
    User[User] --> Runner[Agent Runner]
    Runner --> Broker[AuthorityBroker]
    Runner --> Model[LanguageModel]
    Runner --> Sci[ScientificSession]
    Sci --> CAS[CAS Services]
    Sci --> Solvers[Solver Services]
    Sci --> Proof[Proof Services]
    Sci --> Notebook[NotebookSession]
    Sci --> Jobs[JobGraph]
    Sci --> Workspace[ResearchWorkspace]
    Workspace --> Artifacts[ArtifactStore]
    Jobs --> Compute[CPU/GPU/Storage/Network Caps]
    Runner --> Audit[ProvenanceLog]

The runner owns the user session and applies tool policy. The model service does not hold scientific tool caps directly. Tool calls from the model become typed proposals to the runner, and the runner invokes ScientificSession methods only when broker policy allows it.

Authority And Safety Rules

  • A tool cap grants only the named interface. NumberTheory does not imply file, shell, network, package install, or proof-publication authority.
  • Package installation and environment resolution are separate authorities from executing an already-pinned environment.
  • External network fetch is separate from local computation. Literature search, package download, model-provider calls, and dataset upload are different caps.
  • Every long-running calculation must have a job id, quota, cancellation path, and durable status.
  • GPU use requires a GPU/session cap and should record driver/runtime/kernel metadata.
  • Proof acceptance must be checker-backed. Agent confidence, CAS evidence, or SMT success is advisory unless the proof kernel accepts the artifact.
  • Published results must cite the artifact ids and package closure ids that produced them.
  • Commercial or proprietary engines must be opt-in, labeled, and grantable only through site policy.
  • Linux workload placement must distinguish ordinary resource-limited work from capOS-native auto-nohz-eligible work. Linux nohz_full inside a guest may be useful compatibility or benchmark state, but capOS CPU isolation, auto full-nohz activation, housekeeping placement, IRQ routing, and exclusive CPU use are outer scheduler-authority decisions, not options an agent tool descriptor can set by itself.

Linux Workload And Virtualization Strategy

The first implementation is likely to consume the generic Linux workload sandbox substrate for large scientific stacks. Scientific jobs should be selected by trust and compatibility class:

BackendUseBoundary claim
namespace/cgroup/seccomp/Landlock sandboxtrusted batch tools and fast command wrappersshares host Linux kernel; useful policy layer, not strong multi-tenant isolation
bubblewrap/nsjailearly command-wrapper executor for gp, solvers, proof checkers, and scriptsstructured process sandbox over Linux primitives
User-Mode Linuxdeveloper/debug fallback when KVM is unavailableLinux-as-host-process compatibility; not the main strong-isolation path
gVisorcontainer-compatible higher-risk workloadsper-sandbox application kernel reduces direct host-kernel exposure
QEMU/KVM Linux guestbroad compatibility, full distro roots, package builds, untrusted notebookshardware-backed guest kernel boundary
Firecracker or Kata-style microVMrepeated stateless solver/proof/notebook jobs with narrow device modelshardware-backed microVM boundary with smaller operational surface
dedicated host or single-tenant nodehigh-risk tenants, sensitive data, GPU/device passthrough, side-channel-sensitive jobs, long-lived browser/GUI workloadsreduces shared-host and VM-escape blast radius beyond ordinary VM tenancy

The generic LinuxWorkloadSandbox service should record backend, image/rootfs/package hashes, sandbox policy or VM device model, kernel version, CPU affinity, cgroup quota, deployment location, external-host placement metadata, capOS NoHzEligibility/NoHzActivation state for capOS-scheduled proxies or VMMs, guest tickless/nohz state, network policy, artifact inputs, artifact outputs, and exit reason. A result from a namespace sandbox and a result from a KVM guest may be functionally equivalent, but their security, scheduler, and reproducibility claims are different.

For the native capOS auto full-nohz scheduler track, scientific jobs should use the generic workload placement classes:

  • ordinary placement: cgroup v2 resource limits and optional affinity for normal solver, proof, CAS, package, and notebook jobs.
  • auto-nohz-eligible placement: explicit capOS eligibility plus CPU-time authority for low-jitter benchmark, realtime, GPU-feed, SQPOLL-like, or latency-bound workload loops. The outer capOS scheduler must know the workload’s vCPU/helper/poller threads and must also account for housekeeping CPUs, IRQ placement, timers, and deferred kernel work. Guest Linux tickless state and external Linux-host isolation state are recorded separately and do not by themselves activate capOS nohz.

Existing Solutions To Adapt

AreaAdapt firstReasonable capOS adaptation
Number theoryPARI/GPWrap gp early; use PARI C library for stable service calls later.
Broad mathSageMathHost as a Python/Sage kernel with pinned closure and notebook integration.
Discrete algebraGAPWrap CLI and package loading; later expose common group-theory methods.
Polynomial algebraSingularWrap command/batch mode; later expose polynomial/ideal operations.
Algebra researchOSCARHost Julia/OSCAR kernel; avoid flattening its object model prematurely.
Symbolic PythonSymPyEmbed in Python service for lightweight symbolic calls and code generation.
Scientific PythonNumPy/SciPyProvide Python kernel and batch-job environments with BLAS/LAPACK metadata.
StatisticsRProvide Rscript and R kernel sessions with package closure capture.
MATLAB-like workflowsGNU OctaveProvide batch and interactive kernel sessions.
SMTZ3, cvc5Provide SmtSolver with backend identity, model, core, and timeout fields.
Optimization enginesHiGHS, SCIP, OR-ToolsProvide direct solve jobs and higher-level modeling sessions.
Modeling layersJuMP, CVXPYHost Julia/Python modeling kernels and export normalized model artifacts.
Formal proofLean/mathlib first; Rocq, Isabelle, Agda laterProvide proof sessions, build logs, theorem search, and checked artifacts.
NotebooksJupyterLab modelReuse .ipynb concepts and kernels but replace ambient authority with caps.
Package closureNix, Guix, SpackIngest closures and recipes; expose capOS package catalogs and Store objects.
HPC containersApptainerUse as a Linux-sidecar compatibility bridge, not as native authority.

Staged Implementation

Stage 0: Interface-Only Design

Define schemas for ScientificSession, ArtifactStore, PackageClosure, SmtSolver, OptimizationSolver, ProofSession, and NotebookSession. No backend porting is required. The goal is to make the authority and result model reviewable.

Stage 1: Linux Sidecar Prototype

Run tools on a controlled Linux host or hardware-backed Linux guest and expose them to capOS through a capability proxy. Namespace/cgroup/seccomp/Landlock wrappers are acceptable for trusted batch tools, but untrusted notebooks, model-generated code, package builds, and multi-tenant jobs should use a QEMU/KVM guest first and Firecracker/Kata-style microVMs later. High-risk tenants, sensitive data, GPU/device passthrough, and side-channel-sensitive jobs may require single-tenant hosts instead of shared VM hosts. User-Mode Linux may remain a developer/debug fallback when KVM is unavailable, but it is not the default strong-isolation backend. This proves the API, audit, and reproducibility model before native userspace can run Python, Julia, R, and large C++ stacks.

Initial tools:

  • PARI/GP;
  • SymPy;
  • Z3 and cvc5;
  • HiGHS;
  • Lean plus mathlib project build;
  • immutable artifact store and provenance records.

Stage 2: Native Wrapper Services

When capOS userspace has the necessary binary/runtime support, add command wrapper services for gp, lean/lake, z3, cvc5, highs, Rscript, octave, gap, and Singular. Each wrapper runs with an explicit workspace, environment, timeout, and resource ledger.

Stage 3: Notebook And Language Kernels

Add Jupyter-compatible document storage and kernel-launch policy. Python, Sage, Julia, R, Octave, Lean, and GAP kernels can then run as KernelSession services with capOS-owned artifact capture.

Stage 4: Package-Closure Store

Import or build Nix/Guix/Spack-style closures into capOS Store and Namespace capabilities. Package resolution stays outside the kernel. The important kernel-visible property is that executable environments are immutable objects with explicit resource and authority grants.

Stage 5: Lab Workflow

Combine scientific sessions with hosted-agent workspaces, experiment registry, review gates, browser/literature tools, GPU/model services, and stateful job graphs. This is the point where capOS becomes a credible LLM agent research lab rather than a collection of math commands.

Open Questions

  • Should the first sidecar protocol be Cap’n Proto RPC directly, MCP through a gateway, or both?
  • Which package-closure source should capOS ingest first: Nix for breadth, Guix for scientific reproducibility, or Spack for HPC variants?
  • Which hardware-backed Linux guest backend should be first after QEMU/KVM: Firecracker for narrow batch workers, Kata-style VM containers for OCI integration, or both?
  • Which workload classes are eligible for capOS native auto full-nohz placement, and how should that map to future CpuIsolationLease, NoHzEligibility, NoHzActivation, and SchedulingContext authority?
  • How much of .ipynb should be preserved versus represented as a capOS-native notebook object with import/export?
  • Which proof artifacts can be reduced to small trusted checker inputs, and which require full project build logs for confidence?
  • How should floating-point nondeterminism and randomized solver behavior be summarized so agents do not overclaim exactness?
  • Where should license policy live: package catalog, broker policy, or both?

Near-Term Recommendation

Do not start by porting SageMath or Jupyter. Start with a small scientific-base sidecar proof:

  • NumberTheory.eval backed by PARI/GP;
  • SmtSolver.check backed by Z3 and cvc5;
  • OptimizationSolver.solve backed by HiGHS for LP/QP/MIP smoke cases;
  • ProofChecker.build backed by Lean/mathlib for a pinned project;
  • immutable artifact/provenance records for every call.

That profile gives users, agent runners, and batch services exact arithmetic, constraint checking, optimization, and formal proof validation while keeping authority narrow enough for review. SageMath, OSCAR, Jupyter, R, Octave, and full package closure support should follow after the base interfaces and audit model are credible.

Proposal: User Identity, Sessions, and Policy

How capOS should represent human users, service identities, guests, anonymous callers, and policy systems without reintroducing Unix-style ambient authority.

Status: partially implemented. The current tree has entropy-backed UserSession metadata for anonymous, operator, and guest profiles; a bootstrap CredentialStore; shell-driven login, setup, and guest profile changes; AuthorityBroker.shellBundle returning broker-issued launcher, copied session, SystemInfo, and operator-scoped service endpoint caps; and manifest seed records for local operator/guest proofs. Guest shell bundles are manifest-gated and receive no default service endpoints. Endpoint calls now keep subject details private by default and disclose only requested-and-allowed fields from cap-held service/broker disclosure scope. The broader proposal remains target design for durable account storage, external identity bindings, session logout/revocation/renewal lifecycle, quota-backed profiles, ABAC/MAC policy engines, and POSIX compatibility metadata.

Problem

capOS has processes, address spaces, capability tables, object identities, badges, quotas, and transfer rules. It deliberately does not have global paths, ambient file descriptors, a privileged root bit, or Unix uid/gid authorization in the kernel.

Interactive operation still needs a way to answer practical questions:

  • Who is using this shell session?
  • Which caps should a normal daily session receive?
  • How does a service distinguish Alice, Bob, a service account, a guest, and an anonymous network caller?
  • How do RBAC, ABAC, and mandatory policy fit a capability system?
  • How does POSIX compatibility expose users without letting uid become authority?

The answer should keep the enforcement model simple: capabilities are the authority. Identity and policy decide which capabilities get minted, granted, attenuated, leased, revoked, and audited.

Design Principles

  • user is not a kernel primitive.
  • uid, gid, role, and label values do not authorize kernel operations.
  • A process is authorized only by capabilities in its table.
  • Authentication proves or selects a principal; it does not itself grant authority.
  • An account is a durable local record for a principal; it is not a running subject.
  • A session is a live policy context with selected policy and resource profiles that receives a cap bundle.
  • A workload is a process or supervision subtree launched with explicit caps.
  • POSIX user concepts are compatibility metadata over scoped caps.
  • Guest and anonymous access are explicit policy profiles, not missing policy.
  • External roles, groups, claims, and local roles are broker inputs, not authority after the corresponding caps are absent.

Concepts

Principal

A principal is a durable or deliberately ephemeral identity known to auth and policy services. It is useful for policy decisions, ownership metadata, audit records, and user-facing display. It is not a kernel subject.

Examples:

  • human account
  • operator account
  • service account
  • cloud instance or deployment identity
  • guest profile
  • anonymous caller
  • pseudonymous key-bound identity

The schema excerpt below is proposal-level shape. Where the interfaces already exist in schema/capos.capnp, the ordinals shown here must match the checked-in schema; future methods must be assigned from the next free ordinal when the schema is actually extended.

enum PrincipalKind {
  human @0;
  operator @1;
  service @2;
  guest @3;
  anonymous @4;
  pseudonymous @5;
}

struct PrincipalInfo {
  id @0 :Data;             # Stable opaque ID, or random ephemeral ID.
  kind @1 :PrincipalKind;
  displayName @2 :Text;
}

PrincipalInfo is intentionally descriptive. Possessing a serialized PrincipalInfo value must not grant authority.

Federated authentication uses a canonical external subject key: hash(providerKind, issuer, tenant, subject). For OIDC, issuer is iss, subject is sub, and tenant is the normalized tenant or configured empty tenant. sub alone is not unique across IdPs and must not be used directly. Admission policy either maps that external key to an existing local principal through an ExternalIdentityBinding or admits it as a pseudonymous principal under an explicit policy/resource profile pair. PrincipalKind covers the resolved local principal through human / operator / service / pseudonymous depending on deployment intent; a federated service account is service, a federated human is human, and a federated ephemeral identity with no stable person behind it is pseudonymous. The OIDC integration details live in OIDC and OAuth2.

User

user is a user-facing category for a principal/session that represents a human or human-adjacent actor. It is not a kernel object, not a UID, and not an authority source. Use principal, account, session, or workload when one of those narrower concepts is meant.

Account

An account is a durable local record for a principal. It binds credential references, status, roles, attributes, storage roots, quotas, and default policy/resource profile names. Some principals deliberately have no account: anonymous callers, some guests, and some one-shot external sessions.

Accounts do not run and do not hold capabilities. Session creation reads an account record, manifest seed record, or external admission binding, then asks a trusted broker to mint the actual CapSet for a live session or workload.

Profile

A profile is a named policy template. It contains no authority by itself.

  • A policy profile selects roles, ABAC defaults, allowed bundle fragments, approval paths, label defaults, and external admission constraints.
  • A resource profile selects storage, memory, CPU share, process/thread/cap limits, IPC limits, log volume, network posture, and launcher posture.

Use plain profile only when prose intentionally covers both policy and resource profiles.

Session

A session is a live context derived from a principal plus authentication and policy state. Sessions carry freshness, expiry, auth strength, audit identity, and selected policy and resource profiles. The selected profiles influence which caps a broker may mint and which quotas wrappers apply; the profiles are not usable authority.

AuthStrength aligns with ITU-T X.1254 Entity authentication assurance framework (= ISO/IEC 29115) level-of-assurance tiers. X.1254 defines LoA 1 (little or no confidence) through LoA 4 (very high confidence) as a composite of identity-proofing strength, credential strength, and authentication-protocol strength. capOS uses the same tiers so that policy decisions can be expressed as “require LoA ≥ 3 for ServiceSupervisor(net-stack)” without inventing parallel terminology.

# ITU-T X.1254 / ISO/IEC 29115 level-of-assurance tiers.
# `loa0` covers "no assertion" (`anonymous` sessions) and sits below
# the X.1254 lattice; the standard numbers LoA 1-4 only.
enum AuthStrength {
  loa0 @0;   # no authentication; anonymous
  loa1 @1;   # little/no confidence; self-asserted identity
  loa2 @2;   # some confidence; single-factor, e.g. password
  loa3 @3;   # high confidence; multi-factor, hardware-backed key
  loa4 @4;   # very high confidence; multi-factor with tamper-resistant
             # hardware and in-person or equivalent identity proofing
}

struct SessionInfo {
  sessionId @0 :Data;
  principal @1 :PrincipalInfo;
  authStrength @2 :AuthStrength;
  createdAtMs @3 :UInt64;
  expiresAtMs @4 :UInt64;
  policyProfile @5 :ProfileSummary;
  resourceProfile @6 :ProfileSummary;
  # Multi-party / delegated / federated session context. Populated when
  # the session was minted through an AuthorityBroker approval flow or a
  # federated IdP rather than direct interactive login.
  delegationChain @7 :List(Data);    # opaque session/IdP IDs
}

struct ProfileSummary {
  id @0 :Data;
  displayName @1 :Text;
  versionId @2 :Data;
  epoch @3 :UInt64;
}

struct CapabilityResultHandle {
  brokerId @0 :Data;
  grantId @1 :Data;
  interfaceId @2 :UInt64;
  issuedAtMs @3 :UInt64;
  expiresAtMs @4 :UInt64;
}

interface UserSession {
  info @0 () -> (info :SessionInfo);
  auditContext @1 () -> (sessionId :Data, principalId :Data);
  logout @2 () -> ();
  # Future result/grant metadata methods must use fresh ordinals; they are
  # intentionally not assigned in this proposal sketch.
}

interface SessionManager {
  login @0 (
    method :Text,
    selector :LoginSelector,
    proof :Data,
    source :LoginSourceMetadata
  ) -> (sessionIndex :UInt16);
  guest @1 () -> (sessionIndex :UInt16);
  anonymous @2 () -> (sessionIndex :UInt16);
  sshPublicKey @3 (
    username :Text,
    algorithm :Text,
    publicKey :Data,
    authBytes :Data,
    signature :Data,
    sourceAddr :Data
  ) -> (sessionIndex :UInt16);
  # Future renewal must use the next free ordinal in the checked-in schema,
  # currently @4, not @3.
}

When brokers return granted caps, GrantedCap should be the same transport-level result-cap concept used by ProcessSpawner, not a parallel authority encoding.

UserSession is the live session/profile summary surface, not the account database and not the process invocation subject itself. In the session-bound invocation model, the immutable kernel-installed SessionContext on the process is the invocation context; kernel/src/session_context.rs owns that state and the spawn-time inheritance/broker-selection rules described in Service Architecture. A UserSession cap may expose stable session metadata, profile summaries, audit context, expiry, and opaque handles for cap-broker results that have already been minted. It can also be used as trusted broker/session-manager input to spawn a child with a matching SessionContext, but copying a UserSession into an existing process cannot install a second session or relabel future calls. These handles are non-bearer metadata for audit and UI display: they cannot be redeemed into caps unless the caller also holds the separate broker, approval, or launcher authority required for the grant. UserSession must not expose mutable account records, credential records, role bindings, storage-root records, policy document bodies, or redeemable grant tokens. Fresh cap bundles come from AuthorityBroker or a launcher/supervisor that consumes the session context; the session cap itself is not a general account-store reader and is not the ordinary authority-vending path.

Session Lifecycle And Renewal

The expiresAtMs field is not sufficient by itself. The target model treats a session as a revocable lease with explicit state:

live | logged_out | revoked | expired | recovery_only

The immutable process SessionContext identifies the subject selected at spawn (see Service Architecture for the kernel-owned spawn-time installation and the make run-session-context proof). It should point at, or be paired with, trusted session-manager liveness state that can change without relabeling the process:

SessionLivenessCell {
    sessionId
    sessionEpoch
    state
    notBeforeMs
    notAfterMs
    policyEpoch
    resourceProfileEpoch
    auditRecordId
}

The liveness cell answers whether ordinary invocation may continue. Grant leases answer whether a particular broker-issued bundle or elevated cap remains valid. Object/facet epochs answer whether the target live object generation has been revoked or replaced. These checks compose; none of them is a substitute for capability possession.

For local password-authenticated shells, fixed short wall-clock expiry should not be the only interactive policy. A sane default is that the session remains live until explicit logout, terminal/connection close, owner shell or supervisor subtree exit, administrator revocation, account disablement, policy version invalidation, or a configured idle/hard maximum. Guest, anonymous, remote, federated, and elevated sessions may use much shorter leases.

Renewal must be a narrow session-manager or broker path. The exact Cap’n Proto signature is future schema work; with the current checked-in SessionManager ordinal map, the first renewal method would be assigned @4 unless another schema change lands first:

interface SessionManager {
  renew @nextFree (
    session :UserSession,
    proof :Data,
    requestedDurationMs :UInt64
  ) -> (session :UserSession);
}

renew may extend the same liveness cell or mint a successor session in the same audit family, depending on policy. It must check account status, auth freshness, session state, policy/resource profile epochs, requested duration, absolute maximum lifetime, and explicit revocation state. It must not make all old grants fresh. When policy needs a new decision, the broker returns fresh grant leases and wrapper caps; stale ordinary grants remain stale or are explicitly revoked.

Only named recovery methods should work after expiry: logout, renew, recovery, and narrowly scoped self-diagnostic status. Explicit revocation should block ordinary renewal unless a separately audited recovery policy says otherwise. Owner-shell exit and gateway disconnect should call logout for sessions they own, then process-exit cleanup releases local hold edges.

Workload

A workload is a process or supervision subtree started from a session, service, or supervisor. Workloads may carry session metadata for audit and policy, but they do not run “as” a user in the Unix sense. They run with a CapSet.

Common workload shapes:

  • interactive native shell
  • agent shell
  • POSIX shell compatibility session
  • user-facing application
  • per-user service instance
  • shared service handling many user sessions
  • service account process

Capability

A capability remains the actual authority. A process can only use what is in its local capability table. Policy services can choose to mint, attenuate, lease, transfer, or revoke capabilities, but they do not create a second authorization channel.

Account and Admission Sources

capOS should have three account and admission sources. All three feed policy; none of them bypass the capability graph.

  1. Manifest seed accounts. Immutable or append-only bootstrap records in the boot package. These create first local operators, recovery identities, service identities, emergency guest policy, and initial policy bundles. Seed data must be sufficient to boot, recover, unlock storage, and create or repair the local account store. It must not become the ordinary mutable account database.
  2. Local account store. Mutable Store/Namespace-backed records for accounts, credentials, roles, attributes, quotas, policy profiles, resource profiles, and storage roots. After initialization, disk state is authoritative for ordinary local accounts, with explicit versioning, rollback detection, and recovery import/export.
  3. External identity admission and bindings. OIDC, passkey, cloud, deployment, or certificate-backed principals mapped to named policy/resource profiles or existing local accounts. External claims are normalized ABAC inputs and may select a binding; they do not grant local authority by themselves.

Account Store Boundary

Mutable account state belongs in a separate account-store schema and service slice, not in the session schema. The identity/session schema should contain PrincipalInfo, SessionInfo, profile summaries, audit context, and opaque broker result handles. The account-store slice owns durable account records, credential references, local role bindings, external identity bindings, profile bodies and versions, storage-root references, recovery/import records, and mutation/audit metadata.

The account-store service should expose typed reads for trusted policy services and compare-and-set mutation methods for administrative tooling. SessionManager reads account-store records only while creating or refreshing a session, then returns a UserSession summary. AuthorityBroker uses that summary plus account-store/profile lookups to mint caps. Ordinary workloads must not learn more than the scoped session/profile metadata and caps they were explicitly granted.

Initial records should stay cap-shaped:

struct AccountRecord {
  recordId @0 :Data;
  principalId @1 :Data;
  kind @2 :PrincipalKind;
  displayName @3 :Text;
  status @4 :AccountStatus;
  credentialRefs @5 :List(Data);
  roles @6 :List(Text);
  attributes @7 :List(Attribute);
  resourceProfile @8 :ProfileRef;
  policyProfile @9 :ProfileRef;
  homeRoot @10 :StorageRootRef;
  createdAtMs @11 :UInt64;
  updatedAtMs @12 :UInt64;
  schemaVersion @13 :UInt32;
  storeEpoch @14 :UInt64;
  recordVersion @15 :UInt64;
  policyEpoch @16 :UInt64;
  previousHash @17 :Data;
  contentHash @18 :Data;
}

struct ProfileRef {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
}

struct StorageRootRef {
  storageServiceId @0 :Data;
  rootObjectId @1 :Data;
  rootKind @2 :StorageRootKind;
  schemaVersion @3 :UInt32;
  rootVersion @4 :Data;
}

enum StorageRootKind {
  namespace @0;
}

enum AccountStatus {
  active @0;
  disabled @1;
  locked @2;
  recoveryOnly @3;
}

struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}

struct ExternalIdentityBinding {
  bindingId @0 :Data;
  provider @1 :Text;
  subjectHash @2 :Data;     # hash(provider kind, issuer, tenant, subject)
  principalId @3 :Data;
  tenant @4 :Text;
  acceptedClaims @5 :List(Text);
  expiresAtMs @6 :UInt64;
  policyProfile @7 :ProfileRef;
  resourceProfile @8 :ProfileRef;
  schemaVersion @9 :UInt32;
  storeEpoch @10 :UInt64;
  recordVersion @11 :UInt64;
  policyEpoch @12 :UInt64;
  previousHash @13 :Data;
  contentHash @14 :Data;
}

homeRoot is a persistent reference that the account/storage broker resolves into a live Namespace capability at session-bundle time. It is not a path, not a raw Directory, and not itself a capability. Compatibility Directory views are projections returned only when a workload needs file-like APIs.

Manifest seed records and local account records may name roles and profiles, but the resulting authority is still the CapSet returned by AuthorityBroker. A disabled or locked account can authenticate only to explicit recovery flows allowed by its account state and current policy.

Stable ID Formats

Names are display and lookup hints only. They must not be treated as authority or as stable cross-store identity. All durable IDs used for account-store joins should be opaque binary values with a declared version and fixed length:

  • Local principals: principalId is a 32-byte opaque random value minted by the local account store or imported from a trusted recovery record. User names, display names, POSIX names, and email addresses are attributes, not identifiers.
  • Account records: recordId is a 32-byte opaque record identity. It may equal principalId only if the store permanently enforces one account record per local principal; otherwise it must be separate.
  • External bindings: subjectHash is a 32-byte hash over canonical provider kind, issuer, tenant, and external subject. bindingId is a 32-byte opaque or content-derived ID over the normalized binding tuple plus the local principal ID. Provider display names and group strings are not authority.
  • Policy and resource profiles: profileId is a 32-byte opaque profile identity. versionId is a 32-byte content hash of the canonical profile body, schema version, parent version if any, and effective constraints. Profile display names such as operator or guest-shell are aliases.
  • Policy versions: policy bundles use a 32-byte versionId plus a monotonically increasing policyEpoch. Brokers refuse grants when the session/profile summary names a stale epoch.
  • Storage roots: storageServiceId, rootObjectId, and rootVersion are storage-service-owned opaque binary identifiers. A storage root is never a path or user name; the storage broker resolves it into a live Namespace only after current policy permits the grant.

Version, Rollback, and CAS Rules

Disk-backed account-store records must be rejected unless their integrity and freshness checks pass. The minimum record header is schemaVersion, storeEpoch, recordVersion, policyEpoch, previousHash, and contentHash. schemaVersion selects the decoder and migration policy. storeEpoch is a monotonic store-wide epoch advanced for every accepted mutation batch. recordVersion is monotonic per record. policyEpoch binds the record to the policy/profile generation used to evaluate it. previousHash chains the prior accepted canonical record bytes when a previous record exists, and contentHash covers the canonical bytes excluding the hash field itself.

Mutations use compare-and-set semantics:

update(recordId, expectedStoreEpoch, expectedRecordVersion, expectedHash, patch)
  -> accepted(newStoreEpoch, newRecordVersion, newHash)
  -> stale(currentStoreEpoch, currentRecordVersion, currentHash)
  -> denied(reason)

Administrative tools must submit the last observed epoch, version, and hash. The store accepts an update only when those values match the current durable record and the new record validates against the active schema and policy epoch. Replayed records, older store epochs, lower or equal record versions, hash-chain breaks, unknown schema versions, profile versions not recognized by the active policy bundle, and missing rollback metadata are fail-closed denials. A failed check may leave the account disabled for ordinary login while allowing only explicit recovery identities to inspect or repair it.

The account store should persist a signed or sealed store checkpoint that records the latest storeEpoch, account-store installation ID, accepted policy epoch, and root hash. If the checkpoint says a later epoch existed than the records currently on disk, the store is in recovery mode and must not let disk account records override manifest seed data or widen authority.

Recovery Import and Seed Repair

Manifest seed data is the recovery source when the local account store is missing, unreadable, or rollback-damaged. Recovery records should include first-operator or break-glass principal IDs, recovery credential references, profile refs, storage-root repair refs, import/export record IDs, allowed repair operations, expiry or quorum requirements, and audit requirements. Recovery identities are not normal operators: their default session bundle is limited to inspecting account-store state, exporting/importing records, disabling stale bindings, and applying exact-target repairs.

Import from seed or offline export is additive and conservative:

  • preserve local principalId, recordId, profile IDs, storage-root refs, and external bindingId values when their hashes and epochs validate;
  • import missing seed operators, service identities, recovery identities, and minimum guest/anonymous profiles needed to boot and repair the system;
  • disable, not delete, external bindings whose provider, tenant, subject hash, policy epoch, or profile version cannot be validated;
  • never auto-map a new external subject to a broader local role or profile than the signed seed/import record names;
  • never widen caps, quotas, storage roots, roles, or approval paths as a side effect of recovery import;
  • emit audit records for import start, source identity, records accepted, records preserved, records disabled, denials, and the final store epoch.

If audit storage is unavailable, recovery may continue only into a bounded emergency mode whose transcript is written to the best available append-only sink and whose repaired accounts remain disabled for ordinary login until an auditable store checkpoint is committed.

Session Startup Flow

flowchart TD
    Input[Login, guest, or anonymous request]
    Auth[Authentication or guest policy]
    Source[Manifest seed, account store, or external binding]
    Session[UserSession cap]
    Broker[AuthorityBroker / PolicyEngine]
    Bundle[Scoped cap bundle]
    Shell[Native, agent, or POSIX shell]
    Audit[AuditLog]

    Input --> Auth
    Auth --> Source
    Source --> Session
    Session --> Broker
    Broker --> Bundle
    Bundle --> Shell
    Broker --> Audit
    Shell --> Audit

The shell proposal’s minimal daily cap set is a session bundle:

terminal        TerminalSession
self            self/session introspection
status          read-only SystemStatus
logs            read-only LogReader scoped to this principal/session
home            Directory or Namespace scoped to account storage
launcher        restricted launcher for approved user applications
approval        ApprovalClient

The shell still cannot mint additional authority. It can ask ApprovalClient for a plan-specific grant, and a trusted broker can return a narrow leased capability if policy and authentication allow it. The terminal cap is the session-scoped foreground TerminalSession, not the boot debug Console; login hands that terminal into the shell bundle only after authentication or explicit guest/setup policy succeeds. The concrete default-boot login/setup flow that consumes this bundle is documented in Boot to Shell, and the shell-side contract for receiving and inspecting it lives in Shell.

Detailed decomposition for manifest-seeded accounts, disk-backed account storage, default resource bundles, local roles, RBAC, ABAC, MAC/MIC labels, POSIX profile metadata, and external identity bindings lives in Local Users, Storage, and Policy.

Multi-User Workloads

capOS should support two normal multi-user patterns.

Per-Session Subtree

The session owns a shell or supervisor subtree. Every child process receives an explicit CapSet assembled from the session bundle plus workflow-specific grants.

Example:

  • Alice’s shell receives home = Namespace("/users/alice").
  • Bob’s shell receives home = Namespace("/users/bob").
  • The same editor binary launched from each shell receives different home and terminal caps.
  • The editor cannot cross from Alice’s namespace into Bob’s unless a broker deliberately grants a sharing cap.

This is the right default for interactive applications and POSIX shells.

Shared Service With Per-Client Session Authority

A server process may handle many users in one address space. It should not infer authority from a caller’s self-reported user name, principal ID, role name, or endpoint label. Instead, a trusted issuer binds the subject before the service accepts it:

  • authentication or admission creates a live SessionContext;
  • a spawned process receives exactly one immutable session context, installed at spawn time by kernel/src/session_context.rs (see Service Architecture);
  • AuthorityBroker grants service roots or narrower facets for that session;
  • endpoint calls expose privacy-preserving caller-session metadata by default;
  • subject details are disclosed only when the method/call explicitly requests disclosure and a broker/service-granted disclosure scope allows the named fields;
  • quota donations or accounting caps may accompany service grants when server-side state needs explicit resource backing.

The service uses the caller session reference, disclosed subject facts, and service-local records to select scoped storage, enforce per-client limits, emit audit records, and return narrowed caps. Endpoint badges are not a normal identity mechanism; any remaining badge-shaped kernel field should be treated as internal endpoint transport state during the migration. This is the right shape for HTTP services, databases, log services, terminals, and shared daemons.

Service Accounts

Service identities are principals too. They are usually non-interactive and receive caps from init, a supervisor, or a deployment manifest rather than from a human login flow.

Service-account policy should be explicit:

  • which binary or measured package may use the identity,
  • which supervisor may spawn it,
  • which caps are in its base bundle,
  • which caps it may request from a broker,
  • which audit stream records its activity.

Service account records may be manifest seeded or stored in the local account store, but their sessions should receive no terminal and no interactive bundle. They launch as workloads with measured binary, supervisor, service name, network/IPC, log, state namespace, and key-use constraints.

Anonymous, Guest, and Pseudonymous Access

These are distinct profiles.

Empty Cap Set

An untrusted ELF with an empty CapSet is not a user session. It is the roadmap’s “Unprivileged Stranger”: code with no useful authority. It can terminate itself and interact with the capability transport, but it cannot reach a resource because it has no caps. The visible proof was achieved by commit d4016ab at 2026-04-22 16:35 UTC.

Anonymous

Anonymous means unauthenticated and usually remote or programmatic. It should receive a random ephemeral principal ID and a very small cap bundle.

Typical properties:

  • no durable home namespace by default,
  • strict CPU, memory, outstanding-call, and log quotas,
  • short session expiry,
  • no elevation path except “authenticate” or “create account”,
  • audit records keyed by ephemeral session ID and network/service context.

Guest

Guest means an interactive local profile with weak or no authentication.

Typical properties:

  • terminal/UI access,
  • temporary namespace,
  • optional ephemeral home reset on logout,
  • restricted launcher,
  • no administrative approval path unless policy grants one explicitly,
  • clearer user-facing affordance than anonymous.

Pseudonymous

Pseudonymous means durable identity without necessarily naming a human. A public key, passkey, service token, or cloud identity can select the same principal across sessions. This can receive persistent storage and quotas while still remaining separate from a verified human account.

External pseudonymous sessions require explicit admission configuration. A binding either maps the external subject to an existing local account or allows auto-creation of a tenant-scoped account with named policy and resource profiles. Durable storage is granted only through that local principal mapping and a broker-minted storage cap.

POSIX Compatibility

POSIX user concepts are compatibility metadata, not authority.

  • uid, gid, user names, groups, $HOME, /etc/passwd, chmod, and chown live in libcapos-posix, a filesystem service, or a profile service.
  • open("/home/alice/file") succeeds only if the process has a Directory or Namespace cap that resolves that synthetic path.
  • setuid cannot grant new caps. At most it asks a compatibility broker to replace the process’s POSIX profile or launch a new process with a different cap bundle.
  • POSIX ownership bits may influence one filesystem service’s policy, but they cannot authorize access to caps outside that service.

This lets existing programs inspect plausible user metadata without making Unix permission bits the capOS security model.

Policy Models

RBAC, ABAC, and mandatory access control fit capOS as grant-time and mint-time policy. They should mostly live in ordinary userspace services: AuthorityBroker, PolicyEngine, SessionManager, RoleDirectory, LabelAuthority, AuditLog, and service-specific attenuators.

The kernel should keep enforcing capability ownership, generation, transfer rules, revocation epochs, resource ledgers, and process isolation. It should not evaluate roles, attributes, or label lattices on every capability call.

RBAC

Role-based access control maps principals or sessions to named role sets. Roles select cap bundles and approval eligibility.

Examples:

  • developer can receive a launcher for development tools and read-only service logs.
  • net-operator can request a leased ServiceSupervisor(net-stack).
  • storage-admin can request repair caps for selected storage volumes.

Implementation shape:

interface RoleDirectory {
  rolesFor @0 (principal :Data) -> (roles :List(Text));
}

interface AuthorityBroker {
  request @0 (
    session :UserSession,
    plan :ActionPlan,
    requestedCaps :List(CapRequest),
    durationMs :UInt64
  ) -> (grant :ApprovalGrant);

  # Mint an ApprovalInbox for the bound session. The broker policy
  # decides whether the requesting session is allowed to triage
  # approvals and which entries are visible (own requests only,
  # role-scoped queue, multi-party reviewer queue).
  inbox @1 (
    session :UserSession
  ) -> (inbox :ApprovalInbox);
}

The detailed ActionPlan, ActionStep, CapRequest, GrantedCap, ApprovalInbox, ApprovalEntry, and ApprovalListener schemas live in Shell under Approval and Authentication. The broker is the single producer for both ApprovalGrant (the requester-side handle) and ApprovalInbox (the decider-side handle); they meet only at the broker, never on a shared transport channel.

Roles do not bypass capabilities. They only let a broker decide whether it may mint or return particular scoped caps.

The role/attribute/decision split matches the ITU-T X.812 Access control framework (= ISO/IEC 10181-3) decomposition into ADF (access-control decision function) and AEF (access-control enforcement function). In capOS terms:

  • The AEF is the CapObject::call dispatch plus wrapper caps: the enforcement point that cannot be bypassed because it is the only path to the underlying object.
  • The ADF is the PolicyEngine / AuthorityBroker: it evaluates a decision request and returns a capability (or refuses) rather than returning a boolean that downstream code might ignore.

The ADF/AEF split is why capOS can make PolicyDecision a cap-minting input rather than a per-call allow/deny flag — the enforcement point is already structural (you need a cap to reach the object) and the decision point returns the cap.

Remote Client Bundles

Remote programmatic and GUI clients consume the same identity and policy model as shells, but they need a different bundle shape. A remote host app may authenticate with password, public key, OIDC, passkey/WebAuthn, mTLS, guest/anonymous admission, or a service/workload credential. After admission, the broker returns a remote-client bundle whose entries are exported as Cap’n Proto RPC object references by a per-session gateway worker.

Those references are live capability proxies, not bearer tokens and not local cap-table metadata. A remote bundle may include session, systemInfo, and specific service caps such as chat, paperclips, or command surfaces. It should not inherit terminal, launcher, broad storage, raw network, key-vault, credential-store, or process-spawn authority merely because an operator shell profile would receive some of those caps. The detailed transport and lifetime rules live in Remote Session CapSet Clients.

ABAC

Attribute-based access control evaluates a richer decision context:

  • subject attributes: principal kind, roles, auth strength, session age, device posture, locality,
  • action attributes: requested method, target service, destructive flag, requested duration,
  • object attributes: service name, namespace prefix, data class, owner principal, sensitivity,
  • environment attributes: time, boot mode, recovery mode, network location, cloud instance metadata, quorum state.

ABAC is useful for contextual narrowing:

  • allow log read only for the caller’s session unless break-glass policy is active,
  • issue ServiceSupervisor(net-stack) only with fresh hardware-key auth,
  • grant Namespace("/shared/project") read-write only during a maintenance window,
  • deny network caps to guest sessions.

ABAC decisions should return capabilities, wrappers, or denials. They should not create hidden ambient checks downstream.

OAuth2 scopes and OIDC claims (acr, amr, groups, tenant-specific fields) are ABAC inputs. The broker consumes them alongside session freshness, object attributes, and environment state to pick a cap bundle or decline. They never authorize capability calls directly, and they do not create a downstream check outside the broker’s decision path. See OIDC and OAuth2.

ABAC Policy Engine Choices

Do not invent a policy language first. The capOS-native interface should be small and capability-shaped, while the broker implementation can start with a mainstream engine behind that interface.

Recommended order:

  1. Cedar for runtime authorization. Cedar’s request shape is already close to capOS: principal, action, resource, and context. It supports RBAC and ABAC in one policy set, has schema validation, and has a Rust implementation. That makes it the best fit for AuthorityBroker and MacBroker service prototypes.

  2. OPA/Rego for host-side and deployment policy. OPA is widely used for cloud, Kubernetes, infrastructure-as-code, and admission-control style checks. It is useful for validating manifests, cloud metadata deltas, package/deployment policies, and CI rules. The Wasm compilation path is worth tracking for later capOS-side execution, but OPA should not be the first low-level runtime dependency.

  3. Casbin for quick prototypes only. Casbin is useful for simple RBAC/ABAC experiments and has Rust bindings, but its model/matcher style is less attractive as a long-term capOS policy substrate than Cedar’s schema-validated authorization model.

  4. XACML for interoperability and compliance, not native policy. XACML remains the classic enterprise ABAC standard. It is useful as a conceptual reference or import/export target, but it is too heavy and XML-centric to be the native capOS policy language.

The capOS service boundary should hide the selected engine:

interface PolicyEngine {
  decide @0 (request :PolicyRequest) -> (decision :PolicyDecision);
}

struct PolicyRequest {
  principal @0 :PrincipalInfo;
  action @1 :Text;
  resource @2 :ResourceRef;
  context @3 :List(Attribute);
}

struct PolicyDecision {
  allowed @0 :Bool;
  reason @1 :Text;
  leaseMs @2 :UInt64;
  constraints @3 :List(Attribute);
}

PolicyDecision is still not authority. It is input to a broker that returns actual caps, wrapper caps, leased caps, or denial.

References:

Mandatory Access Control

Mandatory access control is non-bypassable policy set by the system owner or deployment, not discretionary sharing by ordinary users. In capOS, MAC should be implemented as mandatory constraints on cap minting, attenuation, transfer, and service wrappers.

Examples:

  • a Secret cap labeled high cannot be transferred to a workload labeled low,
  • a LogReader for security logs cannot be granted to a guest session even if an application asks,
  • a recovery shell can inspect storage read-only but cannot write without a separate exact-target repair cap,
  • cloud user-data can add application services but cannot grant FrameAllocator, DeviceManager, or raw networking authority.

Implementation components:

enum Sensitivity {
  public @0;
  internal @1;
  confidential @2;
  secret @3;
}

struct SecurityLabel {
  domain @0 :Text;
  sensitivity @1 :Sensitivity;
  compartments @2 :List(Text);
}

interface LabelAuthority {
  labelOfPrincipal @0 (principal :Data) -> (label :SecurityLabel);
  labelOfObject @1 (object :Data) -> (label :SecurityLabel);
  canTransfer @2 (
    from :SecurityLabel,
    to :SecurityLabel,
    capInterface :UInt64
  ) -> (allowed :Bool, reason :Text);
}

For ordinary services, MAC can be enforced by brokers and wrapper caps. For high-assurance boundaries, the remaining question is whether transfer labels need kernel-visible hold-edge metadata. That should be added only for a concrete mandatory policy that cannot be enforced by controlling all grant paths through trusted services.

The attribute model borrows from ITU-T X.741, which enumerates the managed-object attributes a directory-based access-control system tracks: ACL entries, access-control information (ACI), default access, initiator ACI, target ACI, and access-decision outcome. X.741 targets the X.500 directory, so the schema does not port directly, but the attribute taxonomy is a good completeness check for what LabelAuthority and PolicyEngine requests should expose to a decision engine.

GOST-Style MAC and MIC

Russian GOST framing is stricter than the generic “MAC means labels” summary. The relevant standards split at least two policies that capOS should keep separate:

  • Mandatory access control for confidentiality. ГОСТ Р 59383-2021 describes mandatory access control as classification labels on resources and clearances for subjects. ГОСТ Р 59453.1-2021 goes further: a formal model that includes users, subjects, objects, containers, access levels, confidentiality levels, subject-control relations, and information flows. The safety goal is preventing unauthorized flow from an object at a higher or incomparable confidentiality level to a lower one.

  • Mandatory integrity control for integrity. ГОСТ Р 59453.1-2021 treats this separately from confidentiality MAC. The integrity model constrains subject integrity levels, object/container integrity levels, subject-control relationships, and information flows so lower-integrity subjects cannot control or corrupt higher-integrity subjects.

For capOS, this should map to labels on sessions, objects, wrapper caps, and eventually hold edges:

struct ConfidentialityLabel {
  level @0 :Text;              # e.g. public, internal, secret.
  compartments @1 :List(Text);
}

struct IntegrityLabel {
  level @0 :Text;              # ordered by deployment policy.
  domains @1 :List(Text);
}

struct MandatoryLabel {
  confidentiality @0 :ConfidentialityLabel;
  integrity @1 :IntegrityLabel;
}

Capability methods need a declared flow class. capOS cannot rely on generic read and write syscalls:

  • read-like: File.read, Secret.read, LogReader.read;
  • write-like: File.write, Namespace.bind, ManifestUpdater.apply;
  • control-like: ProcessSpawner.spawn, ServiceSupervisor.restart;
  • transfer-like: CAP_OP_CALL, CAP_OP_RETURN, and result-cap insertion when they carry caps or data across labeled domains.

Initial rules can be expressed as broker/wrapper checks:

read data-bearing cap:
  subject.clearance dominates object.classification

write data-bearing cap:
  target.classification dominates source.classification
  # no write down

control process or supervisor:
  controlling subject is same label, or is an explicitly trusted subject

integrity write/control:
  writer.integrity >= target.integrity

This is not enough for a GOST-style formal claim, because uncontrolled cap transfer can bypass the broker. A higher-assurance design needs:

  • kernel object identity for every labeled object,
  • label metadata on kernel objects or per-process hold edges,
  • transfer-time checks for copy, move, result caps, and endpoint delivery,
  • explicit trusted-subject/declassifier caps,
  • an audit trail for every label-changing or declassifying operation,
  • a formal state model covering users, subjects, objects, containers, access rights, accesses, and memory/time information flows.

The proposal therefore has two levels:

  • Pragmatic capOS MAC/MIC: userspace brokers and wrapper caps enforce labels on grants and method calls.
  • GOST-style MAC/MIC: a formal information-flow model plus kernel-visible labels/hold-edge constraints for transfers that cannot be forced through trusted wrappers. See Formal MAC/MIC for the dedicated abstract-automaton and proof track.

References:

Composition Order

When policies compose, use this order:

  1. Mandatory policy defines the maximum possible authority.
  2. RBAC selects coarse eligibility and default bundles.
  3. ABAC narrows the decision for context, freshness, object attributes, and requested duration.
  4. The broker returns specific capabilities or denies the request.
  5. Audit records the plan, decision, grant, use, release, and revocation.

The composition result is still a CapSet, leased cap, wrapper cap, or denial.

Service Architecture

The policy stack should be decomposed into ordinary capOS services. Init or a trusted supervisor grants broad authority only to the small services that need to mint narrower caps.

SessionManager

Creates and manages session metadata/control caps:

  • guest() for local guest sessions,
  • anonymous(purpose) for ephemeral unauthenticated callers,
  • login(method, proof) for authenticated users,
  • renew(session, proof, requestedDurationMs) for narrow continuation or recovery when policy allows it,
  • logout(session) through the UserSession control cap.

The first implementation can be manifest-seed backed. It does not need a persistent account database, but its seed records must use the same principal, account, policy-profile, and resource-profile vocabulary as the later local account store. UserSession should describe the principal, session ID, policy profile, resource profile, auth strength, expiry, and audit context. It should not be a general-purpose authority vending machine unless it was itself minted as a narrow wrapper around a fixed cap bundle. Session IDs should come from the same dedicated entropy source that the bootstrap login/setup flow in Boot to Shell uses for credential salts and setup tokens; if fresh randomness is unavailable, authenticated session creation should fail closed instead of recycling predictable IDs.

SessionManager should own the mutable liveness cell for sessions it mints. The kernel-installed process SessionContext (owned by kernel/src/session_context.rs; see Service Architecture) remains immutable; renewal changes the cell or produces a successor session, not a new subject label inside the same process. This is the mechanism that makes long-running shells usable without treating fixed short wall-clock expiry as the only safety boundary.

Safer first split:

SessionManager -> UserSession metadata cap
AuthorityBroker(session, policyProfile, resourceProfile) -> base cap bundle
Supervisor/Launcher -> spawn shell with that bundle

AuthorityBroker

The broker owns or receives powerful caps from init/supervisors and returns narrow caps after RBAC, ABAC, and mandatory checks.

Examples:

  • broad ProcessSpawner -> RestrictedLauncher(allowed = ["shell", "editor"]),
  • broad NamespaceRoot -> Namespace("/users/alice"),
  • broad ServiceSupervisor -> LeasedSupervisor("net-stack", expires = 60s),
  • broad BootPackage -> BinaryProvider(allowed = ["shell", "editor"]).

The broker is the normal policy decision and cap minting point.

AuditLog

Append-only audit interface. Initially this can write to serial or a bounded log buffer; later it should be Store-backed.

Record at least:

  • session creation,
  • cap request,
  • policy input summary,
  • policy decision,
  • cap grant,
  • cap release or revocation,
  • denial,
  • declassification or relabel operation.

Audit entries must not contain raw auth proofs, private keys, bearer tokens, or broad environment dumps. For auth/session flows, the initial backend should record opaque credential/token record IDs, volatility flags, and policy/result codes rather than secret-bearing payloads. Failed pre-auth attempts should log only a terminal-local event ID and generic failure class; do not emit principal-identifying fields to the serial-backed path before authentication actually succeeds.

RoleDirectory

Role lookup should start static and boot-config backed:

guest -> guest-shell
alice -> developer
ops -> net-operator
net-stack -> service:network

This is enough for early RBAC bundles. Dynamic role assignment moves into the local account store once persistent storage and administrative tooling exist. Provider groups are not imported as roles automatically; a binding rule may map a provider group to a local role only for a named provider/tenant, expiry, and policy version.

LabelAuthority

Owns the label lattice and dominance checks. In the pragmatic phase, it is a userspace dependency of brokers and wrappers. In a GOST-style phase, the same lattice needs a kernel-visible representation for transfer checks.

Wrapper Caps

Wrappers are the main mechanism. Prefer them over per-call ACL checks in a central service:

  • RestrictedLauncher wraps ProcessSpawner.
  • ScopedNamespace wraps a broader namespace/store.
  • ScopedLogReader filters by session ID or service subtree.
  • LeasedSupervisor wraps a broader supervisor with expiry and target binding.
  • ApplicationManifestUpdater rejects kernel/device/service-manager grants.
  • LabelledEndpoint enforces declared data-flow and control-flow constraints.

This keeps authority visible in the capability graph.

Bootstrap Sequence

Early boot can be static:

init
  -> starts AuditLog
  -> starts SessionManager
  -> starts AuthorityBroker with broad caps
  -> asks broker for a system, guest, or operator shell bundle
  -> spawns shell through a restricted launcher

Before durable storage exists, policy config comes from BootPackage / manifest config. Early authentication may still use bootstrap verifier or public-credential records plus guest/anonymous/local-presence profiles, but it must keep fresh-entropy requirements fail-closed and treat any RAM-only credential or disable-state changes as volatile.

Revocation, Audit, and Quotas

User/session policy depends on the Stage 6 authority graph work:

  • one-session-per-process plus privacy-preserving endpoint caller-session metadata lets shared services distinguish session/client relations; receiver selectors are only routing metadata,
  • mutable session liveness cells distinguish live, logged-out, revoked, expired, and recovery-only sessions without relabeling running processes,
  • resource ledgers and session quotas prevent denial-of-service through session creation,
  • CAP_OP_RELEASE and process-exit cleanup reclaim local hold edges,
  • epoch revocation lets a broker invalidate leased or compromised caps,
  • renewal mints or refreshes session/grant leases under policy; it must not revive stale ordinary grants by accident,
  • audit logs record the cap grant and release lifecycle.

The cross-cutting quota model lives in Resource Accounting and Quotas. Account and session resource profiles are templates; brokers, supervisors, and resource owners translate them into concrete ledgers and wrapper caps.

Audit should record identity and policy metadata, but it should not contain secrets, raw authentication proofs, or broad environment dumps.

Implementation Plan

  1. Document the model. Keep user identity out of the kernel architecture, publish the principal/user/account/profile/session/role/workload vocabulary, and link this proposal from the shell, service, storage, and roadmap docs.

  2. Manifest-seeded account and profile schema. Define boot-package seed records for first operators, recovery identities, service identities, guest policy, policy profiles, resource profiles, and initial role bindings. Validate that seed data names policy inputs only and does not grant ordinary accounts privileged kernel caps directly.

  3. Session-aware native shell profile. Treat the shell proposal’s minimal daily cap set as a session bundle. Add self/session introspection and scoped logs/home caps once the underlying services exist.

  4. Authority broker and audit log. Add ActionPlan, ActionStep, CapRequest, ApprovalClient, ApprovalInbox, ApprovalEntry, leased grant records, and an append-only audit path. The shell-proposal Approval and Authentication section defines the schemas; the broker is the single producer for both the requester-side ApprovalGrant and the decider-side ApprovalInbox. Start with RBAC-style policy/resource profile bundles and explicit local authentication.

  5. Local account store and external bindings. Add a Store/Namespace-backed AccountStore for account records, credential references, role bindings, external identity bindings, policy versions, resource profiles, and storage-root references. Include version and rollback checks before treating disk-backed account mutation as durable.

  6. ABAC policy engine. Extend the broker decision with session freshness, auth strength, object attributes, requested duration, and environment state. Prefer Cedar for the runtime broker interface; use OPA/Rego for host-side manifest and deployment checks. Keep decisions visible in audit records.

  7. Mandatory policy labels. Add pragmatic labels to policy-managed services and wrappers. Keep confidentiality and integrity separate. Defer kernel-visible labels until a specific MAC/MIC policy cannot be enforced by trusted grant paths.

  8. Guest and anonymous demos. Show a guest shell with terminal, tmp, and restricted launcher, and show an anonymous workload with strict quotas and no persistent storage.

  9. POSIX profile adapter. Provide synthetic uid/gid, $HOME, /etc/passwd, and cwd behavior from session policy/resource profiles and granted namespace caps.

  10. GOST-style formalization checkpoint. If capOS later claims GOST-style MAC/MIC, write the abstract state model before implementation: users, subjects, objects, containers, access rights, accesses, labels, control relations, and information flows. Then decide which labels must become kernel-visible.

Non-Goals

  • No kernel uid/gid.
  • No ambient root.
  • No global login namespace in the kernel.
  • No authorization from serialized identity structs.
  • No model-visible authentication secrets.
  • No POSIX permission bits as system-wide authority.
  • No per-call role/attribute/label interpreter in the kernel fast path.
  • No claim of GOST-style MAC/MIC until the formal model and transfer enforcement story exist.

Open Questions

  • Which session interfaces are needed before persistent storage exists?
  • Which audit store is acceptable before durable storage and replay exist?
  • Which MAC policies, if any, justify kernel-visible hold-edge labels?
  • How should remote capnp-rpc or future OCapN/CapTP-style identities map into local principals? Transport identity, locator hints, and routing metadata are not local user/session identity by themselves; remote peers should enter through broker/session policy rather than raw protocol fields. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP.
  • Should the first broker prototype embed Cedar directly, or use a simpler hand-written evaluator until the policy surface stabilizes?

Proposal: Default User Avatar From Identity Hash

How capOS should pick a default avatar for an account or session in a way that is deterministic, stable across reboots and devices, free of network side-effects, and easy for the user to override with an explicit choice.

Status: partially implemented. The current tree implements the first shell-side phase without adding schema, kernel state, or broker state. The native shell derives a default avatar from existing UserSession.info data: account, operator, and service sessions use their principal identifier, while anonymous and guest sessions use their minted session identifier. It uses a shell-local BLAKE2b-512 domain prefix over the same framed identity input shape, then prints the selected set-flat tile in session output as avatar=set-flat/<asset-stem> avatar_source=default avatar_override=none. The capability-carried UserSession.avatar method, durable account override storage, active-set discovery through SystemInfo, and remote-session view model propagation remain future phases.

Problem

Today every UserSession is metadata-only: name (sometimes), profile class (anonymous / guest / operator / future durable accounts as defined by user identity and policy), and a session-token entropy field. Any UI that needs to show “who is this” — the boot-to-shell login prompt, the shell prompt, the remote-session client, future GUI — has nothing to draw beyond a profile-class fallback. The consequences:

  • New accounts and anonymous sessions look identical even when they have different identities, which is misleading in any multi-account context.
  • If an admin assigns avatars by hand, the assignment lives outside the identity surface and is not stable across re-imports of the account.
  • Without an authority-controlled default, every UI invents its own, including potentially a Gravatar-style network call that exposes the account email to a third party.

The branding asset set already ships 144 curated rounded-card tiles (branding/user-icons/set-flat/, 72 icons; branding/user-icons/set-modern/, 72 icons). They are typed as user avatars but have no consumer yet.

Goals

  1. Every account and session resolves to a concrete avatar without relying on network lookups or external services.
  2. The default is stable: the same identity always resolves to the same tile, on every host that imports the account, until the user explicitly sets an override.
  3. The default is derived from a stable identifier, not from mutable profile fields like display name. Renaming an account does not change its avatar.
  4. The override is persistent and travels with the account, not with a per-host UI preference store.
  5. Anonymous and short-lived sessions still get some deterministic avatar so they look distinct from each other within a session lifetime, without leaking durable identity.
  6. The avatar surface is a capability, not an ambient lookup. UI code asks for an Avatar from the UserSession it already holds.

Non-Goals

  • Generative identicons (jdenticon-style pixel art). The curated tile sets are already on disk and visually consistent with the rest of the branding.
  • Per-user avatar uploads. The override is a selection from the shipped set for now; arbitrary blob uploads are a separate, larger design question (storage, scanning, capability scope).
  • Avatar themes that follow OS dark/light mode. Theme handling is the responsibility of the rendering surface; the identity layer commits to a single tile per account.
  • Group/role icons, badges, presence indicators. Those layer on top of the avatar, they do not replace it.

Design

Identity Inputs

The hash input is the stable account identifier, never the display name and never anything that can be rotated for security reasons. Every subject class is length-framed and domain-tagged, so an attacker who can choose bytes for one class cannot synthesize a collision against another:

input := classTag || u16(len(field_1)) || field_1 || ... || u16(len(field_k)) || field_k
SubjectClass tagFields (in order)
Durable account"acct"principalId
Manifest-seeded operator account"oper"principalId (resolved to the seeded operator)
Service identity"svc "principalId (manifest or registry)
Federated account"fed "providerKind, issuer, tenant, subject
Anonymous session"anon"session-token entropy
Guest session"gst "session-token entropy

Class tags are 4 fixed bytes (space-padded where shorter) so the input prefix is unambiguous without needing a separator. The federated layout matches the canonical external subject key from user identity and policy: the same (providerKind, issuer, tenant, subject) tuple that produces AccountExternalBinding.subjectHashsubject alone is not unique across identity providers and must not be used directly. Length-framing ensures that, e.g., (issuer="A", tenant="BC") and (issuer="AB", tenant="C") hash differently even though their concatenations would otherwise be equal.

Hash and Mapping

message    = "capos-avatar-v1" || 0x00 || input
digest     = BLAKE2b-512(message)
tile_index = u32_be(digest[0:4]) % len(active_set)
  • BLAKE2b is the digest primitive named by the cryptography and key management proposal; no new primitive.
  • "capos-avatar-v1" is the public domain-separation tag, not a secret key. The current shell-side phase prepends it to the BLAKE2b-512 message with a zero separator rather than using the BLAKE2 parameter block, so the mapping is explicit in the no-std shell code. The avatar selection is fully derivable from public account metadata; no MAC and no HKDF subkey derivation is involved. Bumping v1 to v2 would let us re-issue defaults across the fleet (e.g., if a future tile set deprecates an icon) without affecting any other hash that consumes the same identifier.
  • u32_be over the first four digest bytes is sufficient: the tile-set sizes (72) are far smaller than 2^32, and modulo bias on a 32-bit space against 72 buckets is below 2^-25 — visually irrelevant.
  • Collisions are fine and expected: with 72 tiles, two arbitrary accounts collide with probability ~1.4%; in a tenant of 36 users, the birthday probability of any collision is roughly 50%. The avatar conveys identity hint, not identity proof. Higher-assurance UIs combine the avatar with the display name and account id.

Active Set

Each system commits to one active set (set-flat or set-modern). The active set is a system-level configuration value, not a per-user choice, so that:

  • All accounts on a host look stylistically uniform.
  • Switching the system theme remaps every account’s default deterministically but consistently — every account moves to its set-modern tile of the same hash position, not to a random new one.

The active set is exposed via SystemInfo.avatarSet (extension to the existing SystemInfo capability). Future themes add new sets without reshuffling existing assignments.

The implemented shell-side phase is narrower: it compiles the current set-flat catalog directly into the native shell and does not add SystemInfo.avatarSet. That keeps the first proof off the schema serial surface while still making the default mapping visible to users. The later capability-carried phase should replace the compiled shell catalog with the system-advertised active set described above.

Override

A durable account can pin an explicit tile that wins over the hash-derived default. The override is a new optional field on AccountRecord:

struct AccountRecord {
  # ...existing fields @0..@18 from the identity proposal...
  avatarOverride @19 :AvatarRef;       # zero-length set/name means "no override"
}

struct AvatarRef {
  set  @0 :Text;   # e.g. "set-flat", "set-modern"
  name @1 :Text;   # e.g. "panda" (the bare semantic name, without NNN- prefix)
}
  • The override is mutated through the existing AccountStoreManager.update(recordId, expectedStoreEpoch, expectedRecordVersion, expectedHash, patch) compare-and-set protocol defined by the identity proposal. Setting or clearing an avatar bumps recordVersion, recomputes contentHash, and links to previousHash exactly like any other field change; nothing about avatar overrides bypasses the record-version, store-epoch, or hash-chain checks.
  • Validation: set must name a set the active build ships, and name must resolve to a tile within that set. Records that fail this check at load time fall back to the hash-derived default and emit an audit record; the record is not silently rewritten.
  • The override is checked first; the hash is the fallback.
  • update patches use the standard “absent field means unchanged” convention. Clearing an override is an explicit operation: the patch must contain avatarOverride with both set and name empty. An unrelated update that omits avatarOverride from its patch must not drop a previously pinned override.
  • The override is a name, not a tile blob. Storing only the name keeps the account record compact, makes shipped-asset replacement automatic (a re-rendered tile with the same name applies everywhere), and avoids embedding image data in identity records.
  • Account export/import carries the field unchanged: since the import path validates set/name against the importing host’s shipped tile catalog, an override that names a tile the importing host does not ship is downgraded to the hash-derived default at import time and audited, never silently dropped.
  • Anonymous and guest sessions cannot pin an override: they are short-lived and have nowhere durable to store it. Their hash result is the only avatar they get.

Capability Surface

UserSession gains:

interface UserSession {
  info         @0 () -> (info :SessionInfo);
  auditContext @1 () -> (sessionId :Data, principalId :Data);
  logout       @2 () -> ();
  avatar       @3 () -> (avatar :Avatar);
}

interface Avatar {
  # Stable, content-addressed handle for the chosen tile. `digest` is the
  # SHA-256 of the encoded WebP bytes, NOT the identity hash. Two accounts
  # that resolve to the same tile (whether through hash collision or
  # explicit override) return the same `digest`, so UIs can cache by it.
  ref  @0 () -> (set :Text, name :Text, digest :Data);

  # Bytes of the encoded WebP, when the caller is allowed to render it
  # locally. Same caps that grant `UserSession` are sufficient; no separate
  # avatar-read authority.
  read @1 () -> (image :Data, mime :Text);
}

The avatar ordinal @3 follows the existing info @0, auditContext @1, logout @2 ordinals on UserSession and slots into the next free position. A future schema change that lands ahead of this one must shift the avatar ordinal accordingly; the cap-name is the contract, not the ordinal number.

  • ref returns a content-addressed identifier suitable for caching across reboots without re-reading the bytes. The asset SHA-256 is computed once per shipped tile at build time and is identical across accounts that resolve to the same tile, so UI clients can key their local cache by digest and dedupe across many sessions. The identity-derived digest from the Hash and Mapping section is internal to the avatar resolver and is not exposed by ref.
  • read returns the WebP bytes from the active set’s tile. The ABI does not expose alternate formats — surfaces that need PNG can decode locally.

When the Avatar is Bound

Resolution happens lazily, the first time avatar() is called on a session:

  1. If the underlying account has an override, pick that tile.
  2. Otherwise, hash the account/session identity input, take index % len(set) in the active set, look up branding/user-icons/<set>/<NNN>-<name>.webp.
  3. Cache the result on the in-memory session object until the session is torn down.

There is no precomputation step at boot or login; the cost is one domain-separated BLAKE2b digest plus one filesystem read per session, both negligible.

Surfaces That Consume Avatars

  • Login UI (text shell login per boot to shell, future web login, future GUI): show the avatar next to the typed username> while waiting for the hidden password> prompt, so the user has a non-cryptographic visual confirmation that they are selecting the account they expect. The avatar itself is not a secret and exposing it pre-auth is intentional — the same is true of display names, and the boot-to-shell flow already accepts pre-auth account selectors.
  • Shell prompt and whoami: the current text shell prints a deterministic set-flat/<asset-stem> default in the existing session output. Future graphical terminals can render the referenced tile inline, and the shell can switch from the compiled catalog to the UserSession.avatar capability once that schema phase lands.
  • Remote-session client and Tauri wrapper: the bridge already receives a view model from the trusted backend; add avatar to the session view model so the browser/desktop UI never queries identity directly, and the same view-model field carries the operator’s chosen tile after login upgrades the anonymous session to an authenticated UserSession through SessionManager.login as described in boot to shell.
  • System monitoring / audit views: the avatar identifies the actor in human-readable timelines without leaking the underlying id; the audit trail for override edits flows through the same AccountStoreManager record-chain the identity proposal already audits.

Anonymous and Guest Sessions

Anonymous and guest sessions, in the sense the user identity and policy and boot to shell proposals already define them, get a hash-derived avatar with these constraints:

  • The input uses the four-byte class tags "anon" and "gst " from the framed-input table above, so an anonymous session and a durable account with the same entropy field do not collide.
  • The avatar lives only as long as the session-token entropy. Re-anonymizing through SessionManager.anonymous() produces a new tile.
  • The login UI distinguishes anonymous and guest sessions from durable accounts by a chrome accent (border colour, badge, label), not by reusing one fixed tile. Reusing a fixed tile would make every anonymous user look identical, which loses the “tell sessions apart at a glance” property.

Privacy and Security

  • The hash uses a public domain-separation tag, not a secret key. The tile derivation is fully reproducible from public account metadata; the privacy guarantee is “the avatar leaks no extra information beyond what the identity surface already exposes,” not “the underlying id is hidden by cryptography.” The identity digest never leaves the identity layer — only its modulo-N tile-index result reaches UI surfaces, embedded in the resolved set/name.
  • Cross-host correlation is intentionally observable. Because the hash has no per-host salt, the same durable or pseudonymous principal imported into two hosts produces the same set/name and digest on both. Anyone who watches the avatar surface on multiple hosts can correlate “same account is here too,” and combined with display name or session metadata the avatar acts as a low-entropy identifier. This is the same correlation surface the principal id and external-binding subjectHash already expose, so we treat it as acceptable for ordinary multi-host accounts and call it out explicitly so privacy-sensitive deployments can pin a generic override or set a per-host override policy.
  • Operators can audit-log avatar overrides as account-record edits, like any other identity field; the override mutation goes through AccountStoreManager.update and produces the same store-epoch / record-version / content-hash audit chain as other record changes.
  • The avatar is not authentication. Two accounts with the same tile are not equivalent; the system always uses the principal id internally. The avatar is an identification aid for humans, like a display name.
  • No network lookups. No Gravatar, no third-party calls.

Open Questions

  • Should the hash include a per-system salt so an account imported into two hosts does not always show the same default tile, similar to how Unix uid/gid space is host-local? This proposal currently says no — cross-host stability is more useful than host-local distinctness, since durable accounts already have a globally unique id.
  • Should Avatar.read expose only the active-set bytes, or both set-flat/set-modern so a UI can render adaptive variants? Current preference: only the active set. Adaptive themes are the surface’s job, not the identity layer’s.
  • How should the manifest seed an override for the operator account? A seed.operator.avatar = "set-flat:robot" field in system.cue is the natural extension, but only if operators express a need — the hash-derived default is already deterministic.

Relevant Research

Proposal: Delegated Subject Context

This proposal records the future model for acting on behalf of another subject. It was intentionally out of scope for the completed Session-Bound Invocation Context milestone and is treated as future work by the User Identity and Policy proposal and the Service Architecture proposal. The current state of the implemented session-bound model and its known residuals is tracked in Design Risks Register entries R2 (session-bound invocation context) and R14 (user identity / policy maturity).

The implemented milestone established the simpler rule first:

capability = authority to invoke
calling process session = who invokes

Cross-session capability transfer may delegate authority to invoke when the capability’s transfer policy permits it. That is not subject delegation. The User Identity and Policy proposal already carries a delegationChain field on SessionInfo that records when a session was minted through an AuthorityBroker approval flow or a federated IdP; that field is session provenance, not the per-call represented-subject context this proposal introduces.

Problem

Some workflows legitimately need a process to act on behalf of a different subject:

  • a user asks an agent process to send a chat message for them;
  • an operator grants a support session bounded access to perform one action;
  • a service account performs a maintenance action for a tenant;
  • an approval flow lets a worker complete a task in another principal’s name.

The system must support this without making the receiving process “become” the source subject. The caller’s own process session remains the invoker for audit, resource accounting, and privacy. The represented subject is separate, explicit, scoped, and revocable.

Design

Use a delegated-subject capability:

SubjectDelegation {
    source_subject,
    delegate_subject_or_session,
    target_service,
    allowed_methods_or_purpose,
    disclosure_scope,
    expires_at,
}

The exact ABI may be a SubjectDelegation interface, a broker result cap, or a service-specific delegation cap. The invariant is stable:

invoked service cap = authority to call
calling process session = invoker
SubjectDelegation = represented subject context

Holding a SubjectDelegation is not enough to call a service. The caller must also hold the service capability being invoked. This composes cleanly with the service architecture’s existing rule that authority to act flows through the service capability itself, not through ambient subject identity; see Service Architecture proposal.

Example

Bob process session = Bob
Bob holds ChatRoot
Bob holds SubjectDelegation(Alice -> Bob, target_service = ChatRoot, scope = post)

ChatRoot.post(channel = "ops", text = "...", represented = AliceDelegation)

The service records:

invoker = Bob session reference
represented_subject = Alice, through bounded delegation
authority_to_call = ChatRoot

Bob has not become Alice. Audit and abuse handling can still identify Bob as the actor while showing that Alice delegated a bounded representation. This preserves the audit identity model the User Identity and Policy proposal already specifies for UserSession.auditContext: the invoker session reference is the audit subject, and the represented-subject context is a separate facet on the call.

Privacy

A delegated-subject capability must not disclose all source-subject facts. It should carry or vend only the facts the issuer allowed for the target service, such as:

  • per-service display name;
  • guest/operator class;
  • a per-service audit pseudonym;
  • a narrow claim such as “may approve invoice 123”.

It should not expose account-store records, external IdP claim bags, credential identifiers, global principal ids, or unrelated profile attributes by default. The default-private endpoint subject-disclosure rule introduced by the session-bound milestone applies here too: explicit disclosure is opt-in per method and bounded by the delegation’s disclosure_scope. See User Identity and Policy proposal for the broader privacy posture and Session-Bound Invocation Context for the implemented baseline.

Relationship To Capability Transfer

Capability transfer and subject delegation are different operations:

cap transfer only:
    receiver gets authority to invoke;
    receiver invokes as its own process session.

delegated subject context only:
    receiver may present a represented subject;
    no service method is callable unless receiver also holds a service cap.

cap transfer + delegated subject context:
    receiver invokes the cap as its own process session;
    service also sees the represented subject through explicit delegation.

The first implementation path should not depend on this proposal. Implement session-bound invocation context, transfer scopes, and shared-service migration first; add delegated subject context only after those rules are observable and reviewed. The session-bound prerequisites are landed (see Session-Bound Invocation Context and R2 in Design Risks Register); durable identity, ABAC/MAC, and broker maturity tracked under R14 of the same register are still proposal-shaped, so a delegated-subject implementation should not be selected until those mature far enough to give it a stable issuer.

Open Questions

  • Whether the kernel should validate generic delegation metadata such as target_service and expiry, or whether services should validate the delegation cap through a method call.
  • Whether delegated-subject caps are broker-owned, service-owned, or both.
  • How revocation of delegated subject context composes with ordinary cap revoke/lease behavior.
  • Whether the disclosure scope should be encoded as schema-specific facets or as a common metadata envelope.
  • How SessionInfo.delegationChain (session provenance) and a future SubjectDelegation (per-call represented subject) compose without re-introducing ambient subject authority; the User Identity and Policy proposal owns the session-provenance side of that boundary.

Proposal: System Configuration and Operator Extensibility

Current operator-facing design authority now lives in Configuration. Manifest/startup authority lives in Manifest and Service Startup. This proposal is retained as the archival rationale and implementation history.

A small, layered CUE configuration model for the boot manifest that lets operators extend the default boot (system.cue) without forking it, unifies the host operator into a single principal regardless of which authentication method they use, and moves the per-user toolchain cache out of the repository root.

Problem

The default boot manifest (system.cue) and its focused-proof siblings (system-spawn.cue, system-shell.cue, the various system-ssh-*.cue, etc.) are each self-contained CUE files with a large shared scaffold copy-pasted across them. Three concrete pain points follow from that.

  • No clean operator extension surface. An operator who wants to add their own SSH public key, a second principal, or a different MOTD has to edit system.cue directly and carry that as a local diff against main. There is no documented “drop a small file, get an overlay” mechanism, so changes accumulate as untracked checkout-local state or get lost during git pull.
  • No host-user awareness. The default operator account in system.cue is hardcoded as name="operator" / displayName="operator". The host user typing make run sees a generic login identity, and adding their real SSH key requires manual conversion of the .pub file into the manifest’s hex format. The build environment already knows the host user ($USER), the SSH key (~/.ssh/id_ed25519.pub), and the typical operator preferences; none of that information reaches the manifest.
  • Superseded cache default: the original implementation used $(GIT_COMMON_DIR)/../.capos-tools, which created one pinned-tool cache per clone. The implemented default is now $(HOME)/.capos-tools through CAPOS_TOOLS_ROOT, with per-version subdirectories such as limine/<commit>/ and cue/<version>/.

Adjacent design pressure: the SSH Shell Gateway milestone needs a plausible answer to “where does the host operator’s SSH key go?” before its run-target/init-mandate Gate D can close, and the local-users backlog wants the host operator’s session to be a single account with multiple authentication bindings (password, SSH key, future passkey) rather than parallel operator/ssh-operator/passkey-operator seeds.

Design

The proposal is four small, independent moves that compose into one operator-facing extension surface.

1. Per-user toolchain cache

CAPOS_TOOLS_ROOT defaults to $(HOME)/.capos-tools instead of $(GIT_COMMON_DIR)/../.capos-tools. The override path stays available (set the variable explicitly to relocate). Existing per-version subdirectories (limine/<commit>/, cue/<version>/, etc.) keep multiple capOS clones from colliding on a single host. The first make after the change repopulates the new path; the old in-repo .capos-tools/ is left in place and can be removed manually.

Slice 2 must update every consumer that derives the pinned CUE path from the old default. At minimum:

  • tools/mkmanifest::expected_cue_path validates CAPOS_CUE against $(CAPOS_TOOLS_ROOT)/cue/<version>/bin/cue.
  • tools/check-generated-adventure-content.sh recomputes the same path in shell and is invoked by make generated-code-check. If the Makefile exports CAPOS_CUE to the new path but the script recomputes the old one, the generated-code gate will reject the pinned CUE binary.

Any future tool that pins repo-selected helpers must follow CAPOS_TOOLS_ROOT in lock step with the Makefile.

This change is independent of the rest of the proposal — it could ship on its own — but it is bundled because the same operator-extension narrative covers it: per-user state belongs in $HOME, not in the repository.

2. cue/defaults/ package, packaged-default directory, and overlay shape

A new cue/defaults/defaults.cue declares package defaults and exports #DefaultSystem capturing the shared scaffold. The manifest decoder reads root-level schemaVersion, binaries, initConfig, and kernelParams (with seed accounts, resource profiles, authorized SSH keys, MOTD, UART config, and log level all nested under kernelParams), so #DefaultSystem mirrors that exact shape — final fields are at the document root, with kernelParams holding the kernel-side config tree:

  • binaries declarations common to interactive boots
  • initConfig.init and initConfig.services skeletons for the password-login + anonymous-shell flow
  • kernelParams.consoleUart / kernelParams.terminalUart / kernelParams.logLevel
  • kernelParams.seedAccounts with a single canonical host-operator entry (32-byte fixed principalId)
  • kernelParams.resourceProfiles with a single canonical operator resource profile
  • kernelParams.motd
  • kernelParams.authorizedSshKeys (empty by default)
  • A documented set of appendable extension inputs (see below) that overlays use to extend lists. CUE list unification is element-wise conflict, not concatenation; CUE v0.16 also rejects the legacy [a] + [b] list-arithmetic form, requiring list.Concat from the standard library.

The repo’s cue.mod/module.cue declares module: "capos.local" with language v0.16.0. The defaults package lives at cue/defaults/ and uses package defaults (not package capos) so the root overlay can import it without a self-import.

The packaged default manifest stays at the repo root as system.cue, declaring package capos. The overlay companion is system.local.cue (repo root, package capos, gitignored). Focused-proof manifests migrate independently to their own packages so they can import the defaults package without joining package capos. Every repo-root system-*.cue manifest now declares its own CUE package and imports the defaults package, except system-paperclips.cue and system-adventure.cue (demo-owned, package-less but still importing defaults) and system-measure.cue (owned by the measure-mode-repair plan and intentionally not migrated yet). See the Slice-3 inventory table below for the full mapping.

Keeping the default manifest at the repo root preserves the current embed_binaries contract — tools/mkmanifest resolves binaries[].path relative to the manifest’s parent directory and rejects .., so the manifest must live in a directory from which existing repo-root-relative paths like init/target/... are reachable. Moving the default into a subdirectory would force a parallel binary-path-base change in mkmanifest; that is not worth the additional surface for the value of co-locating the overlay.

// system.cue (repo root, packaged default)
package capos

import defaults "capos.local/cue/defaults"

_user: string | *"operator" @tag(user)

#Manifest: defaults.#DefaultSystem & {
    user: _user
}

// Final manifest fields the decoder consumes are at document root.
// The decoder ignores any unused names like #Manifest.
schemaVersion: #Manifest.schemaVersion
binaries:      #Manifest.binaries
initConfig:    #Manifest.initConfig
kernelParams:  #Manifest.kernelParams

The default MOTD value lives only in the defaults package (motd: string | *_defaultMotd, where _defaultMotd is the multi-line capOS welcome with chat/adventure shell hints — see cue/defaults/defaults.cue). system.cue does not assign MOTD itself, so a cue export .:capos without an overlay still resolves to a complete value — two sibling string | *"..." defaults from different files would unify to “incomplete” in CUE v0.16. An overlay refines the field by declaring a concrete value (no *), which is more specific than the default and wins under unification:

// system.local.cue (overlay)
package capos

#Manifest: kernelParams: motd: "Hi alice — capOS dev box."

tools/mkmanifest today invokes cue export <file> against a single file path; CUE then loads only that file (plus its imports) and does not unify other root files even when they share a package name. Slice 2 adds a --package <name> flag that switches mkmanifest to cue export <dir>:<name> (where <dir> is the file’s parent and <name> is capos). The Makefile passes --package capos only for the default-boot recipe; focused make run-* targets keep single-file mode and are not affected by the new packaged default.

Two Makefile changes are required for slice 2 to be safe:

  1. The manifest.bin rule’s prerequisites must include the defaults package (cue/defaults/*.cue) and system.local.cue (when it exists). Otherwise, edits to those files leave a stale manifest.bin and make run boots the previous configuration.
  2. Tag-dependent builds (CAPOS_CUE_USER=$(USER) and optional CAPOS_CUE_DISPLAY_NAME=...) must invalidate cached manifest.bin when the tag value changes. The intended pattern is a sentinel file under target/ whose contents record the tag values; the manifest rule depends on the sentinel, and the sentinel is regenerated whenever the CUE tag environment changes. Without this, make run after a make run-smoke (different tag) silently boots the cached operator-tagged manifest.

3. @tag(user) injection contract

The host user name is injected into the manifest at cue export time via a CUE tag. Because the manifest’s authoritative tag site must be in a file that cue export actually reads, the tag is declared in the root overlay file (system.cue), not the imported defaults package. CUE evaluates tag attributes at the file where they are declared.

The tag site is in the packaged default manifest file (system.cue at the repo root, shown above) — that file declares _user @tag(user) and threads it into defaults.#DefaultSystem via the user field. The defaults package itself does not need a @tag because tags are evaluated where they appear in the input.

// cue/defaults/defaults.cue (excerpt)
package defaults

import "list"

// Fixed 32-byte principal ID — manifest validation rejects shorter
// or longer values. Only display strings vary by host user; the
// audit-correlatable principal stays stable.
_canonicalOperatorPrincipalId: "local-operator-principal-default"

#DefaultSystem: {
    user: string | *"operator"

    schemaVersion: 1
    binaries:      [...] // shared list
    initConfig:    {...} // anonymous-shell flow

    extraSeedAccounts:      [...#SeedAccount]      | *[]
    extraResourceProfiles:  [...#ResourceProfile]  | *[]
    extraAuthorizedSshKeys: [...#AuthorizedSshKey] | *[]

    kernelParams: {
        motd:        string | *"capOS default boot. Type 'login' or 'setup'."
        consoleUart: {...}
        terminalUart: {...}
        logLevel:    string | *"debug"

        seedAccounts: list.Concat([[{
            name:            user
            displayName:     userDisplayName
            principalId:     _canonicalOperatorPrincipalId
            kind:            "operator"
            // ...
        }], extraSeedAccounts])

        resourceProfiles: list.Concat([[{
            name: "default-operator-profile"
            // ...
        }], extraResourceProfiles])

        authorizedSshKeys: extraAuthorizedSshKeys
    }
}

tools/mkmanifest today invokes cue export <path> from Rust and does not pass --inject / -t flags. Slice 2 adds a tag pass-through: either a new mkmanifest --tag user=alice CLI option that mkmanifest forwards to the underlying cue export, or — simpler — mkmanifest reads environment tags and forwards each value as --inject key=value. The Makefile sets CAPOS_CUE_USER=$(USER) for make run only; mkmanifest derives displayName from that same account’s passwd comment unless CAPOS_CUE_DISPLAY_NAME is explicitly set. make run-smoke and CI-shaped targets leave them unset, so untagged system.cue continues to see account=operator / display=operator; focused smoke manifests may pin demo-specific account fixtures independently.

@tag is the standard CUE pattern for build-time string injection and is preferred over preprocessing the file with sed or generating a wrapper file. It generalizes: future tags can carry hostname, locale, timezone, or other build-environment-derived values without adding more mechanisms.

4. system.local.cue overlay hook

The overlay file is system.local.cue at the repo root, declaring package capos. It is gitignored explicitly. CUE in package mode ignores files whose names start with ., so a leading-period variant would not be loaded; the chosen filename has no leading dot.

In package mode (slice-2 mkmanifest invocation cue export .:capos), CUE unifies every non-hidden *.cue file in the directory that declares package capos — today that is just system.cue; once the operator adds system.local.cue, both files are unified automatically with no imperative include. Focused-proof manifests are not picked up because migrated variants use their own package names and unmigrated variants remain package-less.

A checked-in system.local.cue.example (repo root, package capos) documents the supported extension shapes with worked examples. The operator copies it to system.local.cue to activate.

Appendable extension inputs

CUE list unification is element-wise conflict, not concatenation, so an overlay cannot extend the defaults’ seedAccounts or authorizedSshKeys by re-assigning the same field. The defaults package therefore exposes named extension lists that it concatenates into the final manifest fields:

See the defaults excerpt above for the appendable inputs (extraSeedAccounts, extraResourceProfiles, extraAuthorizedSshKeys) and how they are concatenated via list.Concat — the form [a] + [b] is rejected by CUE v0.16.

The overlay populates the extra* fields on #Manifest (which is the named definition produced by the packaged-default file), never the final lists:

// system.local.cue (repo root, gitignored, copied from .example)
package capos

#Manifest: extraAuthorizedSshKeys: [{
    keyId:                "host-laptop-ed25519-2026-04"
    principalId:          "local-operator-principal-default"
    algorithm:            "ssh-ed25519"
    publicKey:            "hex:..."          // see how-to doc
    fingerprintSha256:    "..."
    allowedShellProfiles: ["operator"]
    source:               "manifest"
    comment:              "host laptop"
}]

The principal id stays the fixed 32-byte canonical value — the overlay does not derive a per-user principal id. Display strings change with @tag(user); the audit-correlatable identity does not.

Worked extension scope (slices 2 and 3)

The overlay ships supporting these operator extensions:

  • MOTD: re-declare #Manifest.kernelParams.motd in the overlay with a concrete string. The default is string | *"...", so a more concrete overlay value wins under CUE unification.
  • Console password verifier: override #Manifest.kernelParams.consolePasswordVerifierPhc (Argon2id PHC string) so the development verifier shipped by the defaults package is replaced for any non-research deployment.
  • Extra SSH keys for the host operator: append to extraAuthorizedSshKeys with principalId matching the canonical operator. Multiple keys allowed.
  • Extra non-operator principals: append to extraSeedAccounts with kind: "guest", kind: "service", or future kinds. Adding a second kind: "operator" is not supported in slice 2kernel/src/cap/mod.rs::operator_seed_account rejects manifests with more than one operator seed for password login. Multi-operator support is a separate change in the user-identity-and-policy track.
  • Extra resource profiles: append to extraResourceProfiles for custom quota templates referenced by extra accounts.
  • Extra boot binaries: append to extraBinaries with name and repo-relative path. The defaults package concatenates the list onto its _baseBinaries so mkmanifest embeds the operator binary into manifest.bin alongside the default service set.
  • Extra init-launched services: append to extraServices with name, binary (resolved against binaries), restart, and the cap graph the service should receive at spawn. The defaults concatenate operator extras after _baseServices, so init starts the operator service after the default chat server, remote-session gateway, and shell.

Task 4 closeout (2026-05-03 18:51 EEST): system.local.cue.example covers every extension above. The plan calls for make run as the verification target, but make run is interactive, so verification ran make manifest (default MANIFEST_SOURCE=system.cue, package mode --package capos) with the example copied to system.local.cue. The package-mode rebuild emitted the operator MOTD into manifest.bin (3 services, 12 binaries → 2551416 bytes, log target/manifest-refreshed-example.log); rebuilding the same target with the overlay absent produced 2553224 bytes, confirming the operator MOTD overrode the defaults’ default value. make run-smoke was not a useful overlay verification because that target builds manifest-smoke.bin from system-smoke.cue in single-file mode (no --package flag, no sibling-file unification); md5 of manifest-smoke.bin was identical with and without the overlay file present.

The proposal does not generate the SSH hex/fingerprint conversion in the Makefile — that lives in docs/configuration.md as a short ssh-keygen -lf ~/.ssh/id_ed25519.pub + xxd/base64 -d pipe. Keeping this manual avoids importing arbitrary host SSH keys into the boot manifest by default.

5. Single-account-multi-auth invariant

The host operator is one account with potentially many authentication bindings:

  • Password verifier — current consoleCredential PHC blob; bound to the host operator account by being declared at the same manifest scope (today there is no explicit principalId reference in the credential record, but the kernel resolves the operator principal from the seed account at session-mint time).
  • SSH public keys — multiple records in authorizedSshKeys, each carrying principalId matching the host operator’s seed account.
  • Future passkey/OIDC bindings — same pattern; the user-identity-and-policy proposal already shows ExternalIdentityBinding shaped this way.

The kernel’s operator_session_metadata already pulls the principal from the manifest seed account when present (see kernel/src/cap/session_manager.rs OperatorSeedAccount); the hardcoded compatibility fallback fires only when no seed account is declared. Once system.cue declares the host-operator seed account explicitly, both password login and SSH public-key login mint a session for the same principal. The AuthorityBroker.shellBundle path is unchanged — it already routes through the AccountStore by principal id (after the SSH AccountStore-bound auth slice landed at commit 33100f4).

Importantly: this is not a kernel change. It is a manifest-shape choice that makes the existing kernel resolution path the canonical one. The bootstrap fallback (no seed account → hardcoded operator principal) stays in place for focused proofs that intentionally test the no-account-store path.

Migration Plan

SliceScopeRisk
1 (this)Proposal + task ledger pointer + index entry. No code.None.
2Makefile (CAPOS_TOOLS_ROOT default, CAPOS_CUE_TAGS sentinel-file dependency for make run, manifest-rule prerequisites for the defaults package and system.local.cue); cue/defaults/defaults.cue; system.cue rewrite (stays at repo root, becomes package capos); system.local.cue.example (committed at repo root); tools/mkmanifest package-mode flag (--package capos switching to cue export <dir>:capos), tag pass-through, and updated expected_cue_path for the new tools-root default; docs/configuration.md; CLAUDE.md project-layout note.Medium — touches Makefile, mkmanifest CLI surface, the default boot manifest, and adds a new package directory. Smoke harness assertions on principal=operator must keep passing because slice 2 leaves the default tag at operator.
3Migrate focused-proof variants onto the defaults package. Closed at commit a50f610d (2026-05-03 21:54 UTC): Task 2 migrated the owned set (see the Slice-3 inventory table below), Task 3 tightened the manifest decoder to reject unknown root fields with regression tests at commit f3d89757 (see the Slice-3 Task-3 closeout below), Task 4 refreshed system.local.cue.example and docs/configuration.md to cover every defaults-package extension hook, and Task 5 stamped this status header, the task ledger System Configuration ad-hoc bullet, and the docs/changelog.md entry. One commit per variant or grouped by audit area.Low per variant once slice 2 is in. Coordinated with parallel agents to avoid worktree collisions.
4Add mkmanifest cue-to-capnp, a general host-side conversion path for CUE-authored data messages rooted at a caller-specified Cap’n Proto struct. The tool reuses the slice-2 CUE package/tag machinery, validates both CAPOS_CUE and CAPOS_CAPNP against the pinned per-user tool cache, checks cue version v0.16.0 and Cap'n Proto version 1.2.0, passes import paths through safe Command arguments, and writes the converted binary only after capnp convert json:binary succeeds.Low for boot behavior because the existing manifest pipeline is unchanged. Medium host-tool risk because schema, CUE, and JSON are hostile inputs; the implementation delegates Cap’n Proto type rules to the pinned upstream converter and keeps filesystem/process boundaries explicit.

Slice-3 manifest inventory

The table below records the migration state of every repo-root system-*.cue manifest at the slice-3 Task-2 closeout. “Imports defaults” means the file declares a CUE package and pulls in capos.local/cue/defaults. “Migration shape” distinguishes between manifests that unify the full defaults.#DefaultSystem scaffold (and inherit MOTD, seed accounts, resource profiles, the base service graph, etc.) and focused-proof manifests that intentionally reference the defaults package only as a constant lookup for schemaVersion, logLevel, and UART configuration. Both shapes are valid migration targets — focused proofs need a narrow cap graph and cannot inherit the default service tree.

ManifestPackageImports defaultsMigration shapeDriven by
system.cuecaposyesfull scaffoldmake run, make remote-session-ui
system-spawn.cuespawnyesconstant lookupmake run-spawn
system-shell.cueshellyesconstant lookupmake run-shell
system-terminal.cueterminalyesconstant lookupmake run-terminal
system-credential.cuecredentialyesconstant lookupmake run-credential
system-login.cueloginyesfull scaffoldmake run-login
system-login-setup.cueloginsetupyesfull scaffoldmake run-login-setup
system-local-users.cuelocalusersyesconstant lookupmake run-local-users
system-revocable-read.cuerevocablereadyesconstant lookupmake run-revocable-read
system-memoryobject-shared.cuememoryobjectsharedyesconstant lookupmake run-memoryobject-shared
system-restricted-shell-launcher.cuerestrictedshelllauncheryesconstant lookupmake run-restricted-shell-launcher
system-chat.cuechatyesfull scaffoldmake run-chat
system-smoke.cuesmokeyesfull scaffoldmake run-smoke, make run-diagnostics, make run-iommu-acpi, make run-acpi-pcie, make run-net, make run-uefi, make run-pci-nvme, make run-ringtap-failing-call
system-session-context.cuesessioncontextyesconstant lookupmake run-session-context
system-ipc-zerocopy.cueipczerocopyyesconstant lookupmake run-ipc-zerocopy
system-service-object-routing.cueserviceobjectroutingyesconstant lookupmake run-service-object-routing
system-tcp-listen-authority.cuetcplistenauthorityyesconstant lookupmake run-tcp-listen-authority
system-capnp-chat-interop.cuecapnpchatinteropyesconstant lookupmake run-capnp-chat-interop-vm
system-thread-scale.cuethreadscaleyesconstant lookupmake run-thread-scale
system-smp-process-scale.cuesmpprocessscaleyesconstant lookupmake run-smp-process-scale
system-remote-session-capset-interop.cueremotesessioncapsetinteropyesconstant lookupmake run-remote-session-capset-interop-vm
system-remote-session-adventure-interop.cueremotesessionadventureinteropyesconstant lookupmake run-remote-session-adventure-interop-vm
system-ssh-host-key.cuesshhostkeyyesconstant lookupmake run-ssh-host-key
system-ssh-authorized-key.cuesshauthorizedkeyyesconstant lookupmake run-ssh-authorized-key
system-ssh-public-key-session.cuesshpublickeysessionyesconstant lookupmake run-ssh-public-key-session
system-ssh-public-key-auth.cuesshpublickeyauthyesconstant lookupmake run-ssh-public-key-auth
system-ssh-feature-policy.cuesshfeaturepolicyyesconstant lookupmake run-ssh-feature-policy
system-paperclips.cuenoneyesdemo-owned scaffold usemake run-paperclips
system-adventure.cuenoneyesdemo-owned scaffold usemake run-adventure
system-measure.cuenonenounmigrated; owned by measure-mode-repair planmake run-measure

system-paperclips.cue and system-adventure.cue are demo-owned and not part of the slice-3 conflict surface. They already pull #DefaultSystem for the operator account fixture but stay package-less because their make run-* targets predate the package-mode flag. Migrating them onto a package paperclips / package adventure shape is a follow-up coordinated through the demo plans rather than slice 3. system-measure.cue waits for docs/backlog/scheduler-evolution.md to close, then can be migrated in its own batch.

All manifests added after the Slice-3 closeout (C payload manifests, DDF grant manifests, hardware-audit variants, POSIX adapter smokes, WASI smokes, wasm-host, thread-fairness variants, scheduler/scheduling-context, limit proofs, and remote-session variants) follow the same convention: each declares its own CUE package and imports capos.local/cue/defaults. The table above is a Slice-3 migration snapshot; it is not exhaustive of all current repo-root system-*.cue files.

Slice-3 Task 3 closeout

Closed 2026-05-03 20:22 UTC at commit f3d89757. The SystemManifest CUE decoder (capos-config/src/manifest.rs) now validates the document root against an explicit allow-list and returns Error::UnknownField { path, field, expected } for any other top-level name. The accepted set lives in the decoder (SYSTEM_MANIFEST_ROOT_FIELDS) and is schemaVersion, binaries, initConfig, kernelParams — adding a future field is a deliberate edit to that list. Two host-side tests in capos-config/src/manifest.rs (system_manifest_rejects_unknown_root_field and system_manifest_accepts_only_known_root_fields) pin both the rejection path and the positive case so a regression is caught by cargo test-config before any QEMU run. The Cap’n Proto schema for SystemManifest is closed by construction, so the strictness check only needs to live at the CUE/JSON boundary; capnp decode paths remain unchanged. The slice-3 inventory above guarantees that every owned focused-proof manifest already projects only those four fields at the document root, so the rule does not break any migrated manifest. docs/configuration.md records the operator-facing behavior of the new error.

Slice 2 is intentionally minimal so that any breakage shows up on the default make run / make run-smoke path immediately, rather than hidden behind a fan-out of converted variants.

Slice 4 deliberately does not make CueValue universal. CueValue remains the project-defined generic tree used inside SystemManifest.initConfig. The general converter has a different contract:

mkmanifest cue-to-capnp \
  [--package capos] [--tag key=value ...] \
  [--import-path schema ...] [--no-standard-import] \
  input.cue schema/example.capnp Example output.bin

input.cue is exported as JSON, then the pinned Cap’n Proto tool validates that JSON against schema/example.capnp and root struct Example. This covers normal Cap’n Proto data fields, nested structs, lists, enums, unions, defaults, and imports according to upstream capnp convert semantics. It does not serialize live capOS capability table entries or meaningful Cap’n Proto interface objects; authority still travels through capOS capability transfer mechanics, not through JSON-authored data files.

Cross-References

  • Manifest and Service Startup — describes the CUE evaluation, boot manifest build, and general cue-to-capnp host-tool flow that this proposal extends.
  • Local Users, Storage, and Policy — Gate 1 manifest-seeded accounts; this proposal shapes the default manifest’s seed account to match the single-account-multi-auth invariant the backlog calls for.
  • Run Targets, Init Mandate, and Default-Run Integration — Gate D (default-make run integration); this proposal makes Gate D closure for the SSH milestone tractable by giving the default manifest a clean place to absorb optional services and authorized keys.
  • SSH Shell Gateway — consumes the host-user authorized-key surface in a future slice once OpenSSH transport gates land.
  • User Identity and Policy — defines the principal/account/session model and ExternalIdentityBinding shape that this proposal’s single-account-multi-auth invariant relies on. Multi-operator support is tracked there.
  • Service Architecture — primary consumer of the layered manifest: initConfig.services and the extraServices extension hook described above feed the authority-at-spawn service graph. The defaults package owns the base service tree (chat server, remote-session gateway, shell); overlays append operator-owned services without forking it.
  • Userspace Binaries — defines the binary set the layered manifest embeds. The binaries/extraBinaries shape covers native Rust capos-rt binaries, libcapos C-substrate binaries, the POSIX adapter binaries, and the wasm-host binary uniformly; per-language payload conventions (for example, the wasm-host’s stable wasi-payload manifest name) are documented there.
  • POSIX Adapter — POSIX adapter smokes (make run-posix-dns-smoke, make run-posix-pipe-smoke, make run-posix-stdio-smoke) are driven by focused system-posix-*.cue manifests that live in the same package-mode/overlay regime as the rest of the migrated manifest set. Operator-installable POSIX-ported services attach through extraBinaries/extraServices and inherit the same authority-at-spawn grants the default service tree uses.
  • WASI Host Adapter — per-instance text grants (initConfig.init.wasiArgs, initConfig.init.wasiEnv) are CUE-authored manifest fields that flow through this proposal’s package-mode evaluation; the manifest-decoder strictness invariant closed in Slice 3 Task 3 is the same gate that catches mistyped WASI argv/env field names before a payload boots.
  • System Info Capability — adjacent precedent for “rename + structural cleanup + worked Phase 2”; this proposal adopts the same status-header and cross-reference shape.
  • Trusted Build Inputs — needs entries for the new cue/defaults/defaults.cue, the system.local.cue overlay surface, the CAPOS_CUE_TAGS environment variable (and the target/-side sentinel that records it), and the host $USER value injected via @tag(user) — all become trusted boot-manifest inputs once slice 2 lands.

Non-Goals

  • This proposal does not auto-ingest ~/.ssh/id_ed25519.pub into the manifest. The system.local.cue.example shows how the operator ingests their key explicitly. Auto-ingestion is a separate decision that has security implications (which keys count? how is the hex/fingerprint conversion validated?) and should not be bundled with the configuration-shape change.
  • This proposal does not auto-start ssh-gateway in system.cue. The SSH gateway service is added when its OpenSSH transport gates close (decomposed in docs/backlog/runtime-network-shell.md). Until then, an authorized SSH key declared in system.local.cue is plumbing-only.
  • This proposal does not introduce a CUE-level imperative “include if file exists” mechanism. CUE’s same-package unification already provides the overlay behavior; the operator’s only action is to drop a file with the right package capos header.
  • This proposal does not define a remote operator-extension delivery channel (cloud-metadata, fleet config). Those are addressed by cloud-metadata-proposal.md and stay separate.

Open Questions

  • Whether principalId should ever follow the host user. This proposal fixes principalId at 32 bytes (local-operator-principal-default) so audit history is stable even if $USER changes. A future per-user-derived principal id would need a deterministic, validated 32-byte derivation and a rollover plan; that is out of scope here.
  • Where system.local.cue lives. This proposal places it at the repo root next to system.cue. That scopes the overlay to the same package capos CUE loads in package-mode export, keeps binary path resolution unchanged, and is gitignored cleanly. Focused-proof manifests are not picked up by package capos export because migrated variants use separate package names and unmigrated variants declare no package directive — so this is settled.
  • Whether to migrate focused proofs to the defaults package. Slice 3 assumes yes because it removes copy-paste, but each variant must keep its proof shape and checks. The Slice-3 inventory table above records the migration state for every repo-root system-*.cue manifest. The intentionally divergent system-measure.cue is left for a follow-up batch keyed off the measure-mode-repair plan.
  • Tag injection for run-shell / run-terminal / focused interactive proofs. Slice 2 only wires make run. If make run-shell should also personalize, slice 3 adds it; if focused proofs should always use operator, slice 3 leaves them alone.

Proposal: Cryptography and Key Management

Capability-native abstractions for cryptographic keys and key sources. Keys are capability objects; key material never crosses cap boundaries. One interface serves every consumer — volume encryption, TLS, code signing, instance identity, authenticated backups, per-service secrets.

Implementation Status

This proposal is partially implemented. schema/capos.capnp now contains the minimal SymmetricKey, PrivateKey, and PublicKey ABI plus a RAM-only KeyVault subset needed by the TLS/ACME precursor. capos-tls provides host-tested RAM-only XChaCha20 plus HMAC-SHA256 authenticated encryption, HMAC-SHA256 MAC/verify, and P-256 signing cores. A development-only software KeySource bootstrap now mints TLS and ACME account key handles for local proofs, labels the source as non-production, and is rejected by production/public profiles. The implemented key surface requires an explicit requested KeyPurpose, exports only public material (spkiDer and P-256 JWK for ACME account JWS registration), lists non-secret vault/source metadata, and has no raw symmetric or private-key export surface. There is still no runtime key service, persistence, hardware/cloud custody, symmetric-key derivation or wrapping, ACME protocol, TLS server handshake, or production KeySource.

The first implementation chain is the narrow TLS/ACME precursor owned by Certificates / TLS:

  • crypto-privatekey-publickey-ram-signing-local-proof – done 2026-06-04: minimal PrivateKey / PublicKey schema and RAM signing proof for TLS server keys and ACME account JWS keys.
  • crypto-keyvault-ram-privatekey-custody-local-proof – done 2026-06-05: RAM-only KeyVault handles for those private keys, with generation, open/list/destroy, purpose separation, and stale-handle failure.
  • crypto-development-keysource-tls-acme-bootstrap-local-proof – done 2026-06-05: development-only software KeySource bootstrap for local TLS/ACME proofs, rejected for production/public profiles.

That precursor intentionally excludes persistent storage, TPM, cloud KMS, passphrase/passkey unlock, raw private-key import, ACME protocol, and TLS server handshakes. It also tightens the TLS/ACME invariant: raw private-key material is not written to manifests, boot images, logs, task records, or evidence.

The capability-infrastructure reconciliation (cap-infra-crypto-key-caps-phase1-reconcile-local-proof, done 2026-06-06) added the minimal RAM-only SymmetricKey ABI and local proof for XChaCha20 plus HMAC-SHA256 authenticated encryption and HMAC-SHA256 MAC/verify. It follows the same RAM-only rule for symmetric key bytes and adds no key export, persistence, wrapping, or production custody.

Problem

Nearly every forthcoming capOS subsystem wants cryptography. A partial list:

  • Volume encryption at rest (Volume Encryption).
  • TLS termination in the web text shell gateway (Boot to Shell).
  • Inter-service mTLS on a multi-host capability graph (Networking).
  • Instance identity tokens (signed JWTs) produced from cloud hypervisor metadata (Cloud Metadata).
  • WebAuthn/passkey public-key verification for login.
  • Signed audit logs (System Monitoring).
  • Signed boot manifests and measured boot (Storage and Naming Open Question #5).
  • Cloud KMS integration (envelope encryption for volumes and object stores).
  • Future: signed release artifacts, encrypted swap, session tokens.

Without a shared abstraction each of these invents its own key interface, its own “where does the key live” story, and its own audit trail. That is how Linux ended up with dm-crypt, fscrypt, keyctl, PKCS#11, ssh-agent, gpg-agent, systemd-creds, TPM tools, and cloud-specific SDKs as mutually-unaware silos. capOS is young enough to avoid that.

Design Principle: Keys Are Capabilities

In every Unix-lineage system, a key is a byte string — a secret stored somewhere (keyring, file, memory, HSM handle), protected by a mechanism orthogonal to the system’s main abstractions (syscalls + files + processes). Every new subsystem therefore invents a new protection mechanism.

In capOS, a key is a capability object. Holding a SymmetricKey or PrivateKey cap means “you may compute with this key.” It does not mean “you may see this key.” Key material lives in the address space of the service that implements the cap; callers reach it by invoking methods.

Consequences:

  • Attenuation falls out of the capability model. A decrypt-only SymmetricKey is a wrapper CapObject that rejects encrypt. A key bound to a single AAD domain is a wrapper that fixes the aad argument. A sign-only PrivateKey is a wrapper that rejects decrypt. No new kernel mechanism is needed.
  • Revocation is a cap drop. Drop the cap, the key is gone from that holder’s reach. Other holders are unaffected.
  • Audit is intrinsic. Every method invocation can flow through an audit cap. A malicious service granted decrypt authority generates audit records for every use; it cannot exfiltrate the raw key material silently.
  • Hardware isolation composes cleanly. A TPM-backed key service implements the same PrivateKey interface as an in-process software key service; callers cannot distinguish, and should not need to.

A service granted a SymmetricKey with both encrypt and decrypt can still run arbitrary oracle queries against the key. That is weaker than “the key material never leaves an HSM” and stronger than “the key is a byte string in the process heap.” When stronger containment is required, the key service is a thin process sitting on top of a hardware primitive (TPM, Secure Enclave, cloud KMS).

Schemas

Symmetric keys

interface SymmetricKey {
    # Authenticated encryption. The Phase-1 RAM implementation supports
    # `xchacha20HmacSha256` only: XChaCha20 stream encryption with HMAC-SHA256
    # authentication. It generates a fresh nonce internally and returns
    # ciphertext plus tag separately so callers cannot choose nonce reuse.
    encrypt @0 (plaintext :Data, aad :Data, purpose :KeyPurpose)
            -> (ciphertext :Data, nonce :Data, tag :Data);

    # Authenticated decryption. `aad`, `nonce`, and `tag` must match the values
    # from `encrypt`; failures return an application error, not plaintext.
    decrypt @1 (ciphertext :Data,
                nonce :Data,
                tag :Data,
                aad :Data,
                purpose :KeyPurpose)
            -> (plaintext :Data);

    # MAC-only modes for keys with `KeyPurpose.integrity`.
    mac    @2 (message :Data, purpose :KeyPurpose) -> (tag :Data);
    verify @3 (message :Data, tag :Data, purpose :KeyPurpose) -> (ok :Bool);

    info @4 () -> (algorithm :SymmetricAlgorithm,
                   purpose :KeyPurpose,
                   identifier :Data);
}

enum SymmetricAlgorithm {
    aes256Gcm         @0;
    aes256GcmSiv      @1;
    xchacha20Poly1305 @2;
    aes256Xts         @3;  # block-device only; no authentication
    hmacSha256        @4;  # mac/verify only
    hmacSha384        @5;
    hmacSha512        @6;
    xchacha20HmacSha256 @7;  # landed local proof construction
}

Subkey derivation and key wrap/unwrap remain outside the landed Phase 1 ABI. Later slices that add them must allocate new method ordinals after info @4 instead of reusing the Phase 1 slots.

Asymmetric keys

interface PublicKey {
    # Verify only for the requested purpose. A public key derived from a
    # TLS certificate key rejects an ACME account verification request, and
    # vice versa.
    verify    @0 (message :Data,
                  signature :Data,
                  scheme :SignatureScheme,
                  purpose :KeyPurpose)
              -> (ok :Bool);
    # Export raw public material (SPKI DER, JWK, OpenSSH, PGP) for
    # callers that need to distribute it. Public material is freely
    # shareable; the cap itself is an authority only to invoke
    # methods, not to "own" the public key.
    export    @1 (format :PublicKeyFormat) -> (encoded :Data);
    info      @2 () -> (algorithm :AsymmetricAlgorithm,
                        purpose :KeyPurpose,
                        identifier :Data);
}

interface PrivateKey {
    # Sign only for the requested purpose. The first implementation accepts
    # P-256 with `default` / `ecdsaSha256` and rejects other schemes.
    sign      @0 (message :Data,
                  scheme :SignatureScheme,
                  purpose :KeyPurpose)
              -> (signature :Data);
    public    @1 () -> (pk :PublicKey);
    info      @2 () -> (algorithm :AsymmetricAlgorithm,
                        purpose :KeyPurpose,
                        identifier :Data);
}

enum AsymmetricAlgorithm {
    ed25519      @0;
    x25519       @1;
    p256         @2;
    p384         @3;
    rsa2048      @4;
    rsa3072      @5;
    rsa4096      @6;
    # Post-quantum placeholders; added as capOS ships them.
    mlKem768     @7;  # ML-KEM (Kyber) for KEM
    mlDsa65      @8;  # ML-DSA (Dilithium) for signatures
}

enum SignatureScheme {
    default      @0;  # algorithm's natural default (Ed25519 pure, RSA-PSS, etc.)
    ecdsaSha256  @1;
    ecdsaSha384  @2;
    rsaPssSha256 @3;
    rsaPssSha512 @4;
    rsaPkcs1Sha256 @5;  # for compatibility only
}

enum PublicKeyFormat {
    spkiDer     @0;
    jwk         @1;
    opensshWire @2;
    pgpPacket   @3;
}

Shared metadata

enum KeyPurpose {
    generic       @0;
    blockVolume   @1;
    objectStore   @2;
    envelope      @3;   # KEK — only wraps/unwraps
    integrity     @4;   # MAC-only
    tls           @5;
    codeSigning   @6;
    instanceIdentity @7;
    authToken     @8;   # session tokens, JWTs
    webauthn      @9;
    audit         @10;
    oauthClientAssertion @11;  # RFC 7523 private_key_jwt client auth
    oidcIdToken   @12;          # IdP-side ID token signing (LocalIdentityProvider)
    dpopBinding   @13;          # RFC 9449 proof-of-possession keypairs
    acmeAccount   @14;          # RFC 8555 account JWS signing
}

identifier (bytes in info()) is an opaque, stable handle usable for logging, correlating audit records, and looking up the key in a KeyVault. It is not a secret. It is not a cryptographic hash of the key (that would let an attacker confirm a guessed key); it is a random ID chosen at key creation.

Key sources

A KeySource produces keys given some unlock context. Different implementations realize different trust models.

interface KeySource {
    # Produce a key given an unlock context (passphrase bytes, a
    # passkey assertion, a sealed blob, an attestation report, empty
    # for sources that hold keys directly).
    unlockSymmetric @0 (context :Data, purpose :KeyPurpose)
                    -> (key :SymmetricKey);
    unlockPrivate   @1 (context :Data, purpose :KeyPurpose)
                    -> (key :PrivateKey);

    # Seal a key under this source's policy. The returned blob can be
    # stored in the clear; unlock will refuse to produce the key
    # unless its policy is satisfied.
    sealSymmetric @2 (key :SymmetricKey, policy :SealPolicy)
                  -> (blob :Data);
    sealPrivate   @3 (key :PrivateKey, policy :SealPolicy)
                  -> (blob :Data);

    # Rewrap: unseal under current policy, reseal under new policy.
    # Used for KEK rotation without touching the underlying key.
    rewrap @4 (blob :Data, newPolicy :SealPolicy) -> (newBlob :Data);

    info @5 () -> (kind :KeySourceKind, identifier :Data);
}

enum KeySourceKind {
    manifestEmbedded @0;  # dev/CI only
    passphrase       @1;
    passkeyPrf       @2;  # WebAuthn PRF extension
    tpm2             @3;
    secureEnclave    @4;
    cloudKms         @5;
    attestation      @6;  # SEV-SNP / TDX / Nitro
    network          @7;  # Tang/Clevis-style
    softwareStored   @8;  # encrypted-at-rest in a KeyVault
    oidcFederated    @9;  # OIDC AccessToken -> KMS / remote unlock, no baked creds
}

struct SealPolicy {
    union {
        none          @0 :Void;
        pcr           @1 :PcrPolicy;
        kms           @2 :KmsPolicy;
        attested      @3 :AttestationPolicy;
        composite     @4 :List(SealPolicy);  # AND of sub-policies
        tokenExchange @5 :TokenExchangePolicy;  # OIDC/OAuth2-gated unlock
    }
}

struct TokenExchangePolicy {
    # The OIDC issuer whose tokens satisfy this policy.
    issuer          @0 :Text;
    # Required token audience (the KMS / STS endpoint).
    audience        @1 :Text;
    # Required subject predicate. Union allows exact or pattern matches
    # without growing this struct; see oidc-and-oauth2-proposal for the
    # full pattern grammar.
    subjectPattern  @2 :Text;
    # Additional required claims (e.g. `groups`, tenant ID, attestation
    # fields). Values are JSON-encoded bytes.
    requiredClaims  @3 :List(NamedClaim);
    # Acceptable LoA levels mapped from `acr`/`amr`.
    minAuthStrength @4 :UInt8;
}

struct NamedClaim {
    name  @0 :Text;
    value @1 :Data;
}

struct PcrPolicy {
    pcrMask   @0 :UInt32;            # bitmap of PCR indices
    pcrDigest @1 :Data;              # expected composite digest
    bank      @2 :TpmHashBank;
}

struct KmsPolicy {
    provider    @0 :Text;            # "aws", "gcp", "azure", "vault", ...
    keyId       @1 :Text;
    grantTokens @2 :List(Text);
}

struct AttestationPolicy {
    platform        @0 :AttestationPlatform;
    measurement     @1 :Data;
    signerPublicKey @2 :Data;
    allowedVariant  @3 :List(Data);  # e.g. permitted firmware versions
}

enum AttestationPlatform {
    sevSnp @0;
    tdx    @1;
    nitro  @2;
}

Key lifecycle — the KeyVault

A KeyVault is a stateful service that stores key material, issues key handles, handles rotation, and emits audit events. It is distinct from KeySource: a KeySource is a factory producing keys; a KeyVault is a registry tracking the keys a deployment knows about. The schema below is the landed RAM-only TLS/ACME subset. Future symmetric-key, import, seal-policy, unlock, persistence, and rotation methods append to this interface; they do not renumber the landed methods.

enum KeyMaterialSource {
    ramGenerated @0;
    imported     @1;
    keySource    @2;
}

interface KeyVault {
    generatePrivate @0 (
        algorithm :AsymmetricAlgorithm,
        purpose :KeyPurpose,
        createdAtEpochSeconds :UInt64,
        auditLabel :Text
    ) -> (handle :KeyHandle, key :PrivateKey);

    openPrivate @1 (handle :KeyHandle) -> (key :PrivateKey);

    list @2 (filter :KeyFilter) -> (entries :List(KeyEntry));

    destroy @3 (handle :KeyHandle, reason :Text) -> ();
}

struct KeyHandle {
    identifier @0 :Data;
    generation @1 :UInt64;
}

struct KeyEntry {
    handle @0 :KeyHandle;
    algorithm @1 :AsymmetricAlgorithm;
    purpose @2 :KeyPurpose;
    createdAtEpochSeconds @3 :UInt64;
    lastUsedEpochSeconds @4 :UInt64;
    source @5 :KeyMaterialSource;
    auditLabel @6 :Text;
}

struct KeyFilter {
    purposes @0 :List(KeyPurpose);           # OR
    algorithms @1 :List(AsymmetricAlgorithm); # OR
}

Concrete Key Sources

Not all of these ship on day one. Phases below give a sequence.

ManifestEmbeddedKeySource — development and CI only

Key material baked into SystemManifest. Unsealable. Boot-time validation refuses to build a production-profile image against this source. Used for QEMU smoke tests and hermetic CI.

Do not use manifest-embedded raw private keys for the TLS/ACME precursor chain. Those local proofs use a development-only software source that generates key handles at boot instead, so private key material does not enter manifests, images, logs, task records, or evidence.

PassphraseKeySource — interactive unlock

Consumes a passphrase from the console login flow (Boot to Shell), runs Argon2id with per-source parameters, derives a KEK, unwraps sealed blobs. No persistent state beyond the salt and KDF parameters (which are public).

PasskeyPrfKeySource — session unlock from WebAuthn

Consumes a WebAuthn assertion whose hmac-secret / PRF extension yields a per-credential symmetric secret. Derives a KEK from the PRF output; KEK unwraps the user’s sealed DEK. Key material never leaves the authenticator; the PRF output never leaves the key service process.

Tpm2KeySource — hardware-bound, measured-boot-gated

A TPM 2.0 driver service holds the TPM; this source wraps it. Seal policies bind keys to PCR digests; unseal succeeds only if the running boot chain matches. Enables unattended boot while keeping the key off the disk.

SecureEnclaveKeySource — platform key stores

Analog for Apple Secure Enclave, Android StrongBox, Intel CSE. Same interface shape as Tpm2KeySource; different backing primitive.

CloudKmsKeySource — cloud envelope encryption

Wraps a cloud KMS (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault, KMIP). Unlock calls the KMS Decrypt operation with a wrapped DEK and returns the plaintext DEK as a SymmetricKey cap. Seal calls KMS Encrypt under a named KEK.

Authentication to KMS uses the InstanceIdentity cap from Cloud Metadata; no long-lived credentials live in the capOS image.

Properties the system gets by following the envelope pattern:

  • Free KEK rotation (rewrap the DEK; volume data is untouched).
  • Revocation by disabling the KMS key or revoking the IAM grant.
  • Cross-account / cross-region access via KMS grants.
  • Every unwrap appears in the cloud provider’s audit log — observability comes for free.

AttestationKeySource — confidential computing

Consumes SEV-SNP, TDX, or Nitro attestation reports. unlock submits the report to a remote verifier (often cloud KMS with attestation policy) which returns the unwrapped DEK only if the report matches an approved measurement. Enables “only this specific capOS image, running on genuine attested hardware, can decrypt this volume.”

NetworkKeySource — Tang / Clevis-style

Unlock derives a key by interacting with one or more remote servers; no single server sees the plaintext key (when combined with secret sharing). Supports the “revoke access by taking the server offline” model without physical-access requirements.

SoftwareStoredKeySource — encrypted on disk, under another source

The recursive case: a source whose seal policy points at another source. Used to compose, e.g., a file-backed key store encrypted under a TPM-sealed master key. The outer source provides integrity (TPM seal); the inner source provides convenience (named key lookup).

OidcFederatedKeySource — token-exchange-gated unlock

Derives a key from a short-lived OIDC/OAuth2 access token. The source holds an OAuthClient or WorkloadIdentityFederation cap (from OIDC and OAuth2). unlock obtains a fresh token for the configured audience — either by exchanging a local InstanceIdentity JWT, a Kubernetes projected service-account token, or a user session’s access token — then presents it to a remote KMS / STS / custom key service which returns the wrapped DEK.

Two common shapes:

  1. Cloud KMS with workload identity federation. Audience is the cloud STS; after token exchange the resulting cloud credential calls KMS Decrypt. Replaces every baked long-lived cloud IAM credential in the image.
  2. Per-user volume. Audience is a capOS-internal key service; the user’s AccessToken cap proves the caller is Alice; the key service enforces TokenExchangePolicy and returns Alice’s DEK.

Properties the envelope + token-exchange pattern gets the system:

  • No long-lived credentials in any capOS image.
  • Per-principal KMS audit (the token sub appears in every KMS decrypt log).
  • Revocation by IdP account disable, token revocation, or KMS grant removal.
  • Step-up authentication gating: a TokenExchangePolicy requiring minAuthStrength >= loa3 means Alice must have MFA-backed acr/amr claims before her volume unlocks.

Consumers

A non-exhaustive list of how this interface is meant to be used. Each consumer either exists as a proposal or is called out as future work.

ConsumerInterfaceKey source
EncryptedBlockDevicesymmetricany
EncryptedNamespacesymmetricpassphrase / passkeyPrf / KMS
TLS termination (web gateway)bothpassphrase / KMS / cloud certs
SSH host key signingprivateKeyVault / softwareStored / KMS
SSH public-key loginpublicCredentialStore / authorized key store
mTLS between servicesbothKeyVault with KMS seal
Instance identity JWT signingprivatecloudKms / softwareStored
Signed audit logsprivateKeyVault, append-only policy
WebAuthn verificationpublicCredentialStore (public keys)
Signed boot manifestspublicpublic key baked into firmware
Encrypted swapsymmetricper-boot ephemeral (in-RAM)
Encrypted backupssymmetricdedicated KMS key
Session tokens (HMAC)symmetricKeyVault, rotated frequently

Relationship to CredentialStore

The CredentialStore in Boot to Shell stores verifiers — WebAuthn public keys, password hashes, recovery codes. Its job is authentication: matching a claim from a user against a stored verifier.

The KeyVault proposed here stores keys — symmetric DEKs, signing private keys, KEKs. Its job is cryptography: producing keys for use by capOS services.

Overlap happens at passkey unlock: the CredentialStore verifies the WebAuthn assertion; the resulting PRF output feeds a PasskeyPrfKeySource that produces a SymmetricKey usable by EncryptedNamespace. Two services, one flow.

Keeping these distinct matters because their audit, retention, and exposure models differ. A CredentialStore can expose every stored entry as metadata (public keys are public) without leaking secrets; a KeyVault cannot. A deployment may want different replication, backup, and recovery policies for authenticators vs. encryption keys.

Threat Model

Separate from the consumer-specific threat models, the crypto/key management service itself has these:

  1. Memory scraping of a live key service. The service holds plaintext keys in RAM. Mitigation: small trusted-computing-base (one crate, audited), mlock the heap (no swap leakage), zeroize on drop, no panic-induced core dumps, cap-scoped access so only callers with a Key cap can trigger operations. Against a kernel exploit, no defense; that is a separate threat.
  2. Oracle abuse. A malicious service granted a SymmetricKey cap uses it as a decryption oracle. Mitigation: granting callers attenuated caps (decrypt-only, aad-pinned). Audit records make abuse detectable.
  3. Side-channel leakage. Timing, cache, power. Mitigation: use constant-time implementations (aes crate’s hardware backend; chacha20poly1305 crate is constant-time), prefer AEAD modes that resist nonce-reuse gracefully (GCM-SIV), avoid bespoke crypto.
  4. Downgrade attacks on algorithm selection. A caller requests a weak algorithm on a key that supports stronger modes. Mitigation: info() records the canonical algorithm; KeyPurpose constrains the method set; algorithm negotiation is the caller’s job, not a feature of the key cap.
  5. Key persistence in unintended places. Kernel DMA buffers, swap, crash dumps, core files. Mitigations are deployment-level (no swap, or encrypted swap with a per-boot key; disable core dumps for the key service process; measure the boot chain so a tampered kernel is detectable).

Phases

Phases align with the subsystems that need keys. Crypto primitives come first; consumers follow their own proposals’ phases.

Future asymmetric-key methods such as public-key encryption, private-key decryption, and key agreement append after this implemented subset in later slices.

Phase 1 — Interfaces and RAM-only implementation

  • Landed first increment: minimal PrivateKey / PublicKey interfaces plus AsymmetricAlgorithm, SignatureScheme, PublicKeyFormat, and KeyPurpose in schema/capos.capnp, backed by host-tested RAM-only P-256 signing in capos-tls. This proves TLS-vs-ACME purpose separation and public export without raw private-key export.
  • Landed second increment: RAM-only KeyVault generation/open/list/destroy, KeyHandle, source metadata, audit labels, and stale-handle fail-closed behavior for TLS and ACME local proofs.
  • Landed third increment: development-only software KeySource bootstrap that mints TLS and ACME account keys into the RAM KeyVault without manifest or evidence private-key bytes, and rejects production/public profiles.
  • Landed fourth increment: minimal RAM-only SymmetricKey ABI plus XChaCha20 stream encryption with HMAC-SHA256 authentication and HMAC-SHA256 MAC/verify cores. The local QEMU proof covers encrypt/decrypt, tag failure, MAC verification, purpose failure, and operation denial without logging raw key material or generated metadata.
  • Remaining Phase 1 surface: production/runtime KeySource services, symmetric-key derivation and wrapping, and any broader enum/struct metadata those services need.
  • Implement a RAM-only key service using vetted Rust crates (aes-gcm-siv, chacha20poly1305, ed25519-dalek, x25519-dalek, p256, rsa, hmac, hkdf). No persistence. Pure interface exercise.
  • ManifestEmbeddedKeySource for dev/CI.
  • Host tests: AEAD round-trips, signature round-trips, key agreement, fuzz the decrypt/verify paths.

Phase 2 — KeyVault with in-memory storage

  • Landed local-proof subset: RAM-only key generation, handle-based lookup, metadata listing, destroy, and stale-handle refusal.
  • Remaining production-oriented surface: sealed blob storage.
  • rotateSeal implementation (metadata-only KEK rotation).
  • Policy enforcement for seal/unseal.
  • Audit cap integration (System Monitoring).

Phase 3 — Persistent KeyVault over the Store

  • Sealed blobs live in a Store or Namespace.
  • Access control: KeyVault cap is itself attenuable (read-only, purpose-filtered).
  • Cross-reboot survival requires the Store, which requires persistent storage tracked in docs/roadmap.md.

Phase 4 — PassphraseKeySource and PasskeyPrfKeySource

  • Passphrase flow wires into console login.
  • PasskeyPRF flow wires into WebAuthn assertions from the web text shell gateway.
  • Per-user EncryptedNamespace becomes implementable end-to-end.

Phase 5 — Tpm2KeySource

  • TPM 2.0 driver as a userspace service (separate crate; talks to the TPM over x86 platform TIS or a virtio passthrough in cloud VMs).
  • Seal policies bound to PCR digests.
  • Measured-boot chain definition (firmware → bootloader → kernel → init → key service). PCR composition documented.

Phase 6 — CloudKmsKeySource

  • AWS KMS first; GCP KMS, Azure Key Vault, HashiCorp Vault, KMIP follow.
  • Depends on InstanceIdentity from cloud-metadata and a functioning network stack.
  • Cross-region / cross-account grant handling documented.

Phase 6b — OidcFederatedKeySource

  • Depends on OAuthClient and WorkloadIdentityFederation from OIDC and OAuth2.
  • Workload identity federation to cloud KMS (no baked long-lived IAM credentials). Subject token sources: InstanceIdentity, attestation report envelope, Kubernetes projected token, GitHub Actions OIDC.
  • Per-user volume unlock via user AccessToken against a capOS-internal key service honoring SealPolicy.tokenExchange.
  • TokenExchangePolicy enforcement for seal/unseal.

Phase 7 — AttestationKeySource

  • SEV-SNP, TDX, or Nitro — whichever the first target cloud environment requires.
  • Verifier can be cloud KMS with attestation policy or a standalone service.

Phase 8 — Post-quantum migration

  • Add ML-KEM and ML-DSA to the algorithm enums when capOS picks its PQ stack. Primarily a schema evolution and an added sign / agree path; no change to the interface shape.

Relationship to Other Proposals

  • volume-encryption-proposal.md — primary first consumer. EncryptedBlockDeviceFactory.open(raw, key, format) and EncryptedNamespace both take a SymmetricKey cap defined here (KeyPurpose.blockVolume / objectStore, typically aes256GcmSiv / aes256Xts / xchacha20Poly1305). Per-user session unlock invokes PasskeyPrfKeySource.unlockSymmetric (Phase 4) to mint the user DEK; system volumes unwrap a DEK through Tpm2KeySource or CloudKmsKeySource (Phases 5–6). KeyVault owns the sealed DEK blob and applies SealPolicy on every unlock; rotateSeal is how that proposal achieves KEK rotation without rewriting volume data.
  • boot-to-shell-proposal.mdCredentialStore stores authenticator verifiers; PasskeyPrfKeySource here produces keys from assertions that pass CredentialStore verification.
  • networking-proposal.md — TLS and mTLS need PrivateKey/PublicKey; instance mTLS bootstraps from a CloudKmsKeySource or KeyVault-issued service identity key.
  • ssh-shell-proposal.md — SSH host keys are sign-only PrivateKey wrappers backed by KeyVault; accepted OpenSSH-format public keys are verifier material that map to sessions but never grant shell authority directly.
  • certificates-and-tls-proposal.md — layers X.509, trust stores, CT, OCSP, pinning, ACME, and TLS config on top of the keys defined here. TlsServerConfig.key() and TlsClientConfig.clientAuth() return a PrivateKey cap minted by this proposal, typically generated by KeyVault.generatePrivate( algorithm, KeyPurpose.tls, policy). ACME account JWS signing uses a purpose-separated KeyPurpose.acmeAccount key; ACME enrollment (AcmeClient.requestCertificate(orderId, certKey, ...)) consumes the TLS certificate PrivateKey from the same KeyVault. CA private keys live in KeyVault under a strict SealPolicy (typically pcr or composite KMS + attestation). Public material flows through PublicKey.export(PublicKeyFormat.spkiDer) into that proposal’s certificate chain and trust-store structures, so this proposal’s cap boundary is the only place TLS private material is reachable.
  • oidc-and-oauth2-proposal.md — OIDC/OAuth2 client, token, JWKS, JWT wrapper, DPoP, and workload identity federation caps compose with the keys defined here. OidcFederatedKeySource and SealPolicy.tokenExchange (with TokenExchangePolicy / NamedClaim / minAuthStrength) live in this proposal because they are key-source shapes; the token protocol frame, discovery, JWKS handling, grant types, and verifier live there. JwtSigner and JwtVerifier are thin wrappers defined there that hold a PrivateKey / PublicKey from here and bind it to a fixed (issuer, audience, claim_constraints) tuple before emitting compact-serialized JWTs. KeyPurpose.oauthClientAssertion tags the key that ClientAuthMethod.privateKeyJwt and localPrivateKeyJwt sign with (RFC 7523 §2.2 client assertion against the token endpoint or a local STS). KeyPurpose.oidcIdToken tags the IdP-side signing key held by LocalIdentityProvider and published in its Jwks rotation set. KeyPurpose.dpopBinding tags the per-client DPoP keypair surfaced as DpopKey so AccessToken results stay jkt-bound (RFC 9449). Token-exchange-gated unlock flows in Phase 6b consume AccessToken and WorkloadIdentityFederation caps from that proposal and feed the cloud KMS or capOS-internal key service named in TokenExchangePolicy.audience.
  • cloud-metadata-proposal.mdInstanceIdentity cap consumed by CloudKmsKeySource and AttestationKeySource.
  • user-identity-and-policy-proposal.md — per-user keys are bound to session identity; the same cap chain that says “you are Alice” yields Alice’s SymmetricKey via PasskeyPrfKeySource.
  • cloud-deployment-proposal.md — hardware abstraction for self-encrypting drives sets up a future SelfEncryptingBlockDevice cap with hardware-held keys, a distinct trust model from software-crypto keys here.
  • security-and-verification-proposal.md — crypto is a top target for tiered tooling: constant-time linting, AEAD fuzzing, Loom models of the unlock state machine, Kani-style proofs of nonce-uniqueness.
  • system-monitoring-proposal.md — every Key method call, every KeyVault operation, and every KeySource.unlock should flow through the audit cap. Schema for audit events is defined there; key-management produces a specific event family.
  • hardware-audit-persistence-proposal.md — the DDF audit step 1 schema (SegmentHeader and durable-path HardwareAuditRecord fields, landed in schema/capos.capnp) can use SymmetricKey.mac (HMAC, KeyPurpose.integrity) and PrivateKey.sign (asymmetric signing) to seal each audit segment. KeyPurpose.audit is the intended tag for signing keys held by the audit log service. Phase 1 of this proposal (RAM-only key service) is the minimum prerequisite for that signing path to become functional.
  • formal-mac-mic-proposal.md — includes GOST-style modeling. GOST symmetric (Kuznyechik, Magma) and asymmetric (Streebog-signed schemes) algorithms can be added to the enums when a deployment requires them.
  • storage-and-naming-proposal.md — Open Question #5 (manifest trust, secure boot) is a prerequisite for Tpm2KeySource to be meaningful.
  • ../design-risks-register.md — R14 (durable identity / session liveness) lists this proposal among its owners: per-user EncryptedNamespace unlock, session-token HMAC keys, and LocalIdentityProvider ID-token signing keys all live behind KeyVault and KeySource here, so durable identity work cannot land before persistent KeyVault (Phase 3) plus PassphraseKeySource / PasskeyPrfKeySource (Phase 4) do.

Open Questions

  1. Canonical algorithm set for v1. Overshooting the enum invites implementation sprawl; undershooting forces schema evolution early. Proposed minimum: aes256GcmSiv, xchacha20Poly1305, hmacSha256, ed25519, x25519. Add rsa*, p256, post-quantum as real consumers arrive.
  2. Does SymmetricKey expose raw encrypt-without-AAD? AEAD with empty AAD is trivially expressible, but some callers may want explicit guarantees that non-AEAD modes are unavailable. Decide whether the interface permits aad == Data() universally or whether KeyPurpose constrains it.
  3. Public key distribution. PublicKey is a cap, but public material is public — should there be a “public key is freely-shareable bytes” escape hatch outside the cap system? Probably yes; export() exists for exactly that reason. How does a caller obtain a PublicKey cap from raw bytes? Via a PublicKeyImporter factory that verifies format, or directly in KeyVault.importPublic?
  4. Revocation of in-flight caps. If a SymmetricKey cap is granted to 10 services and the key is compromised, can the issuer revoke it? capOS cap revocation is generally “drop at each holder”; this might warrant a KeyVault.revoke(handle) that breaks the server-side object so every encrypt/decrypt returns an error. Worth designing explicitly rather than leaving implicit.
  5. Audit record granularity. Logging every encrypt call for a high-throughput volume is noisy; logging only unseal events misses oracle abuse. Probably: unseal and policy-violation events are always logged; per-operation logging is a per-KeyVault policy, off by default.
  6. Key-use quotas. Rate-limit decrypt operations per cap-holder to contain oracle abuse? Nice to have; not clear whether it belongs at the Key interface or at a KeyVault policy.
  7. HSM integration. PKCS#11 is the de facto standard for HSM access. Does capOS grow a Pkcs11KeySource, or does each HSM vendor ship a capability-native driver? The cap-native path is cleaner but depends on vendor cooperation.
  8. Backwards compatibility with stored blobs. SealPolicy, algorithm IDs, and seal blob formats will evolve. Define a versioned envelope around every sealed blob from day one, so rolling upgrades are possible.
  9. Side-channel guarantees per implementation. Document the expectation for each KeyAlgorithm (e.g. “constant-time required for aes*; use the aes crate’s hardware backend on x86_64 and bit-sliced implementation elsewhere”). Without this, the security posture varies silently across builds.
  10. GOST and other jurisdiction-mandated algorithms. The formal-mac-mic-proposal.md carves out a GOST-style track. Adding Kuznyechik, Magma, and Streebog-signed schemes is an additive extension; what matters is that the enums stay forward- compatible so a GOST-capable build does not require a schema fork.

Proposal: Certificates, TLS, and Certificate Transparency

Capability-native abstractions for X.509 certificates, trust stores, chain verification, Certificate Transparency (CT), revocation, pinning, automated issuance (ACME), and the TLS contexts built from all of these.

Implementation Status

The schemas and Phase 1-9 ordering below are design beyond the landed Phase 1 subset: vendored WebPKI roots, capos-tls host verifier logic, and the Certificate / CertificateChain / TrustStore / CertVerifier schema surface. The remaining near-term work is decomposed into a bounded slice chain owned by Certificates / TLS and the Certificates / TLS track in docs/tasks/README.md. The cut lands the lowest-risk real logic first. The Phase 2 client local proof landed on 2026-06-08: a userspace TLS 1.3 client completes one handshake over a userspace-served TcpSocket cap with a vendored embedded-tls state machine while validating the peer chain with capos-tls. The key-management proposal now has the minimal PrivateKey / PublicKey ABI, RAM signing core, and RAM-only KeyVault custody plus a development-only software KeySource for local TLS/ACME proofs, but no production custody source yet. Production/public server-side TLS remains blocked on reviewed custody and a server cert source:

  • Phase 1 deps [DONE 2026-06-03]. vendor rustls-webpki + webpki-roots as no_std+alloc snapshots with provenance: cloud-tls-vendor-rustls-webpki-roots-no-std-provenance.
  • Phase 1 [DONE 2026-06-03]. Certificate / CertificateChain / TrustStore / CertVerifier schema + host-tested verify logic over a RAM-only webpki-roots store: cloud-tls-cert-truststore-certverifier-phase1-host-proof.
  • Phase 2 (client) [DONE 2026-06-08]. One userspace TLS client handshake over the Phase C userspace TcpSocket cap, validating the peer chain with the Phase 1 verifier and a vendored embedded-tls TLS 1.3 state machine: cloud-tls-client-handshake-over-tcpsocket-local-proof.
  • Phase 2 (server consumer) – capOS-terminated TLS for the self-hosted Web UI (the direct-termination successor to the provider-terminated bootstrap below, not the closeout path for the first public proof), blocked additionally on a sealed PrivateKey cap and a server cert source: cloud-tls-self-hosted-webui-terminated-endpoint.
  • Minimal TLS/ACME key custody [DONE for local proofs]. The TLS server key and ACME account key need a PrivateKey / KeyVault / KeySource subset. The minimal PrivateKey / PublicKey ABI and RAM signing proof landed 2026-06-04; RAM KeyVault custody landed 2026-06-05; development-only software KeySource bootstrap landed 2026-06-05.
  • Phase 3 (ACME successor chain) [PARTIAL]. The local ACME account/order core landed on 2026-06-08: capos-tls signs ES256 JWS requests through an AcmeAccount PrivateKey cap, submits a CSR signed by a TLS-purpose key cap, and parses a returned local test certificate chain. Remaining Phase 3 work is scoped http-01 challenge solving, CertificateStore.watch renewal/rotation, and then a public GCE capOS-terminated direct-termination proof. These are successor tasks after the provider-managed first public proof, not replacements for it: cloud-tls-acme-account-order-local-proof [DONE 2026-06-08], cloud-tls-acme-http01-challenge-solver-local-proof, cloud-tls-acme-renewal-certstore-rotation-local-proof, and cloud-gce-public-webui-letsencrypt-direct-termination-proof.

Phases 4-9 (OCSP, CT, pinning, CRL, private CA) remain undecomposed design.

Why a Separate Proposal

Keys and certificates are related but different concerns. Keys are secret material whose contract is “compute with me.” Certificates are public assertions whose contract is “believe this identity, if the chain and CT/revocation evidence pass policy.” The two failure modes (key compromise vs. mis-issuance, revocation vs. renewal, HSM custody vs. CA trust) barely overlap.

Cryptography and Key Management already covers SymmetricKey, PrivateKey, PublicKey, KeySource, and KeyVault. This proposal covers everything on top: certificates, trust anchors, CT logs, OCSP, CRLs, pinning, ACME, and TLS configuration. A TLS server is composed from a PrivateKey cap (from the key proposal) plus the certificate/verification/revocation caps defined here.

Two adjacent proposals draw their own trust boundaries instead of extending this one:

  • OIDC and OAuth2 tokens are not X.509. OIDC and OAuth2 covers short-lived bearer tokens (ID tokens, access tokens, DPoP proofs, client assertions) signed by JWKS-published keys, not by X.509 trust chains. Where an OIDC issuer’s private_key_jwt client assertion or workload-identity federation flow does need an X.509 cert, the signing key is a PrivateKey cap from the key proposal and the cert is a Certificate cap from this one. The token capability objects, JWKS verifier, and DPoP machinery live in the OIDC proposal; this proposal only supplies the verifier when an OIDC flow happens to land on an X.509 binding.
  • SSH host keys are not X.509 certs. SSH Shell Gateway uses raw SSH host-key signatures (SshHostKey.signExchangeHash) and TOFU/authorized-key trust, not WebPKI chains. The host key is a narrow wrapper around a PrivateKey cap from the key proposal, constrained to SSH host-key signing; this proposal’s Certificate, TrustStore, CertVerifier, and ACME flow are not consumed by the SSH transport. SSH and TLS/mTLS are intentional siblings — SSH for raw operator/agent access without a CA, TLS for PKI-integrated services.

Problem

capOS will need certificate and TLS infrastructure for:

  • TLS termination in the web text shell gateway (Boot to Shell).
  • mTLS between services on a multi-host capability graph (Networking). TLS wraps the TcpSocket cap defined there; in Phase A-B that socket state is kernel-resident smoltcp, and TLS sees it through the same cap boundary after Phase C migrates the stack to userspace.
  • WebAuthn attestation statement verification (Boot to Shell).
  • Code signing verification for binaries, boot manifests, update bundles (Storage and Naming Open Question #5).
  • Cloud KMS HTTPS API clients (Cryptography and Key Management CloudKmsKeySource).
  • Attestation report verification chains (Cryptography and Key Management AttestationKeySource).
  • Any outbound HTTPS client invoked from a service.

Without a shared abstraction each consumer invents its own “where do trust anchors live”, its own CT policy (or skips CT silently), its own revocation story (or skips revocation silently), and its own config surface for rustls. That is how the Linux ecosystem ended up with /etc/ssl/certs, NSS, GnuTLS’ own store, OpenSSL’s SSL_CTX, update-ca-certificates, and per-language HTTPS clients with divergent trust policies. capOS is young enough to avoid that.

Design Principle: Certificates Are Typed Capabilities

A certificate in capOS is a Certificate CapObject, not an opaque byte blob flowing between services. Trust evaluation, CT and revocation policy, and TLS configuration are expressed as cap compositions — never as well-known paths (/etc/ssl/certs) or library singletons (rustls::RootCertStore::load_native_certs()).

Consequences mirror the key-cap case:

  • Attenuation by scope. A service that only needs to verify one signer receives a TrustStore cap containing that one anchor, not the full Mozilla root bundle. A service that must not bypass CT receives a CertVerifier whose policy has minScts >= 2; no method on that cap lets the caller lower the bar.
  • Revocation is a cap drop. A compromised anchor is removed from the TrustStore it lives in; holders of a stale restricted view that still trusts it keep trusting it until they pick up the new version. No library’s “just reload the roots” ambient step.
  • Audit is intrinsic. Every verifyChain, every addAnchor, every OCSP query flows through the audit cap. A service that bypasses revocation shows up in the audit log as a service that stopped calling OcspResponder.status.
  • Rotation without restart. A TLS server holds a CertificateStore.watch subscription; when an ACME renewal lands a fresh chain under the server’s handle, the TLS stack swaps chains on the next handshake. No filesystem signaling, no SIGHUP, no “reloaded 0 of 1 certs” log lines.
  • Composition, not configuration. A TlsServerConfig is a cap that encapsulates the key, chain source, stapler, client-auth verifier, and cipher policy. Building a TLS server means acquiring those caps and composing them, not filling in a struct with raw bytes.

Schemas

Certificates and chains

interface Certificate {
    # Raw DER encoding — for logging, CT submission, export.
    der             @0 () -> (encoded :Data);

    # Structured fields — callers should prefer these over re-parsing.
    subject         @1 () -> (name :DistinguishedName);
    issuer          @2 () -> (name :DistinguishedName);
    serial          @3 () -> (bytes :Data);
    notBefore       @4 () -> (epochSeconds :Int64);
    notAfter        @5 () -> (epochSeconds :Int64);
    subjectAltNames @6 () -> (names :List(GeneralName));

    # Public key as a cap — callers verify signatures through this.
    publicKey       @7 () -> (pk :PublicKey);

    # Extensions the platform cares about. Returning typed views
    # forces the implementation to parse once.
    keyUsage        @8 () -> (usage :KeyUsageFlags);
    extendedKeyUsage @9 () -> (ekus :List(ExtendedKeyUsage));
    basicConstraints @10 () -> (ca :Bool, pathLenConstraint :Int32);
    nameConstraints  @11 () -> (constraints :NameConstraints);

    # Embedded SCTs (RFC 6962 §3.3). Callers that only allow
    # CT-qualified certs filter on this.
    embeddedScts    @12 () -> (scts :List(SignedCertificateTimestamp));

    # Must-staple marker (RFC 7633).
    mustStaple      @13 () -> (required :Bool);

    # Fingerprint used for pinning, logging, and human display.
    fingerprint     @14 (hash :HashAlgorithm) -> (digest :Data);

    info            @15 () -> (kind :CertificateKind,
                               algorithm :AsymmetricAlgorithm);
}

interface CertificateChain {
    # Leaf first, root (or closest-to-root) last. Length-one chains are
    # permitted (self-signed leaf).
    certificates    @0 () -> (chain :List(Certificate));
    leaf            @1 () -> (cert :Certificate);

    # Convenience: verify this chain against a trust store using a
    # given verifier. Shortcuts the CertVerifier flow for simple cases.
    verify          @2 (against :TrustStore,
                        verifier :CertVerifier,
                        atEpochSeconds :Int64,
                        hostname :Text)
                    -> (outcome :VerificationOutcome);
}

enum CertificateKind {
    endEntity    @0;
    intermediate @1;
    trustAnchor  @2;
    crossSigned  @3;
}

GeneralName, DistinguishedName, KeyUsageFlags, ExtendedKeyUsage, and NameConstraints are plain struct/enum definitions mirroring RFC 5280 (omitted here for brevity).

Trust stores

interface TrustStore {
    # List anchors as WebPKI trust-anchor records. Mozilla/WebPKI roots may not
    # be representable as full Certificate caps.
    anchors         @0 () -> (anchors :List(TrustAnchorInfo));

    # Attenuate to a subset (e.g. only WebPKI roots, only corporate
    # CAs, only a specific CA). The resulting cap is a fresh
    # TrustStore that no longer references anchors outside the filter.
    restrict        @1 (filter :TrustFilter) -> (subset :TrustStore);

    # Add a trusted anchor. Only holders with write authority succeed;
    # read-only TrustStore caps reject this method.
    addAnchor       @2 (cert :Certificate, pin :AnchorPin) -> ();

    # Remove an anchor. Matches either fingerprint or subject DN.
    removeAnchor    @3 (selector :AnchorSelector) -> ();

    # Monotonic version bumped on every mutation; consumers cache by
    # version to avoid revalidating unchanged trust chains.
    version         @4 () -> (n :UInt64);
}

struct TrustFilter {
    purposes        @0 :List(CertPurpose);     # Only anchors usable for these
    fingerprints    @1 :List(Data);            # Allow-list by SHA-256
    subjects        @2 :List(Data);            # Allow-list by subject DN
    excludeFingerprints @3 :List(Data);        # Deny-list
}

struct AnchorPin {
    spkiHash        @0 :Data;                  # SHA-256 of SPKI
    hashAlgorithm   @1 :HashAlgorithm;
}

struct AnchorSelector {
    union {
        fingerprint @0 :Data;
        subject     @1 :DistinguishedName;
    }
}

enum CertPurpose {
    tlsServerAuth   @0;
    tlsClientAuth   @1;
    codeSigning     @2;
    emailSmime      @3;
    clientIdentity  @4;
    ctLog           @5;   # TrustStore of CT log public keys
    ocspSigning     @6;
    webauthnRoot    @7;   # FIDO metadata / attestation roots
}

Verifier

interface CertVerifier {
    verifyChain     @0 (chain :CertificateChain,
                        trust :TrustStore,
                        purpose :CertPurpose,
                        atEpochSeconds :Int64,
                        hostname :Text)
                    -> (outcome :VerificationOutcome);

    # Thin wrapper over a single signature check against a cert's
    # public key. Useful for WebAuthn attestation, signed manifests,
    # signed audit records.
    verifySignature @1 (cert :Certificate,
                        message :Data,
                        signature :Data,
                        scheme :SignatureScheme)
                    -> (ok :Bool);

    policy          @2 () -> (policy :VerificationPolicy);
}

struct VerificationPolicy {
    minScts                 @0 :UInt8;
    ctLogs                  @1 :TrustStore;  # which logs count
    allowedAlgorithms       @2 :List(AsymmetricAlgorithm);
    allowedSignatureSchemes @3 :List(SignatureScheme);
    requireOcsp             @4 :Bool;
    maxChainLength          @5 :UInt8;
    permitNameConstraints   @6 :Bool;
    clockSkewSeconds        @7 :UInt32;
    # When set, certificates not carrying the must-staple extension
    # are still required to deliver a stapled OCSP response.
    staplingRequired        @8 :Bool;
}

struct VerificationOutcome {
    union {
        valid   @0 :ValidChain;
        invalid @1 :VerificationFailure;
    }
}

struct ValidChain {
    anchor      @0 :TrustAnchorInfo;
    sctCount    @1 :UInt8;
    ocspStatus  @2 :OcspStatus;
    notAfter    @3 :Int64;     # min notAfter across the verified path
}

struct VerificationFailure {
    reason      @0 :FailureReason;
    detail      @1 :Text;
}

enum FailureReason {
    unknownAnchor           @0;
    expired                 @1;
    notYetValid             @2;
    signatureMismatch       @3;
    nameMismatch            @4;
    insufficientScts        @5;
    revoked                 @6;
    ocspUnavailable         @7;
    weakAlgorithm           @8;
    policyViolation         @9;
    badEku                  @10;
    chainTooLong            @11;
    nameConstraintViolation @12;
    mustStapleMissing       @13;
    pinMismatch             @14;
}

Default VerificationPolicy presets:

  • webPkiStrictminScts = 2, requireOcsp = true, allowed algorithms and schemes drawn from Mozilla’s “modern” profile.
  • webPkiLenientminScts = 0, requireOcsp = false. Used by low-value clients where misrouting is acceptable.
  • privateMtlsminScts = 0, requireOcsp = true, maxChainLength = 3. Used between capOS services holding CA-issued identity certs.
  • codeSigningminScts = 0, long notAfter tolerances, narrow allowed EKU set.

Certificate Transparency

capOS treats CT as a first-class verification input, not an add-on. Consumers that need WebPKI trust configure a CertVerifier with minScts >= 2 and a ctLogs trust store; verification fails closed if the leaf lacks that many valid SCTs signed by logs the policy accepts.

struct SignedCertificateTimestamp {
    logId              @0 :Data;    # SHA-256 of the log's public key
    timestamp          @1 :UInt64;  # ms since epoch
    extensions         @2 :Data;
    signature          @3 :Data;
    hashAlgorithm      @4 :HashAlgorithm;
    signatureAlgorithm @5 :SignatureScheme;
    origin             @6 :SctOrigin;
}

enum SctOrigin {
    embedded      @0;   # X.509 extension (RFC 6962 §3.3)
    ocspStapled   @1;   # OCSP response extension
    tlsExtension  @2;   # TLS handshake extension
}

interface CtLog {
    # Submission — used by ACME responders and capOS-internal CAs to
    # obtain SCTs before serving newly issued certs.
    addChain          @0 (chain :CertificateChain)
                      -> (sct :SignedCertificateTimestamp);
    addPreChain       @1 (precert :CertificateChain)
                      -> (sct :SignedCertificateTimestamp);

    # Monitoring — STH, entries, consistency proofs.
    signedTreeHead    @2 () -> (sth :SignedTreeHead);
    entries           @3 (start :UInt64, count :UInt32)
                      -> (entries :List(LogEntry));
    consistencyProof  @4 (first :UInt64, second :UInt64)
                      -> (proof :List(Data));

    info              @5 () -> (name :Text,
                                publicKey :PublicKey,
                                url :Text);
}

interface CtMonitor {
    # Watch for certificates issued under a subject-name pattern (for
    # phishing / mis-issuance detection). Events flow to the audit cap.
    watchSubject      @0 (pattern :Text) -> (subscription :CtSubscription);
    listWatched       @1 () -> (subscriptions :List(CtSubscription));
}

interface CtSubscription {
    events            @0 () -> (events :List(CtEvent));  # since last call
    cancel            @1 () -> ();
}

struct SignedTreeHead {
    treeSize       @0 :UInt64;
    timestamp      @1 :UInt64;
    rootHash       @2 :Data;
    signature      @3 :Data;
}

struct LogEntry {
    index          @0 :UInt64;
    timestamp      @1 :UInt64;
    entryType      @2 :CtEntryType;
    certificate    @3 :Data;   # ASN.1 TimestampedEntry payload
}

enum CtEntryType {
    x509Entry      @0;
    precertEntry   @1;
}

struct CtEvent {
    union {
        observed   @0 :CtObservation;
        error      @1 :CtWatchError;
    }
}

struct CtObservation {
    log          @0 :Text;           # log name or URL
    index        @1 :UInt64;
    certificate  @2 :Certificate;
    matched      @3 :Text;           # matched pattern
}

CT integration depends on networking and audit being available. A capOS build without networking falls back to minScts = 0 and skips monitoring. The CtMonitor service is optional — its absence means capOS does not detect mis-issuance against its own domains but does not affect leaf verification, which uses only embeddedScts and any SCTs delivered in the TLS handshake.

The log trust store (the ctLogs field of VerificationPolicy) is itself a TrustStore cap, populated from Chrome’s CT log list with the same bundling and signing approach used for WebPKI roots. CT logs are rotated regularly; the log list is the first place a deployment without fresh updates starts failing in a visible way, which is the intended failure mode.

Revocation

interface OcspResponder {
    # Query an OCSP responder for status. `issuer` supplies the cert
    # used to verify the responder signature chain back to a trust
    # anchor.
    status    @0 (cert :Certificate,
                  issuer :Certificate,
                  atEpochSeconds :Int64)
              -> (response :OcspResponse);
}

interface OcspStapler {
    # TLS server side: fetch and cache an OCSP response for the
    # server's own certificate. The TLS stack staples the cached
    # response into every handshake.
    currentResponse  @0 () -> (response :OcspResponse);
    refresh          @1 () -> ();
    setCertificate   @2 (chain :CertificateChain,
                         responder :OcspResponder) -> ();
}

interface CrlStore {
    # Look up a CRL for a given issuer DN; fallback when OCSP is
    # unavailable. Discouraged; CRLs do not scale.
    crlFor    @0 (issuer :DistinguishedName) -> (crl :Data);
    contains  @1 (issuer :DistinguishedName, serial :Data)
              -> (revoked :Bool);
}

struct OcspResponse {
    der         @0 :Data;        # RFC 6960 DER-encoded response
    status      @1 :OcspStatus;
    thisUpdate  @2 :Int64;
    nextUpdate  @3 :Int64;
}

enum OcspStatus {
    good                 @0;
    revoked              @1;
    unknown              @2;
    stapledAbsent        @3;   # handshake carried no stapled response
    responderUnreachable @4;
}

Policy choices capOS bakes into the defaults:

  • VerificationPolicy.requireOcsp = true means OCSP-unreachable is a hard verification failure. Default for CertPurpose.tlsClientAuth on services facing untrusted networks; soft-fail otherwise.
  • A certificate carrying the id-pe-tlsfeature must-staple extension fails verification if no stapled response is present, regardless of requireOcsp.
  • VerificationPolicy.staplingRequired = true extends must-staple behavior to all certs checked under that verifier, not only the ones that set the extension.
  • CRL support exists for legacy compatibility and explicit code-signing fallback. Services that can choose prefer OCSP stapling, which pulls revocation latency to handshake time without leaking the client’s identity to the responder.

Pinning

interface PinSet {
    # A pin set is a list of (SPKI-hash, algorithm) pairs. Verification
    # succeeds only if at least one cert in the chain has an SPKI hash
    # matching a pin.
    pins        @0 () -> (entries :List(Pin));
    enforce     @1 (chain :CertificateChain) -> (outcome :VerificationOutcome);
    addPin      @2 (pin :Pin) -> ();
    removePin   @3 (pin :Pin) -> ();
    info        @4 () -> (mode :PinMode, expires :Int64);
}

struct Pin {
    spkiHash        @0 :Data;
    hashAlgorithm   @1 :HashAlgorithm;
}

enum PinMode {
    enforce     @0;   # fail closed on mismatch
    reportOnly  @1;   # succeed; emit audit event
}

A PinSet restricts an already-trusted chain; it does not add trust. Composition is intersection: trust + CT + OCSP + pins must all pass for verification to succeed. Pin sets are per-consumer; the web shell gateway’s client-side ACME challenge fetches do not share a pin set with the fleet mTLS layer.

Issuance and renewal

ACME is the only supported issuance protocol for v1. Challenge solvers are caps so the ACME client has no ambient authority over DNS or the HTTP server. Self-signing and internal-CA use cases are covered by a separate CertificateAuthority cap (future work, see Open Questions).

interface AcmeClient {
    # Register or rediscover an account using an account key cap.
    register    @0 (accountKey :PrivateKey, contact :List(Text))
                -> (account :AcmeAccount);

    # Order a certificate for a list of identifiers.
    order       @1 (account :AcmeAccount,
                    identifiers :List(AcmeIdentifier),
                    certKey :PrivateKey,
                    solver :ChallengeSolver)
                -> (chain :CertificateChain);

    # Renew a previously-issued chain when notAfter is near.
    renew       @2 (chain :CertificateChain,
                    certKey :PrivateKey,
                    solver :ChallengeSolver)
                -> (chain :CertificateChain);

    # Revoke a cert.
    revoke      @3 (cert :Certificate, reason :RevocationReason) -> ();

    directory   @4 () -> (url :Text, meta :AcmeDirectoryMeta);
}

interface ChallengeSolver {
    # Publish a challenge token and wait for the ACME server to
    # validate. The solver owns whatever authority is required —
    # DNS record write, HTTP server handler registration, TLS-ALPN
    # responder slot — and nothing more.
    solve       @0 (challenge :AcmeChallenge) -> (ok :Bool);
    cleanup     @1 (challenge :AcmeChallenge) -> ();
    supports    @2 () -> (types :List(AcmeChallengeType));
}

enum AcmeChallengeType {
    http01      @0;
    dns01       @1;
    tlsAlpn01   @2;
}

struct AcmeIdentifier {
    type        @0 :Text;    # "dns", "ip", ...
    value       @1 :Text;
}

interface CertificateStore {
    # Store a certificate chain under a stable handle; used by TLS
    # servers to retrieve the current chain on handshake.
    put         @0 (handle :Text, chain :CertificateChain) -> ();
    get         @1 (handle :Text) -> (chain :CertificateChain);
    list        @2 () -> (handles :List(Text));
    delete      @3 (handle :Text) -> ();
    watch       @4 (handle :Text) -> (subscription :CertSubscription);
}

interface CertSubscription {
    events      @0 () -> (events :List(CertRotationEvent));
    cancel      @1 () -> ();
}

struct CertRotationEvent {
    handle      @0 :Text;
    newChain    @1 :CertificateChain;
    rotatedAt   @2 :Int64;
}

The CertificateStore.watch subscription is the point at which an ACME renewal service notifies a TLS server to rotate its chain. The TLS server does not poll files, no filesystem signaling is involved, and rotation is atomic from a handshake’s perspective.

TLS configuration

interface TlsServerConfig {
    key             @0 () -> (k :PrivateKey);
    chainSource     @1 () -> (store :CertificateStore, handle :Text);
    stapler         @2 () -> (s :OcspStapler);

    # Optional: require client auth against these verifier + trust
    # caps. If unset, the server accepts any client or no client.
    clientVerifier  @3 () -> (v :CertVerifier, trust :TrustStore);

    alpn            @4 () -> (protocols :List(Text));
    minVersion      @5 () -> (v :TlsVersion);
    cipherPolicy    @6 () -> (policy :CipherPolicy);
}

interface TlsClientConfig {
    verifier        @0 () -> (v :CertVerifier);
    trust           @1 () -> (t :TrustStore);
    pins            @2 () -> (p :PinSet);     # null for no pinning
    clientAuth      @3 () -> (k :PrivateKey, chain :CertificateChain);
    alpn            @4 () -> (protocols :List(Text));
    minVersion      @5 () -> (v :TlsVersion);
    serverNameOverride @6 () -> (host :Text);
}

enum TlsVersion {
    tls12           @0;
    tls13           @1;
}

enum CipherPolicy {
    modern          @0;  # TLS 1.3 + AEAD only; Mozilla "modern"
    intermediate    @1;  # TLS 1.2 + 1.3; Mozilla "intermediate"
    legacy          @2;  # Explicit opt-in for ancient peers
}

The TLS stack consumes a TlsServerConfig or TlsClientConfig cap plus a raw TcpSocket and produces a TlsSocket. The first landed local client proof uses embedded-tls directly over a userspace-served TcpSocket; the broader config-cap service surface remains the Phase 2 TLS-service design. The TlsSocket draft interface lives in the “TLS Layering” section of Networking; this proposal only defines the configuration surface. While TcpSocket state remains kernel-resident through Phase A-B of the networking proposal, the TLS stack itself is a userspace consumer of that cap and does not move into the kernel — the certificate parser, path builder, and TLS state machine all run in the userspace TLS service.

Trust Anchor Bootstrap

The v1 trust anchor bundle is Mozilla’s NSS store, synthesized from the webpki-roots crate data embedded in the boot manifest. Rationale: the bundle is well-curated, auditable (Mozilla’s CA Certificate Program publishes policy and meeting minutes), and already the de facto default for every Rust TLS stack. capOS does not invent a new root program.

CT log lists follow the same pattern, drawn from Chrome’s published CT log list.

Update policy:

  • Root-store bundles are versioned and signed. addAnchor on the system TrustStore is restricted to the trust-admin service, which accepts bundles whose signature chains to a build-time key embedded in the boot manifest.
  • Deployment overrides (corporate CAs, explicit Mozilla-root removal) compose with the Mozilla bundle via TrustStore.restrict and addAnchor on an override store. Overrides are themselves signed and manifest-addressable.
  • Replacement ships as a manifest update (see Storage and Naming Open Question #5 on manifest signing).

The manifest-embedded root store has no background network update path by design. A compromised root requires a new signed manifest, which requires the measured-boot chain. Root updates are a deliberate operational event, not a silent refresh. This is a deliberate trade-off against the Linux-style ca-certificates package that updates on every apt run.

Bootstrap TLS for the First Public GCE Web UI

The schemas above are no longer entirely pre-implementation design: the Phase 1 verifier, Phase 2 client handshake over a userspace TcpSocket, local key-custody precursors, and the local ACME account/order/finalize core have landed. The server-side TlsServerConfig / TlsSocket consumer, scoped http-01, CertificateStore.watch renewal, production key custody, and later CT/OCSP/pinning surfaces remain future or blocked work. The first time the self-served capOS Web UI (remote-session-web-ui) is exposed to a public operator browser on GCE, capOS therefore still does not terminate TLS itself. The reviewed first ingress terminates HTTPS at the GCP external load balancer’s Google front end against a provider-managed certificate; capOS serves only plain HTTP/1.1 on a backend port reachable solely from the load balancer and health-check source ranges. The full posture (firewall scope, browser session rules, evidence, teardown) is recorded in the “Public Web UI Ingress Policy” section of Cloud Deployment and the on-hold public Web UI ingress task; this note records only where TLS terminates and who holds the key.

Bootstrap consequences specific to this proposal:

  • No capOS private-key custody in the first proof. The TLS private key stays on the provider side. No PrivateKey cap, KeyVault, or KeySource from Cryptography and Key Management is consumed for the first public Web UI endpoint, and no key material is written into the disk image, manifest, or evidence directory.
  • No capability-native verification on the public hop. Because the Google front end performs TLS, the Certificate, TrustStore, CertVerifier, OcspStapler, and ACME flows defined here are not exercised by the first public Web UI proof. Provider-managed certificate lifecycle (issuance, renewal, revocation) is the provider’s, not capOS’s.
  • Successor path is the direct-termination shape. When this proposal’s TlsServerConfig plus an AcmeClient / ChallengeSolver (Phases 2-3) ship over the userspace TLS stack, a direct-external-IP, capOS-terminated ingress becomes a separately reviewed second option. At that point the certificate is a CertificateChain cap, the key is a sealed PrivateKey cap, CertificateStore.watch drives rotation, and the load-balancer-terminated path becomes one deployment choice rather than the only buildable one. The bootstrap step does not foreclose the capability-native model; it precedes it.
  • Let’s Encrypt is successor-only until the remaining prerequisites land. The Certificates / TLS backlog now names the landed local key-custody precursor (PrivateKey / KeyVault / development KeySource), landed TLS client over the userspace TcpSocket, and landed local ACME account/order/finalize core. Remaining successor prerequisites are the capOS-terminated Web UI TLS endpoint, scoped http-01 solver, CertificateStore.watch renewal and rotation, and then the on-hold Let’s Encrypt direct-termination GCE proof. Local ACME proofs use a local Let’s Encrypt-compatible directory. A real GCE or Let’s Encrypt staging/production run additionally needs a controlled public DNS name and explicit billable/public-ingress and CA authorization. Raw key material must not be written to manifests, images, logs, or evidence.

This mirrors the trust-anchor bootstrap above: capOS ships a pragmatic, reviewed interim posture (here, provider-terminated TLS) and migrates to the capability-native model as the implementing subsystems land, rather than blocking the first public proof on the full stack.

Consumers

ConsumerUses
Web text shell gatewayTlsServerConfig + OcspStapler; cert from AcmeClient
Inter-service mTLSTlsServerConfig + TlsClientConfig with private-PKI TrustStore
Outbound HTTPS clients (KMS, IMDS)TlsClientConfig with WebPKI-strict verifier
WebAuthn attestation verificationCertVerifier.verifySignature with FIDO MDS TrustStore
Code signing verificationCertVerifier with codeSigning trust store + OCSP
Signed manifest verificationCertVerifier.verifySignature + pinned build-time root
CT mis-issuance monitoringCtMonitor.watchSubject on capOS-owned domains

Threat Model

Specific to this subsystem, independent of the crypto/key threat model:

  1. Bogus CA in the trust store. Compromise of any CA in the trust store compromises every cert the verifier accepts. Mitigations: restrict the trust store as narrowly as each consumer permits (private-PKI services use a private-PKI-only store, not WebPKI); require CT for tlsServerAuth; enable CtMonitor for capOS-owned subject patterns.
  2. CT log compromise or collusion. A log signs a non-existent certificate. Mitigations: require SCTs from multiple independent logs (minScts >= 2); enforce log list freshness (policy rejects SCTs from retired or disqualified logs); monitor STH inclusion proofs for capOS-issued certs.
  3. OCSP responder compromise. The responder signs “good” for a revoked cert. Mitigations: OCSP response signature chains back to a trust anchor via the OCSP-signing EKU; short nextUpdate windows limit stale “good” responses; fail-closed when requireOcsp is set.
  4. Stapling stripping. A MITM strips OCSP staples between a compliant server and the client. Mitigations: must-staple extension on the server cert forces closed-fail; client-side staplingRequired policy extends this to all certs.
  5. Name-constraint bypass. An intermediate CA issues for names outside its constrained scope. Mitigations: permitNameConstraints always on; verifier enforces name constraints before reporting success.
  6. Pin brittleness. A pin prevents legitimate rotation, locking out users. Mitigations: short pin expiries, reportOnly mode for rollout, pins bound to SPKI (not to full certificates).
  7. ACME challenge hijack. A challenge solver with excessive authority forges validation tokens. Mitigations: each solver is a scoped cap (one DNS zone, one HTTP path prefix, one ALPN slot); solvers are per-consumer, not shared.
  8. Revocation denial-of-service. An attacker saturates the OCSP responder, forcing soft-fail everywhere. Mitigations: OCSP stapling (server-side caching takes the responder off the hot path); CRL fallback under deployment policy only.
  9. Clock skew attacks. A client with a wrong clock accepts expired certs or rejects valid ones. Mitigations: clockSkewSeconds has a tight default; consumers requiring hard-fail use an attested time source (Cloud Metadata).

Phases

Phases follow the consumers that need this infrastructure.

Phase 1 — Certificate, CertificateChain, TrustStore, CertVerifier

  • Add the schemas above to schema/capos.capnp.
  • Implement a RAM-only trust store seeded from webpki-roots.
  • Implement a CertVerifier using rustls-webpki for path building and signature verification.
  • Host tests: chain verification against known-good and known-bad samples, name constraints, algorithm gating.

Phase 2 — TLS server and client configs

  • Add TlsServerConfig and TlsClientConfig schemas.
  • Wire a userspace TLS state machine into the networking stack as a TlsSocket over TcpSocket; defined in Networking.
  • CertificateStore with in-memory backing for the web shell gateway.

Phase 3 — ACME client and challenge solvers

  • AcmeClient core speaking the local RFC 8555 account/order/finalize flow has landed in capos-tls; served capability wiring and public CA transport remain future.
  • ChallengeSolver implementations for http-01 (against the web shell gateway’s HTTP listener) and tls-alpn-01. dns-01 follows once a DNS cap exists.
  • CertificateStore.watch subscription drives TLS rotation without gateway restart.

Phase 4 — OCSP stapling

  • OcspResponder + OcspStapler services.
  • Must-staple enforcement in CertVerifier.
  • Cached stapled responses refresh in the background.

Phase 5 — Certificate Transparency (submission + verification)

  • SCT verification in CertVerifier (both embedded and TLS-extension SCTs).
  • CtLog client for submission; ACME flows submit precertificates to required logs before handing the cert to the caller.
  • Chrome CT log list bundled and signed like the WebPKI bundle.

Phase 6 — CT monitoring

  • CtMonitor service with watchSubject subscriptions.
  • Observations flow to the audit cap (System Monitoring).
  • Proof verification: STH signatures, inclusion proofs for capOS-issued certs, consistency proofs across STHs.

Phase 7 — Pinning

  • PinSet service with enforce and report-only modes.
  • Per-consumer pin policy plumbing.
  • Audit records on mismatch.

Phase 8 — CRL fallback and legacy compat

  • CrlStore implementation for code-signing flows that require CRL.
  • Policy knob to enable CRL fallback for OCSP-unreachable cases.

Phase 9 — Private CA

  • CertificateAuthority cap for capOS-internal issuance (mTLS fleet bootstrapping without an external ACME dependency).
  • CA keys live in KeyVault with strict seal policy.
  • Internal CT log (optional) for mis-issuance detection within a private fleet.

Relationship to Other Proposals

  • Cryptography and Key Management — supplies the key primitives this proposal consumes. Its minimal PrivateKey / PublicKey ABI, RAM signing core, RAM-only KeyVault handle custody, and development-only software KeySource bootstrap exist for the local TLS/ACME precursor. Persistence and production custody remain future. A TLS server’s key cap, an ACME account key, and an internal CA signing key all live in a KeyVault sealed under a KeySource (typical choices: Tpm2KeySource for fleet mTLS identities, PasskeyPrfKeySource or PassphraseKeySource for operator client-auth, CloudKmsKeySource for cloud-anchored CAs, development-only software sources for local ACME accounts). TLS certificate keys and ACME account JWS keys remain purpose-separated; the key proposal names that split as KeyPurpose.tls and KeyPurpose.acmeAccount.
  • Networking — defines the TcpSocket this proposal wraps and the draft TlsSocket interface that consumes TlsServerConfig / TlsClientConfig. In the proposal’s Phase A-B the socket state is kernel-resident smoltcp; the TLS stack consumes that cap from userspace and does not move into the kernel even before Phase C. mTLS between services uses this proposal’s verifier and trust store on top of that same cap.
  • OIDC and OAuth2 — separate trust model (JWKS-signed bearer tokens, not X.509 chains). The two proposals meet only at the corners where OIDC flows do bind to an X.509 cert: private_key_jwt client assertions and tls_client_auth/self_signed_tls_client_auth OAuth2 client authentication consume a PrivateKey from the key proposal plus a Certificate / CertificateChain from this one; workload-identity federation (RFC 8693) and outbound HTTPS to IdP/JWKS endpoints consume a TlsClientConfig with a webPkiStrict verifier. Inbound bearer-token verification stays in the OIDC proposal.
  • SSH Shell Gateway — explicitly a non-consumer. SSH uses raw host-key signatures and TOFU/authorized-key trust, not WebPKI; the host key wraps a PrivateKey from the key proposal directly, not a Certificate from this one. SSH and TLS/mTLS coexist as the two operator-facing remote-shell paths: SSH for CA-free operator/agent access, TLS/mTLS (the web text shell gateway plus future Telnet-over-TLS paths) for PKI-integrated environments.
  • Boot to Shell — web text shell gateway consumes TlsServerConfig; ACME via this proposal provides the cert. WebAuthn attestation verification uses CertVerifier.verifySignature.
  • Cloud MetadataInstanceIdentity is often expressed as a signed JWT or X.509 certificate; verifying attestation statements uses CertVerifier.
  • Storage and Naming — Open Question #5 (manifest trust, secure boot) is the source of the build-time key that signs root-store and CT-log bundles.
  • System Monitoring — every verifyChain, addAnchor, OCSP query, CT observation, and pin mismatch flows through the audit cap.
  • Security and Verification — the certificate parser, path builder, and policy engine are top targets for fuzzing and property testing. The landed client path uses embedded-tls plus rustls-webpki-backed capos-tls verification; capOS-specific policy glue (CT, stapling, pinning) gets its own tier of tooling.
  • User Identity and Policy — client-auth certs and per-user mTLS identity consume TlsClientConfig with a per-session PrivateKey cap.

Open Questions

  1. Canonical default policy. Should webPkiStrict require `minScts

    = 2` from day one, or is that too aggressive before CT log list curation ships? Chrome requires 2; Apple requires varying counts by cert lifetime. Probably match Chrome initially and revisit.

  2. CRL scope. Is CRL support worth the footprint at all, or should capOS ship OCSP-only and refuse to verify against CRL-only CAs? Leaning “CRL for code signing only”, not for TLS.
  3. Private CA surface. A CertificateAuthority cap with issue, revoke, and listIssued methods is straightforward, but the policy for issuance (SAN constraints, lifetime caps) deserves its own schema pass. Deferred to Phase 9.
  4. Trust-store delta signing. Signing every bundle replacement is expensive. A delta format (add/remove anchors with signed manifest patches) would be lighter; worth it only once bundle churn becomes a real operational cost.
  5. OCSP nonce support. Nonces prevent replay but most responders do not honor them. Ship without and revisit if a deployment needs replay-resistance.
  6. webpki-roots crate churn. The crate publishes a new version on Mozilla NSS changes, which is frequent. capOS needs a clean bump story — probably “new release triggers a trust-store bundle rebuild”, automated in CI.
  7. Stapling cache persistence. Must the OcspStapler cache survive reboot? Surviving reboot avoids a refresh storm at startup but risks serving very stale responses. Probably: cache is per-boot, with a short pre-refresh window before nextUpdate.
  8. Client-cert private key reuse. If a client uses one mTLS identity across many outbound connections, does each TlsClientConfig hold its own PrivateKey cap (wasteful) or share one (safe, since the cap’s sign method is the only surface)? Probably share by default; make duplication explicit if needed.
  9. Integration with CredentialStore. Some WebAuthn authenticators return attestation certs that must be chain-verified against FIDO MDS. The verification uses CertVerifier; the MDS trust store is a TrustStore maintained separately from the WebPKI bundle. How does MDS update cadence fit the no-background-update policy? Probably: MDS updates ride manifest updates, same as root bundles.
  10. GOST trust chains. The Formal MAC/MIC GOST track implies GOST-signed certificate chains. The CertVerifier algorithm enum is already open-ended; the work is algorithm implementation, not schema evolution.

Proposal: OIDC and OAuth2

Capability-native abstractions for OpenID Connect identity providers, OAuth 2.0 clients, issued tokens, workload identity federation, and the authentication/authorization flows that every modern cloud and enterprise deployment depends on.

Why a Separate Proposal

OIDC and OAuth2 are related but distinct from certificates and keys. Keys (Cryptography and Key Management) are secret material. Certificates (Certificates and TLS) are public assertions of identity binding validated against a PKI trust store. OIDC/OAuth2 is a delegated authority protocol family: tokens are short-lived bearer credentials or proof-of-possession handles issued by an identity provider after authenticating a subject, scoped by a set of permissions, and consumed by a relying party or resource server.

The failure modes barely overlap. Key compromise vs. IdP compromise vs. CA mis-issuance require different detection and recovery stories. Revoking a TLS certificate, revoking an access token, and revoking a KEK are three different operations with three different operational tempos (manifest update / IdP admin action / KMS grant edit).

Putting this in a separate proposal also matches the cross-cutting nature of the feature: OIDC/OAuth2 shows up in login (Boot to Shell), session state (User Identity and Policy), key unlock (Volume Encryption), cloud KMS access (Cryptography and Key Management CloudKmsKeySource), and service-to-service authentication (Networking). Threading it through every touchpoint without a shared definition would be a silo-per-consumer repeat of the Linux ssh-agent/gpg-agent/keyctl story.

Problem

capOS needs to:

  • Accept federated authentication for console and web-terminal login (corporate IdP, Google, GitHub, Okta, Azure AD, Keycloak, Dex) so sessions do not depend on capOS storing or managing primary user credentials.
  • Run as a cloud workload without baked-in long-lived IAM credentials, using the modern workload identity federation pattern (RFC 8693 token exchange) to obtain short-lived cloud provider credentials from an attested instance identity or a local keypair.
  • Authenticate service-to-service calls using OAuth2 client credentials, private_key_jwt, DPoP, or mTLS, chosen by policy rather than hard-wired per consumer.
  • Consume OAuth2 access tokens from external clients (an HTTP API running under capOS verifies bearer tokens against an issuer’s JWKS) without every service writing its own JWT parser.
  • Expose scopes and OIDC claims as policy input to AuthorityBroker / PolicyEngine without letting them act as ambient authority.
  • Map external subjects to local principals, accounts, sessions, policy profiles, and resource profiles through explicit admission configuration rather than treating provider claims as local roles.

Without a shared abstraction each consumer would invent its own JWT parser, JWKS cache, issuer list, discovery-document fetcher, refresh scheduler, and token storage. That is roughly the OAuth/OIDC mess already visible in most operating systems and app stacks.

Scope

In scope:

  • Relying-party (RP) role for OIDC. capOS is the RP; the IdP is external.
  • Client role for OAuth2. capOS acts as confidential client, public client (PKCE), or federated workload client.
  • Resource-server role for OAuth2. A capOS service can validate inbound bearer tokens.
  • Token capability objects for ID tokens, access tokens, refresh tokens, DPoP proofs, and client assertions.
  • Integration with SessionManager, CredentialStore, AuthorityBroker, CloudKmsKeySource, EncryptedBlockDevice unlock flows.

Out of scope for v1:

  • OAuth2 authorization server / OIDC provider role. capOS does not issue tokens to third parties in the first iteration. A LocalIdentityProvider that issues tokens to other capOS services is possible later work; it sits on top of the same primitives.
  • SAML. The modern direction is OIDC; a deployment that needs SAML can add a second SessionManager.login adapter without reshaping the model.
  • OAuth 1.0a. Dead.
  • UMA 2.0. Out until a concrete consumer appears.
  • CIBA (Client-Initiated Backchannel Authentication). Useful for step-up on mobile devices; revisit once the web shell gateway ships.

Design Principle: Tokens Are Typed Capabilities

In the OAuth/JWT world a token is a byte string. Possession equals authority. Every library re-parses, re-validates, and re-caches the token; every log line risks leaking it; every service that needs to forward it must hand it over unattenuated. That is the same architectural failure mode as “a key is a byte string” — the protection mechanism (TLS in transit, DPoP, audience binding) is orthogonal to the system’s main abstraction.

In capOS, a token is a capability object. Holding an AccessToken cap means “you may present this token for one outbound request, read a bounded subset of its claims, or exchange it for a more specific token.” The raw token bytes live in the address space of the OAuth/OIDC service; callers reach them by invoking typed methods.

Consequences mirror the key-cap case:

  • Attenuation by scope. A caller that only needs to read sub receives a TokenClaims facet that does not expose the raw JWT. A caller that only needs to call one resource server receives a BoundToken that rejects use against other audiences.
  • Revocation is a cap drop plus server-side revocation. Dropping the cap prevents that holder from using the token; revoking the token at the IdP prevents any holder from using it. Both paths exist; neither requires a kernel mechanism.
  • Audit is intrinsic. Every token present, every refresh, every token exchange flows through the audit cap. Bearer-token leakage to logs becomes harder because the raw string is never returned.
  • Composition, not configuration. An OAuth client is a cap that encapsulates client_id, client authentication method, allowed grants, scopes, and target IdP. Building one means composing caps, not stringifying URLs into config files.
  • Per-consumer issuance by default. When a service needs a token for a downstream call, it asks its OAuthClient cap; the OAuth service issues a per-consumer down-scoped token. Transfer between processes is possible but explicit.

This also gives capOS a natural story for the agent shell’s “agent never sees secrets” rule: the agent holds an ApprovalGrant cap that internally holds the access token; the agent invokes the wrapped resource server cap; the token never appears as data in the model’s prompt window.

Schemas

Identity provider

interface OidcIdentityProvider {
    # The stable issuer URL (RFC 8414 / OIDC Discovery).
    issuer            @0 () -> (url :Text);

    # Discovery document (cached; refreshed on a schedule or on
    # signature-verification failure). Returning the parsed metadata,
    # not raw JSON, forces the implementation to validate it once.
    metadata          @1 () -> (meta :OidcProviderMetadata);

    # JWKS fetched from the provider's `jwks_uri`, exposed as a set of
    # PublicKey caps keyed by `kid`. Key rotation is invisible to
    # callers: they ask for a `kid`, they get a current PublicKey or
    # an error.
    jwks              @2 () -> (set :Jwks);

    # Verify an ID token fully — signature against jwks, issuer match,
    # audience match against the registered client, nonce, exp/nbf/iat
    # with clockSkewSeconds from policy, and required `acr`/`amr`
    # predicates if set by the OAuthClient.
    verifyIdToken     @3 (jwt :Data,
                          policy :IdTokenPolicy)
                      -> (claims :IdTokenClaims);
}

struct OidcProviderMetadata {
    issuer                           @0 :Text;
    authorizationEndpoint            @1 :Text;
    tokenEndpoint                    @2 :Text;
    userinfoEndpoint                 @3 :Text;
    jwksUri                          @4 :Text;
    endSessionEndpoint               @5 :Text;
    deviceAuthorizationEndpoint      @6 :Text;
    revocationEndpoint               @7 :Text;
    introspectionEndpoint            @8 :Text;

    responseTypesSupported           @9  :List(Text);
    grantTypesSupported              @10 :List(Text);
    tokenEndpointAuthMethodsSupported @11 :List(Text);
    scopesSupported                  @12 :List(Text);
    idTokenSigningAlgValuesSupported @13 :List(Text);
    codeChallengeMethodsSupported    @14 :List(Text);
    dpopSigningAlgValuesSupported    @15 :List(Text);
    requestObjectSigningAlgValuesSupported @16 :List(Text);
    # Present when the IdP advertises OAuth 2.0 Token Exchange (RFC 8693).
    tokenExchangeSupported           @17 :Bool;
}

struct IdTokenPolicy {
    expectedAudience     @0 :Text;           # registered client_id
    expectedAzp          @1 :Text;           # authorized party, if set
    requiredAcr          @2 :List(Text);     # any-of
    requiredAmr          @3 :List(Text);     # any-of
    maxAgeSeconds        @4 :UInt32;         # per OIDC core `max_age`
    clockSkewSeconds     @5 :UInt32;
    nonceMustMatch       @6 :Data;           # empty = no nonce check
    requireAtHashMatch   @7 :Bool;           # if accessToken present
    requireCHashMatch    @8 :Bool;           # if code present
}

struct IdTokenClaims {
    issuer     @0 :Text;
    subject    @1 :Text;
    audience   @2 :List(Text);
    issuedAt   @3 :Int64;
    expiresAt  @4 :Int64;
    notBefore  @5 :Int64;
    nonce      @6 :Data;

    acr        @7  :Text;
    amr        @8  :List(Text);
    azp        @9  :Text;
    authTime   @10 :Int64;

    email      @11 :Text;
    emailVerified @12 :Bool;
    preferredUsername @13 :Text;
    name       @14 :Text;
    groups     @15 :List(Text);

    # Opaque claim map for everything else (profile-specific fields,
    # custom claims). Values are JSON-encoded bytes.
    additional @16 :List(NamedBlob);
}

struct NamedBlob {
    name  @0 :Text;
    value @1 :Data;
}

IdTokenClaims is intentionally read-only metadata. Possessing a serialized copy must not grant authority. The durable external subject key is subjectHash = hash(providerKind, issuer, tenant, subject); admission policy maps it to a local principal, account, and profiles before a UserSession is minted.

External identity admission

OIDC authentication produces verified claims. It does not by itself create a local account, select local roles, or grant capabilities. After verifyIdToken succeeds, SessionManager resolves the external subject through one of the identity proposal’s admission sources:

  • a manifest-seeded external admission rule for bootstrap, recovery, or early console login before durable storage exists;
  • a local account-store ExternalIdentityBinding that maps hash(providerKind, issuer, tenant, subject) to an existing local principal; or
  • an explicit auto-creation rule that creates a pseudonymous or tenant-scoped account with named policy and resource profiles.

The binding shape belongs with the account model, but OIDC consumers depend on its semantics:

struct ExternalIdentityBinding {
    bindingId @0 :Data;
    provider @1 :Text;        # OIDC issuer or configured provider name
    subjectHash @2 :Data;     # hash(provider kind, issuer, tenant, subject)
    principalId @3 :Data;     # local or pseudonymous principal
    tenant @4 :Text;
    acceptedClaims @5 :List(Text);
    expiresAtMs @6 :UInt64;
    policyProfile @7 :ProfileRef;
    resourceProfile @8 :ProfileRef;
    schemaVersion @9 :UInt32;
    storeEpoch @10 :UInt64;
    recordVersion @11 :UInt64;
    policyEpoch @12 :UInt64;
    previousHash @13 :Data;
    contentHash @14 :Data;
}

OIDC groups, roles, acr, amr, tenant IDs, device posture, source network, and token age are normalized ABAC inputs. A binding rule may map a provider group to a local role only for a named provider/tenant, expiry, and policy version. Imported claims are discarded or refreshed when stale, and roles selected from them remain broker inputs rather than authority.

An external session receives durable storage only when a binding or auto-creation rule maps it to a local principal and a resource profile. Without that mapping, the session is guest, anonymous, or one-shot pseudonymous policy with narrow temporary resources.

OAuth client

interface OAuthClient {
    # Bound configuration.
    info              @0 () -> (meta :OAuthClientMetadata);

    # Authorization Code + PKCE (OAuth 2.1 default). The caller owns
    # the redirect side-channel; this cap drives the token exchange
    # after the code returns.
    startAuthCode     @1 (requested :TokenRequest)
                      -> (authUrl :Text, state :AuthCodeState);
    completeAuthCode  @2 (state :AuthCodeState,
                          code :Data)
                      -> (bundle :TokenBundle);

    # Device Authorization Grant (RFC 8628). Appropriate for serial
    # consoles, embedded displays, and TVs — anywhere the capOS
    # process has no browser.
    startDeviceCode   @3 (requested :TokenRequest)
                      -> (userCode :Text,
                          verificationUri :Text,
                          verificationUriComplete :Text,
                          expiresIn :UInt32,
                          interval :UInt32,
                          state :DeviceCodeState);
    pollDeviceCode    @4 (state :DeviceCodeState)
                      -> (outcome :DeviceCodePoll);

    # Client Credentials (RFC 6749 §4.4). Backend-to-backend; the
    # caller principal is the client itself.
    clientCredentials @5 (requested :TokenRequest)
                      -> (bundle :TokenBundle);

    # Refresh an existing bundle (RFC 6749 §6). Fails if the stored
    # refresh token is expired or revoked.
    refresh           @6 (token :RefreshToken,
                          requested :TokenRequest)
                      -> (bundle :TokenBundle);

    # JWT Bearer (RFC 7523). Caller presents a signed assertion about
    # a subject; issuer returns a token for that subject. Used for
    # service delegation and some IdP federation flows.
    jwtBearer         @7 (assertion :Data,
                          requested :TokenRequest)
                      -> (bundle :TokenBundle);

    # Token Exchange (RFC 8693). Foundation of modern workload
    # identity federation: exchange a subject token (e.g. a signed
    # instance-identity JWT or an attestation report envelope) for
    # an access token at the remote issuer.
    tokenExchange     @8 (subjectToken :Data,
                          subjectTokenType :Text,     # RFC 8693 §2.1
                          actorToken :Data,
                          actorTokenType :Text,
                          requested :TokenRequest)
                      -> (bundle :TokenBundle);

    # Revoke (RFC 7009). Best-effort; not all IdPs honor it.
    revoke            @9 (token :TokenRef, reason :Text) -> ();
}

struct OAuthClientMetadata {
    clientId          @0 :Text;
    issuer            @1 :Text;
    authMethod        @2 :ClientAuthMethod;
    defaultScopes     @3 :List(Text);
    defaultAudience   @4 :Text;
    redirectUris      @5 :List(Text);
    dpopRequired      @6 :Bool;
    pkceRequired      @7 :Bool;     # true for public clients
}

enum ClientAuthMethod {
    none              @0;   # public client with PKCE
    clientSecretBasic @1;   # HTTP Basic; confidential clients
    clientSecretPost  @2;   # form-encoded; legacy
    privateKeyJwt     @3;   # RFC 7523 §2.2 — JwtSigner over a PrivateKey with KeyPurpose.oauthClientAssertion;
                            # when the IdP requires a bound X.509 cert the signer carries a Certificate cap
                            # from certificates-and-tls-proposal.md
    tlsClientAuth     @4;   # RFC 8705 — TlsClientConfig with a client Certificate + PrivateKey from
                            # certificates-and-tls-proposal.md (PKI-rooted) and key-management
    selfSignedTlsClientAuth @5;  # RFC 8705 §2.2 — same shape with a self-signed Certificate published in
                                 # OAuthClientMetadata, no PKI chain required
}

struct TokenRequest {
    scopes            @0 :List(Text);
    audience          @1 :Text;
    resource          @2 :List(Text);   # RFC 8707
    acrValues         @3 :List(Text);
    maxAgeSeconds     @4 :UInt32;
    prompt            @5 :List(Text);   # "login", "consent", "select_account", "none"
    loginHint         @6 :Text;
    nonce             @7 :Data;         # empty = generate fresh
    requestedExpirySeconds @8 :UInt32;  # hint; IdP has final say
    dpopKey           @9 :PrivateKey;   # optional DPoP binding
    extraParams       @10 :List(NamedBlob);
}

struct AuthCodeState {
    opaque @0 :Data;    # server-held state; PKCE verifier, nonce, etc.
}
struct DeviceCodeState {
    opaque @0 :Data;
}

struct DeviceCodePoll {
    union {
        pending     @0 :Void;
        slowDown    @1 :Void;
        expired     @2 :Void;
        denied      @3 :Void;
        granted     @4 :TokenBundle;
    }
}

Tokens

interface AccessToken {
    # Claims view (parsed once; opaque tokens return empty claims and
    # let the IdP's introspection endpoint be the source of truth).
    claims            @0 () -> (claims :TokenClaims);

    # Present the token for a single outbound HTTP request. The token
    # service inserts the Authorization / DPoP headers into the
    # outbound request built by the caller. Raw bytes do not leave
    # the token service through this path.
    authorize         @1 (request :OutboundHttpRequest)
                      -> (prepared :OutboundHttpRequest);

    # Down-scope to a narrower scope set or audience. Fails if the
    # requested scopes are not a subset of the current token's scopes.
    # Implementation can either return a wrapper cap that performs
    # client-side attenuation (for simple bearer tokens) or call the
    # IdP's token-exchange endpoint (for cross-audience narrowing).
    attenuate         @2 (scopes :List(Text),
                          audience :Text)
                      -> (narrower :AccessToken);

    # Explicit export for rare cases where the caller truly needs the
    # raw token (e.g. a capOS HTTP client that has no token-aware
    # stack). Emits an audit event naming the reason. Excluded by
    # default from attenuated caps returned by `attenuate`.
    exportRaw         @3 (reason :Text) -> (bytes :Data);

    # Token reference for revocation, introspection, or logging. Always
    # a hash or opaque ID; never the raw token.
    reference         @4 () -> (ref :TokenRef);

    # Expiry information for client-side backoff and pre-refresh.
    expiry            @5 () -> (notBefore :Int64, notAfter :Int64);
}

interface RefreshToken {
    # Refresh tokens are always longer-lived secrets. They do not
    # expose an `authorize` path — their only use is through
    # OAuthClient.refresh. `reference` and `expiry` match AccessToken.
    reference         @0 () -> (ref :TokenRef);
    expiry            @1 () -> (notBefore :Int64, notAfter :Int64);
    # Export is available but guarded and audited; used for migration
    # between token stores, not for ordinary operation.
    exportRaw         @2 (reason :Text) -> (bytes :Data);
}

interface IdToken {
    claims            @0 () -> (claims :IdTokenClaims);
    raw               @1 (reason :Text) -> (bytes :Data);
}

struct TokenBundle {
    access    @0 :AccessToken;
    refresh   @1 :RefreshToken;   # may be null
    id        @2 :IdToken;        # may be null
    expiresIn @3 :UInt32;
    tokenType @4 :Text;           # "Bearer" or "DPoP"
    scopes    @5 :List(Text);
    resource  @6 :List(Text);
}

struct TokenClaims {
    issuer     @0 :Text;
    subject    @1 :Text;
    audience   @2 :List(Text);
    scope      @3 :List(Text);
    clientId   @4 :Text;
    issuedAt   @5 :Int64;
    expiresAt  @6 :Int64;
    notBefore  @7 :Int64;
    jwtId      @8 :Text;
    # Confirmation (cnf, RFC 7800) for proof-of-possession tokens.
    confirmation @9 :TokenConfirmation;
    additional @10 :List(NamedBlob);
}

struct TokenConfirmation {
    union {
        none     @0 :Void;
        jkt      @1 :Data;        # DPoP: thumbprint of holder public key
        x5tS256  @2 :Data;        # mTLS client cert thumbprint
    }
}

struct TokenRef {
    kind      @0 :TokenRefKind;
    value     @1 :Data;           # hash, jti, or opaque server ref
}

enum TokenRefKind {
    jti       @0;    # JWT `jti`
    sha256    @1;    # SHA-256 of the raw token
    serverId  @2;    # opaque IdP-side identifier
}

struct OutboundHttpRequest {
    method    @0 :Text;
    url       @1 :Text;
    headers   @2 :List(NamedBlob);
    body      @3 :Data;
}

Token verifier (resource-server side)

interface TokenVerifier {
    # Validate an inbound bearer token: signature against the
    # provider's JWKS (or introspection endpoint for opaque tokens),
    # issuer, audience, expiry, required scopes, confirmation claim
    # (DPoP proof or mTLS peer cert).
    verifyAccess      @0 (token :Data,
                          policy :TokenVerificationPolicy,
                          proof :VerificationProof)
                      -> (outcome :TokenVerificationOutcome);
}

struct TokenVerificationPolicy {
    expectedIssuer    @0 :Text;
    expectedAudience  @1 :Text;
    requiredScopes    @2 :List(Text);     # all-of
    anyRequiredScopes @3 :List(Text);     # any-of if non-empty
    clockSkewSeconds  @4 :UInt32;
    requireConfirmation @5 :ConfirmationKind;   # none / dpop / mtls
    allowedAlgorithms @6 :List(SignatureScheme);
    allowIntrospection @7 :Bool;          # opaque token support
}

enum ConfirmationKind {
    none @0;
    dpop @1;
    mtls @2;
}

struct VerificationProof {
    union {
        none      @0 :Void;
        dpopProof @1 :DpopProof;
        peerCert  @2 :Certificate;
    }
}

struct DpopProof {
    jwt       @0 :Data;                   # the DPoP header value
    httpMethod @1 :Text;
    httpUrl   @2 :Text;
    nonce     @3 :Data;                   # server-issued DPoP nonce (RFC 9449 §8)
}

struct TokenVerificationOutcome {
    union {
        valid   @0 :ValidToken;
        invalid @1 :TokenVerificationFailure;
    }
}

struct ValidToken {
    claims     @0 :TokenClaims;
    algorithm  @1 :SignatureScheme;
    keyId      @2 :Text;
}

struct TokenVerificationFailure {
    reason     @0 :TokenFailureReason;
    detail     @1 :Text;
}

enum TokenFailureReason {
    badSignature        @0;
    unknownKeyId        @1;
    unexpectedIssuer    @2;
    audienceMismatch    @3;
    expired             @4;
    notYetValid         @5;
    insufficientScopes  @6;
    missingConfirmation @7;
    dpopMismatch        @8;
    mtlsThumbprintMismatch @9;
    revoked             @10;
    malformed           @11;
    introspectionDenied @12;
    weakAlgorithm       @13;
}

JWKS

interface Jwks {
    # Public keys exposed as PublicKey caps keyed by `kid`. Key
    # rotation is invisible to callers: they ask for a `kid`, they
    # get a current PublicKey or `unknownKeyId`.
    keyById    @0 (kid :Text) -> (key :PublicKey);

    # Enumerate keys (for diagnostics and admin). Returns metadata
    # only; PublicKey caps come from keyById.
    listKeyMeta @1 () -> (keys :List(JwkMeta));

    # Monotonic version bumped on every refresh that changes the key
    # set; consumers cache validated signatures by (kid, version).
    version    @2 () -> (n :UInt64);

    # Force a refresh. Audited. Called automatically on
    # verification-time `unknownKeyId`; exposed for admin use.
    refresh    @3 () -> ();
}

struct JwkMeta {
    kid        @0 :Text;
    algorithm  @1 :AsymmetricAlgorithm;
    use        @2 :Text;        # "sig" or "enc"
    scheme     @3 :SignatureScheme;
    createdAt  @4 :Int64;
}

Workload identity federation

interface WorkloadIdentityFederation {
    # Produce a fresh remote access token by exchanging a local
    # subject token (instance-identity JWT, attestation report,
    # Kubernetes projected token, GitHub Actions OIDC token, ...)
    # at the remote issuer per RFC 8693.
    exchange   @0 (requested :TokenRequest)
               -> (bundle :TokenBundle);

    info       @1 () -> (meta :WorkloadFederationMeta);
}

struct WorkloadFederationMeta {
    remoteIssuer        @0 :Text;
    audience            @1 :Text;
    subjectSource       @2 :SubjectSource;
    allowedScopes       @3 :List(Text);
    minRefreshInterval  @4 :UInt32;
}

enum SubjectSource {
    instanceIdentityJwt    @0;   # from CloudMetadata.InstanceIdentity
    attestationReport      @1;   # SEV-SNP / TDX / Nitro
    projectedServiceAccount @2;  # e.g. Kubernetes projected token
    githubActionsOidc      @3;   # ci/cd trust anchor
    localPrivateKeyJwt     @4;   # private_key_jwt against remote STS
}

DPoP

DPoP (RFC 9449) binds a token to a client-held key. The binding is expressed as a TokenConfirmation.jkt on the access token plus a short-lived proof JWT per outbound request.

interface DpopSigner {
    # Returns a fresh DPoP proof JWT for an outbound request. The
    # signer holds the PrivateKey; the access token is `jkt`-bound to
    # that key's thumbprint.
    newProof   @0 (method :Text,
                   url :Text,
                   accessTokenHash :Data,
                   serverNonce :Data)
               -> (proofJwt :Data);

    publicKey  @1 () -> (pk :PublicKey);
}

A deployment requiring DPoP composes AccessToken + DpopSigner in a wrapper cap so ordinary callers just invoke authorize and receive a request with both Authorization: DPoP ... and DPoP: headers set.

JWT wrappers over key-management primitives

These live here rather than in Cryptography and Key Management because JWT is a protocol frame, not a crypto primitive. JwtSigner and JwtVerifier are thin adapters over PrivateKey / PublicKey caps issued by a KeyVault sealed under a KeySource from that proposal. The signing key’s KeyPurpose is oauthClientAssertion for private_key_jwt client authentication or oidcIdToken for a LocalIdentityProvider mint path; KeyVault refuses to bind a key whose declared purpose set does not include the JWT use, so a key minted for TLS server identity cannot silently sign client assertions.

interface JwtSigner {
    # Sign a compact-serialized JWT. The key lives in a KeyVault;
    # JwtSigner is the schema-aware wrapper.
    sign       @0 (header :JwtHeader, claims :Data) -> (jwt :Data);
    publicKey  @1 () -> (pk :PublicKey);
    keyId      @2 () -> (kid :Text);
}

interface JwtVerifier {
    verify     @0 (jwt :Data, policy :JwtVerifyPolicy)
               -> (outcome :JwtVerifyOutcome);
}

struct JwtHeader {
    algorithm  @0 :SignatureScheme;
    keyId      @1 :Text;
    type       @2 :Text;            # "JWT", "at+jwt", "dpop+jwt"
    contentType @3 :Text;
}

struct JwtVerifyPolicy {
    expectedIssuer    @0 :Text;
    expectedAudience  @1 :Text;
    expectedType      @2 :Text;
    allowedAlgorithms @3 :List(SignatureScheme);
    clockSkewSeconds  @4 :UInt32;
    jwksSource        @5 :Jwks;     # preferred
    staticKey         @6 :PublicKey; # for single-signer cases
}

struct JwtVerifyOutcome {
    union {
        valid   @0 :JwtValid;
        invalid @1 :JwtInvalid;
    }
}

struct JwtValid {
    header     @0 :JwtHeader;
    claims     @1 :Data;
}

struct JwtInvalid {
    reason     @0 :TokenFailureReason;
    detail     @1 :Text;
}

SignatureScheme and AsymmetricAlgorithm are reused from Cryptography and Key Management; adding ps256/ps384/ps512 and es256/es384/es512 aliases there covers the JWT algorithm registry without duplicating the enum here. Token verifiers receive the allow-list as a List(SignatureScheme) from that proposal so a deployment can refuse HS* family algorithms uniformly across JWT, TLS, and signing without a parallel OIDC-specific configuration knob.

Grant Types in Detail

Authorization Code + PKCE

Used by the web text shell gateway and any capOS native app with a browser. PKCE is mandatory (OAuth 2.1); code_challenge_method = S256. The AuthCodeState.opaque blob carries the PKCE verifier, nonce, and original TokenRequest; callers do not see or store these values. Redirect URIs are validated against OAuthClientMetadata exactly, no partial matching.

Device Authorization Grant (RFC 8628)

Used on serial consoles and other no-browser surfaces. The console prints verification_uri and user_code; the user completes the flow in a separate device’s browser; the console polls the token endpoint. pollDeviceCode honors the slow_down response and caps its polling rate at the IdP-advertised interval. Expiration is a hard fail; the console must restart the flow.

This is the primary OIDC path for boot-to-shell on headless hosts and for interactive cloud-VM serial consoles.

Client Credentials

Used for backend-to-backend service identity. The calling service holds an OAuthClient cap configured with privateKeyJwt or tlsClientAuth. No user is involved; the subject of the issued token is the client itself.

Refresh

Used to rotate a short-lived access token without re-authenticating the user. Refresh tokens are long-lived secrets and live in the RefreshToken cap; they never appear as bytes in session state. Rotated refresh tokens (IdPs that issue a new refresh token on every refresh) are installed into the same cap transparently.

JWT Bearer (RFC 7523)

Used for federation between systems that trust a common signing key. A capOS service holding a JwtSigner can mint an assertion identifying a subject and exchange it at the IdP for a token acting on behalf of that subject. Used sparingly; the delegation implication is strong.

Token Exchange (RFC 8693)

The foundation of modern workload identity federation. Described in its own subsystem (WorkloadIdentityFederation) because the subject-token source is platform-specific. Concrete mappings:

  • AWS IRSA / IAM OIDC provider: subject token is a Kubernetes projected service-account JWT; IdP is sts.amazonaws.com; AssumeRoleWithWebIdentity returns AWS-scoped credentials.
  • GCP Workload Identity Federation: subject token is an InstanceIdentity JWT from the GCE metadata service or a Kubernetes projected token; IdP is sts.googleapis.com; returned token is usable against GCP APIs including Cloud KMS.
  • Azure federated identity credentials: subject token is an OIDC token from a trusted IdP (GitHub, GitLab, Kubernetes, another capOS instance); IdP is Azure AD; returned token is a standard AAD access token.
  • SEV-SNP / TDX / Nitro attestation: subject token is the attestation report envelope; IdP is the cloud KMS or a standalone attestation verifier; returned token authorizes a KMS Decrypt against an attestation-policy-gated KEK.

In every case the capOS image contains no long-lived credentials. The boot path produces a local InstanceIdentity cap, passes it to a WorkloadIdentityFederation configured for the target cloud, and receives short-lived tokens that KMS and other services accept.

Trust Bootstrap

An OidcIdentityProvider cap is created by a trusted service (the OAuth service) from a provider configuration record. The record includes:

  • The canonical issuer URL.
  • One of:
    • a fixed JWKS snapshot baked into the manifest (for air-gapped or hermetic deployments), or
    • the discovery URL plus one or more pinned root certs / SPKI hashes for the TLS connection that fetches discovery and JWKS (via TlsClientConfig and PinSet from the certificates proposal).
  • Acceptable algorithms (allowedAlgorithms in policy).
  • Minimum token lifetime and maximum clock skew.
  • Whether the IdP advertises token exchange, DPoP, and PAR (pushed authorization requests).
  • Client registrations allowed to use this IdP.

The trust root for OIDC verification is ultimately the TLS trust chain back to a certificate authority plus the discovery document’s signing policy. OidcIdentityProvider therefore depends on a TlsClientConfig from Certificates and TLS, not on raw sockets, and IdP pinning composes with a PinSet cap from the same proposal so that issuer-specific roots and SPKI hashes never share state with the ambient WebPKI trust store. Issuer URL mismatches and JWKS failures are hard errors; neither falls back to unauthenticated HTTP.

For enterprise IdPs that rotate signing keys frequently, Jwks caches keys with a short TTL and refreshes on verification-time unknownKeyId. A deployment that wants to forbid automatic refresh (to pin a specific key set) configures Jwks with refresh disabled; key rotation then requires a manifest update.

Authentication Strength Mapping

SessionInfo.authStrength from user-identity uses X.1254 LoA tiers. OIDC acr/amr claims map as follows (deployment-configurable):

  • loa1 — self-asserted, amr includes pwd only, acr absent or low.
  • loa2 — single-factor, amr includes pwd, pin, or email-link style.
  • loa3 — multi-factor with a hardware-backed credential: amr contains hwk, swk with device attestation, face+pwd, fpt+pwd, or equivalent; acr typically names a named MFA policy.
  • loa4 — high-assurance, typically requires identity proofing plus tamper-resistant hardware: amr contains pop with attested device, hwk plus face, in-person proofing claims, or vendor- specific high-assurance acr values.

The mapping table lives in OAuthClientMetadata policy, not hard- coded. loa0 (anonymous) is capOS-specific and has no matching OIDC claim; anonymous sessions do not use OIDC.

Consumers

ConsumerUses
Console login (device code)OAuthClient.startDeviceCode + verifyIdToken
Web text shell loginOAuthClient.startAuthCode + verifyIdToken; TLS from certs
Cloud KMS access (no baked creds)WorkloadIdentityFederation.exchangeAccessToken.authorize
CloudKmsKeySource unlockWraps AccessToken.authorize; no ambient cloud credentials
Service-to-service outbound HTTPOAuthClient.clientCredentials + AccessToken.authorize
Inbound API token validationTokenVerifier.verifyAccess
Per-user EncryptedNamespaceOidcFederatedKeySource derives KEK from user’s AccessToken
Audit / telemetry exportService identity via client credentials + DPoP
CI/CD runtime trustWorkloadIdentityFederation from GitHub Actions OIDC

Threat Model

Specific to this subsystem:

  1. Token leakage via logs. The classic OAuth failure mode. Raw tokens never leave the OAuth service through claims, references, or audit records. exportRaw is the only escape hatch and is audited with a mandatory reason string.
  2. Refresh token theft. Refresh tokens are long-lived secrets. Mitigations: storage in the same service that holds the cap (not in session state, not in cookies readable by shells), optional rotation on refresh, revocation on logout.
  3. Replay of bearer tokens. A stolen bearer token is usable until expiry. Mitigations: short TTLs; require DPoP (ConfirmationKind.dpop) for sensitive resources; mTLS for service-to-service. Nonce-bound DPoP proofs (RFC 9449 §8) with server-issued nonces where the resource server supports them.
  4. Mixed-IdP confusion. A token issued by IdP A is presented to a verifier expecting IdP B. Mitigations: strict expectedIssuer match; audience binding; IdP-specific OAuthClient caps so services cannot confuse two OIDC providers; RFC 9207 iss parameter on authorization responses verified before the token exchange.
  5. Discovery-document tampering. An attacker on the TLS path returns a forged discovery document or JWKS. Mitigations: pinned TLS roots or SPKI hashes per IdP; JWKS fetched over the same pinned TLS client; signature algorithm allow-list rejects downgrades; manifest-defined acceptable discovery URL prevents runtime redirect to attacker IdP.
  6. PKCE downgrade. A public client accepts a token without proving possession of the code verifier. Mitigations: PKCE is mandatory (pkceRequired = true is not a bit the caller can clear from a derived cap); code_challenge_method = S256 only.
  7. Authorization code replay. A leaked code is redeemed by an attacker. Mitigations: PKCE binds the code to the verifier the browser holds; codes are single-use; redirect URI exact match.
  8. Open redirector via redirect URI. Mitigations: exact-match redirect URIs per registration; no substring matching; validated at both startAuthCode and completeAuthCode.
  9. Cross-site request forgery on the authorization request. Mitigations: state parameter generated from EntropySource, stored in AuthCodeState.opaque, checked on completion; PKCE adds a second CSRF-resistant binding.
  10. OIDC nonce omission. Missing nonce on the ID token allows replay of an ID token from another session. Mitigations: IdTokenPolicy.nonceMustMatch is mandatory for interactive logins; verifyIdToken refuses an ID token whose nonce does not match the one baked into AuthCodeState.opaque.
  11. Mis-issued sub claim. An IdP reuses a sub across tenants or rebinds it. Mitigations: the external subject key includes provider kind, issuer, normalized tenant, and subject; it is never sub alone. Tenant-scoped IdPs (Azure AD per-tenant, Google Workspace) still record the tenant explicitly before hashing.
  12. JWKS flooding. An attacker forces repeated unknownKeyId failures to trigger JWKS refreshes. Mitigations: refresh rate- limited per Jwks cap; audit events recorded; repeated failures fail closed rather than refresh-in-loop.
  13. Token exchange policy evasion. An attacker with a narrow subject token exchanges it for a broader one. Mitigations: the remote issuer enforces its own policy on token exchange; capOS cannot prevent a misconfigured STS. Defense is to pin WorkloadFederationMeta.remoteIssuer and inspect returned scopes against allowedScopes.
  14. Clock skew attacks. Old tokens accepted, new tokens rejected. Mitigations: clockSkewSeconds is small by default; consumers that require hard bounds use an attested time source.

Security Verification in ITU-T and GOST Terms

OIDC and OAuth2 are IETF/OpenID protocols, but capOS’s broader security vocabulary is ITU-T/ISO-IEC plus, where a deployment requires it, GOST. The same verification surface this proposal defines (signature checks on JWTs, discovery and JWKS integrity, token binding, claim validation, scope enforcement, auth strength, logout semantics) maps onto those frameworks cleanly; it does not require a parallel vocabulary.

ITU-T X.805 security dimensions

X.805 (“Security architecture for systems providing end-to-end communications”) decomposes security into eight dimensions. The relevant mapping for the OIDC subsystem:

X.805 dimensionWhere it lives
Access controlAuthorityBroker + CapObject::call (ADF/AEF per X.812); scopes/claims as ABAC input
AuthenticationOidcIdentityProvider.verifyIdToken; TokenVerifier.verifyAccess
Non-repudiationSigned JWTs + audit records of verifyIdToken, issuance, exchange, revocation
Data confidentialityTLS for every IdP call; encrypted audit payloads where applicable
Communication securityTlsClientConfig from the certificates proposal; issuer-pinned roots; HSTS-equivalent pins
Data integrityJWT signature verification against Jwks; kid rotation handled without trust decay
AvailabilityFailure semantics defined for JWKS refresh, device-code expiry, introspection outage
PrivacyIdTokenClaims exposes only claims the client actually requires; scope minimization

Each cell is a discrete thing to verify, test, and review. Keeping the dimensions explicit makes gaps visible: for example, “what is our availability story if the IdP’s JWKS endpoint is down for 2 hours” is a concrete X.805 question; the answer is Jwks cache TTL + refresh audit + fail-closed behavior on unknown kid.

ITU-T X.812 ADF/AEF

Already inherited from user-identity-and-policy-proposal.md. The OIDC-specific instances:

  • AEF (enforcement point): CapObject::call and wrapper caps (AccessToken.authorize, TokenVerifier.verifyAccess). Bypass requires subverting the cap graph, not forging a claim.
  • ADF (decision point): the OAuth service when issuing a token, AuthorityBroker when returning scoped caps, TokenVerifier when accepting a bearer. A decision returns a capability (or denial); it does not return a boolean that downstream code might ignore.

ITU-T X.1254 / ISO/IEC 29115 LoA

Already built into the mapping: IdTokenClaims.acr/amrAuthStrength enum (loa1..loa4). The mapping table lives in OAuthClientMetadata so each deployment can specify which IdP acr/amr values count as each tier. SealPolicy.tokenExchange carries minAuthStrength so unlock policy can say “require LoA 3+” without knowing any specific IdP’s acr taxonomy.

ITU-T X.1252 identity management terms

X.1252 defines identity, credential, entity, enrolment, identity provider, relying party, and identity assurance. The proposal’s entities map directly:

X.1252 termcapOS realization
Identity providerOidcIdentityProvider
Relying partyAny service holding an OAuthClient cap
CredentialBearer, DPoP-bound, or mTLS-bound AccessToken; IdToken; RefreshToken
EnrolmentCredentialStore bootstrap of IdP trust records + subject allow-list
Assurance levelAuthStrength (= X.1254 LoA)
Attribute authorityIdP via claims; optionally PolicyEngine for derived ABAC attributes
Identity bindingCanonical subjectHash mapped to a local PrincipalInfo.id; never sub alone

ITU-T X.1255 and federated-identity discovery

X.1255 is the discovery framework for federated identity. The closest IETF analog is OIDC Discovery (RFC 8414 + OpenID Discovery 1.0). OidcIdentityProvider.metadata is the capOS surface for both. The manifest-declared discovery URL + pinned TLS root closes the “federated discovery must be trustworthy” requirement that X.1255 leaves to deployment.

ITU-T X.813 / ISO/IEC 10181-4 non-repudiation

Non-repudiation of authentication comes from the IdP’s signed ID token. Non-repudiation of authorization decisions comes from AuditLog records that include the decision inputs (claim summaries, policy IDs, outcome). The framework deliberately does not promise non-repudiation of shell commands — the agent shell is a planner, not a signer of operator intent.

IETF OAuth security BCPs

For completeness, the proposal tracks:

  • RFC 6819 — OAuth 2.0 Threat Model and Security Considerations. Covered by the threat-model section above.
  • RFC 9700 — OAuth 2.0 Security Best Current Practice. The “PKCE mandatory”, “exact redirect URI match”, “mix-up defense via iss parameter”, “no implicit grant”, and “rotate refresh tokens” items are all baked in rather than opt-in.
  • FAPI 2.0 (OpenID Foundation) — financial-grade API profile. Useful as a pre-packaged high-assurance profile: DPoP or mTLS sender-constrained tokens, PAR, signed authorization requests, strict algorithms. OAuthClientMetadata is deliberately shaped so a “FAPI profile” is a set of required fields, not a separate interface.

GOST MAC/MIC (ГОСТ Р 59383-2021, ГОСТ Р 59453.1-2021)

capOS’s mandatory-access-control and mandatory-integrity-control story is described at two levels in User Identity and Policy and Formal MAC/MIC:

  • a pragmatic level where userspace brokers and wrapper caps enforce labels at grant paths, and
  • a formal level where an abstract automaton (subjects, objects, containers, hold edges, rights, accesses, information flows) carries explicit safety predicates and proof obligations in the shape ГОСТ Р 59453.1-2021 requires.

OIDC integration must fit both levels without introducing a second authority channel. Concretely:

Federated principals and subjects

An OIDC-authenticated UserSession creates a subject in the formal automaton with:

  • subjectHash = hash(providerKind, iss, tenant, sub) — the durable external subject key, not reusable across IdPs or tenants. The local session principal may be a pseudonymous principal created for this external key or the local principal named by an ExternalIdentityBinding.
  • confidentiality_label and integrity_label resolved by LabelAuthority from the policy profile plus optional claim-derived refinement (e.g. groups = ["ops"] narrows to a specific compartment). Claims influence labels at mint time; they are not authority downstream.
  • authStrength from acr/amr, already folded into LoA tiers.

The create_session transition in the formal automaton therefore has one additional precondition when the login method is OIDC:

create_session(principal, policy_profile, resource_profile, oidc_proof):
  pre:
    verify_id_token(oidc_proof) succeeds with IdTokenPolicy
    IdP trust record in CredentialStore permits (subjectHash, policy_profile)
    manifest seed or AccountStore admission permits the binding
    acr/amr satisfy policy_profile's minimum AuthStrength
    subject allow-list or ExternalIdentityBinding admits subjectHash
  effect:
    new subject s with labels derived from policy_profile + claims
    Hold(s, session_bundle) per AuthorityBroker(
        session,
        policy_profile,
        resource_profile,
    )

This is the same precondition shape as password and passkey login — the safety proof does not branch on authentication method. It only requires that verify_id_token is modeled as a trusted verifier that rejects inputs failing the IdP’s published policy.

Integrity labels on IdP trust

IdP trust records carry an integrity level. An IdP configured as the corporate operator IdP can mint sessions with higher integrity than an IdP configured for guest/partner access. LabelAuthority encodes this in the trust-record metadata; SessionManager refuses to mint a session whose policy profile claims higher integrity than the admitting IdP’s integrity level.

The formal invariant:

integrity(session) <= integrity(admitting_IdP_trust_record)

This closes the federated analog of “any trusted login path can mint a maximally trusted session” — a gap that is easy to introduce by accident when enterprises add a second, looser IdP.

Flow classes for token capabilities

Each method on the typed token interfaces needs a flow class per the formal-mac-mic-proposal.md table. Proposed classifications:

OidcIdentityProvider.verifyIdToken    ReadLike + NoFlow   (pure verification)
OidcIdentityProvider.metadata         ObserveLike
Jwks.keyById                          ObserveLike
Jwks.refresh                          ControlLike         (on the Jwks object)

OAuthClient.startAuthCode             ObserveLike         (emits a URL; no subject-bearing data crosses)
OAuthClient.completeAuthCode          TransferLike        (materializes new authority as a token cap)
OAuthClient.startDeviceCode           ObserveLike
OAuthClient.pollDeviceCode            TransferLike        (on "granted")
OAuthClient.clientCredentials         TransferLike
OAuthClient.refresh                   TransferLike
OAuthClient.jwtBearer                 TransferLike + ControlLike  (delegation)
OAuthClient.tokenExchange             TransferLike        (see narrowing below)
OAuthClient.revoke                    ControlLike

AccessToken.claims                    ReadLike            (claims are metadata of the token object)
AccessToken.authorize                 WriteLike           (outbound side-effect under the token's authority)
AccessToken.attenuate                 TransferLike        (narrower cap minted)
AccessToken.exportRaw                 Declassify          (trusted, audited, restricted)
AccessToken.reference/expiry          ObserveLike

RefreshToken.*                        as above; exportRaw is Declassify
IdToken.raw                           Declassify
IdToken.claims                        ReadLike

TokenVerifier.verifyAccess            ReadLike + NoFlow
DpopSigner.newProof                   WriteLike           (produces a short-lived authenticator bound to a request)
WorkloadIdentityFederation.exchange   TransferLike

The key formal-level consequences:

  • AccessToken.exportRaw, RefreshToken.exportRaw, and IdToken.raw are Declassify transitions. They must be modeled as trusted transitions with explicit audit. Excluding them by default from attenuated caps is consistent with the formal model’s requirement that declassification go through explicit trusted subjects.
  • OAuthClient.tokenExchange and AccessToken.attenuate are TransferLike; they cannot widen authority. The safety predicate is “issued-token scope ⊆ input-token scope ∩ policy permits” — exactly the wrapper-narrowing rule from user-identity-and-policy-proposal.md. The proof obligation is a scope-monotonicity lemma on the server side; capOS verifies the result by comparing TokenClaims before accepting the returned cap.
  • AccessToken.authorize is WriteLike against the external resource. In the formal model this is an outbound information flow from the subject to an object whose label is the label of the downstream service the broker wired into the request. Deployments needing a MIC proof must ensure the broker refuses to bind a low-integrity session’s token into a request against a high-integrity service — the integrity(src) >= integrity(dst) rule applied through the broker.

Token attenuation as the wrapper-cap discipline

ГОСТ Р 59453.1 requires that every transfer either preserves or narrows rights; capability attenuation is the capOS mechanism for that. OIDC’s scope is a list of strings; treating scope narrowing as cap attenuation means the verifier at issuance time must reject any attenuate / tokenExchange result whose claimed scope is not a subset of the source token’s scope. This is already the spec’s behavior — the point is that the capOS implementation must enforce it locally as well, because a misbehaving STS would otherwise be a covert widening channel.

Subject-controls-subject and delegation

OAuthClient.jwtBearer (RFC 7523) lets a client speak on behalf of another principal. That is a ControlLike transition in the formal model: the invoking subject is exercising control over the minted subject. The safety predicate is:

supervise_allowed(invoker, delegated):
  integrity(invoker) >= integrity(delegated)
  and invoker holds a delegation capability for the target IdP
  and confidentiality/compartment labels are compatible

This is the formal reason a jwtBearer cap is not a default session authority — it must come from a broker that checks the control relation.

Endpoint declarations

formal-mac-mic-proposal.md requires every endpoint to declare its flow policy. For OIDC-facing services that is:

  • OidcIdentityProvider endpoints declare ObserveLike on metadata calls and ReadLike on verification.
  • OAuthClient endpoints declare the flow classes above and bind the output token’s label to min(session.label, target_audience.label).
  • TokenVerifier endpoints declare ReadLike and bind the verified claims to the caller’s object label (the claims flow into an object owned by the calling service).

Declaring these up-front lets the formal-mac-mic-proposal.md review gate apply without a separate OIDC-specific checker.

ГОСТ Р 58833-2020 — identification and authentication

Beyond MAC/MIC, ГОСТ Р 58833-2020 defines organizational and technical requirements for identification and authentication. OIDC integration satisfies its technical baseline:

  • Identifiers use subjectHash = hash(providerKind, issuer, tenant, subject); subject reuse across IdPs or tenants is disallowed by construction.
  • Credentials (tokens, refresh tokens, DPoP keys) are held inside the OAuth service; raw material does not reach the model, the shell, or audit.
  • Issuance and revocation (OAuthClient.startAuthCode/ startDeviceCode/clientCredentials/..., revoke, SessionManager.logout) are audited.
  • Credential-strength policy is selectable per resource via minAuthStrength on seal policies and broker decisions, aligned to X.1254 / ISO/IEC 29115 LoA.

Organizational measures (credential lifecycle, incident response, operator training) remain a deployment responsibility the OS cannot enforce alone.

Proof-obligation checklist

A deployment aiming for a GOST-style MAC/MIC claim with OIDC federation must add these obligations to the formal-mac-mic-proposal.md proof. The checklist is explicit so reviewers can point at individual items, and so each obligation maps to one of the tools listed in the next subsection.

  1. verify_id_token totality and policy soundness. Modeled as a trusted total function. Accepts only well-formed tokens under the configured IdTokenPolicy (issuer, audience, acr/amr, exp/nbf/iat with bounded skew, nonce, at_hash / c_hash when applicable). Returns IdTokenClaims or a failure reason; never silently downgrades.
  2. PKCE binding. No completeAuthCode(state, code) succeeds unless state was produced by a prior startAuthCode and the PKCE verifier stored in AuthCodeState.opaque hashes to the code_challenge the IdP recorded for code.
  3. Nonce binding. verifyIdToken accepts an ID token only if claims.nonce equals the nonce stored in AuthCodeState.opaque for the matching state. Missing-nonce ID tokens on interactive logins are rejected.
  4. state binding. The authorization response’s state matches the one minted from EntropySource at startAuthCode.
  5. Scope monotonicity. Every TransferLike token transition (AccessToken.attenuate, OAuthClient.tokenExchange, OAuthClient.refresh, OAuthClient.jwtBearer) produces a result whose scope is a subset of the input scope intersected with the broker/IdP-permitted set. No transition widens scope.
  6. JWKS live-set invariant. A token signed under kid = k is accepted iff k was present in Jwks at some time t with iat - clockSkew ≤ t ≤ now. Rotation that removes k does not retroactively invalidate tokens already verified under it; rotation that adds k does not accept tokens older than its introduction.
  7. Device-code polling discipline. pollDeviceCode honors the IdP-issued interval; slow_down responses monotonically increase the local backoff; expired and denied are terminal.
  8. Refresh rotation invariant. A successful OAuthClient.refresh that rotates the refresh token marks the prior RefreshToken cap broken; any subsequent use returns revoked (no parallel use of two generations of the same refresh-token family).
  9. Session-creation MAC/MIC predicate. create_session with an OIDC proof establishes integrity(session) ≤ integrity(admitting_IdP_trust_record) and confidentiality(session) ⊑ confidentiality_ceiling(policy_profile, claims).
  10. Broker-outbound MAC/MIC predicate. When the broker binds an AccessToken to an outbound request, the call site label satisfies integrity(src) ≥ integrity(dst) and the confidentiality flow is permitted. jwtBearer delegations additionally satisfy supervise_allowed(invoker, delegated).

Additional implicit obligations:

  • Declassify transitions (AccessToken.exportRaw, RefreshToken.exportRaw, IdToken.raw) are restricted to trusted subjects and produce audit records with the mandatory reason argument.
  • Endpoint flow declarations for OIDC services cover every method in the schemas above; adding a new method without a declaration is a review failure.

These are additive to the obligations in formal-mac-mic-proposal.md. None require a new kernel mechanism; they extend the same wrapper-cap / endpoint-flow-declaration discipline to OIDC-backed subjects and token-typed capabilities.

Tool assignment

ObligationPrimary toolNotes
1. verify_id_token totalityTLA+ + KaniTLA+ models the trusted function; Kani proves the Rust impl is total
2. PKCE bindingTLA+3-state machine (started/completed/failed); invariant on state
3. Nonce bindingTLA+Joint state with PKCE, same module
4. state bindingTLA+Joint with 2/3; plus Alloy for EntropySource uniqueness
5. Scope monotonicityAlloy + PrustiAlloy for the attenuate/exchange graph; Prusti as post-condition
6. JWKS live-set invariantTLA+Temporal property; Apalache if TLC state-space explodes
7. Device-code polling disciplineZ.100 SDL + TLA+SDL for state + timer structure; TLA+ for liveness
8. Refresh rotation invariantTLA+Safety + single-generation liveness
9. Session-creation MAC/MIC predicateAlloyExtends the hold-edge graph model from formal-mac-mic
10. Broker-outbound MAC/MIC predicateAlloySame model; predicate over outbound endpoint declarations
Declassify auditingKaniRust-level: every exportRaw path writes an audit record
Endpoint flow declarationsReview gate + AlloyEnumerate methods, check coverage relationally

Supporting artifacts (useful even before the full proof lands):

  • Z.120 MSC sequence charts for the three primary flows (authorization code + PKCE, device code, token exchange). MSC traces from a running capOS are already shaped like the sequence dumps tools/ccs produces for capability rings, which makes property-checking “no RETURN without a matching CALL” a straightforward analog to “no token issuance without a matching authorization event.”
  • Proptest/fuzz harnesses for the JWT parser, claim validator, PKCE verifier hash, DPoP proof parser, and discovery-document parser. These are not formal proofs but are the first line of defense for obligations 1 and 2. Tracked under the existing security-and-verification-proposal.md tiered tooling plan.
  • Loom model for the concurrent Jwks refresh path: multiple verification requests racing with a refresh triggered by unknownKeyId. Obligation 6’s live-set invariant is the correctness condition.

Out of scope for the formal track

  • The external IdP’s own correctness. The model treats the IdP as a trusted oracle that emits signed tokens matching its published policy; bugs in the IdP itself are not capOS-provable.
  • Network-layer adversaries. TLS authentication of the IdP and the token endpoint is assumed; that proof lives in the certificates proposal’s track.
  • Timing and microarchitectural side channels on signature verification and DPoP checks. Treated as deployment-level mitigations (constant-time libraries, cache partitioning) rather than modeled flows.
  • User behavior. Phishing, social engineering, and operator credential sharing are outside the model.
  • IdP key compromise. Modeled as an assumption violation; the formal proof cannot recover from a signing-key compromise at the IdP.

Note on GOST cryptographic primitives

Separately from MAC/MIC, a deployment may also require GOST cryptographic algorithms (GOST R 34.10-2012 signatures, GOST R 34.11-2012 Streebog hashes, GOST R 34.12-2015 symmetric ciphers) throughout the JWT/JWS and TLS stack. Those are additive enum extensions in SignatureScheme, HashAlgorithm, AsymmetricAlgorithm, and SymmetricAlgorithm across the key, certificates, and OIDC proposals plus a certified cryptographic library. The interface shape does not change; the MAC/MIC analysis above is independent of the algorithm choice.

ФСТЭК threat modeling

The threat-model section above enumerates OAuth/OIDC-specific attacks (leakage, replay, mixed-IdP confusion, discovery tampering, PKCE downgrade, code replay, redirect hijack, CSRF, nonce omission, sub confusion, JWKS flooding, token-exchange evasion, clock skew). Mapping that enumeration to ФСТЭК’s “Методика оценки угроз” taxonomy is a deployment-specific documentation exercise; the raw facts are already here.

How this combines

A capOS deployment choosing a high-assurance profile selects:

  • X.805 dimensions to audit explicitly (all eight for a regulated service).
  • X.1254 LoA floor per resource (via minAuthStrength on seal policies and broker bundles).
  • A label lattice (confidentiality + integrity) and which IdP trust records can mint sessions at which labels.
  • Which token transitions are modeled as Declassify / Transfer / Control in the MAC/MIC automaton.
  • A concrete IdP trust bootstrap (manifest-pinned JWKS snapshot vs. discovery with pinned TLS root).
  • A concrete audit redaction and retention policy consistent with applicable regulation (ITU-T X.816, ФСТЭК guidance, GDPR, or sector-specific rules).

No kernel change is required to land any of these. Each choice narrows the behavior of userspace services — OAuthClient, OidcIdentityProvider, TokenVerifier, CredentialStore, SessionManager, AuthorityBroker, LabelAuthority, AuditLog — inside the same capability model.

Interaction with capOS Authority Model

OIDC and OAuth2 decide which external subject was authenticated and which scopes apply to this call. Admission policy decides which local principal, account, policy profile, and resource profile that external subject maps to. They do not decide which caps exist in the process. That remains the job of AuthorityBroker.

Practical flow:

  1. User authenticates to capOS via OIDC. SessionManager.login verifies the ID token and computes subjectHash = hash(providerKind, iss, tenant, sub).
  2. SessionManager resolves subjectHash through manifest seed admission, a local account-store ExternalIdentityBinding, or an explicit auto-creation rule. The result is a local or pseudonymous principal plus selected policy and resource profiles.
  3. SessionManager mints a UserSession whose PrincipalInfo.id is the resolved principal and whose authStrength derives from acr/amr.
  4. AuthorityBroker.request receives the session and any relevant access token. Scopes and OIDC claims are inputs to the RBAC/ABAC/MAC decision. They are never sufficient authority on their own.
  5. The broker returns a capability bundle (or denial). The access token is delivered inside an ApprovalGrant or a wrapper cap when the caller needs to invoke an external service; the raw bytes remain inside the OAuth service.
  6. For outbound calls to an OAuth-protected resource, the capOS service holds an AccessToken cap; it does not see the token string.
  7. For inbound calls, a capOS service configured as an OAuth2 resource server holds a TokenVerifier cap plus its AuthorityBroker cap; verification yields claims, and the broker converts claims into narrower caps for the call.

This is the same “decision returns a capability” pattern the user-identity proposal already uses for Cedar/OPA. OIDC just provides one more input shape.

Phases

Phases follow the consumers.

Phase 1 — IdP and client schemas, JWT verification

  • Add the schemas above to schema/capos.capnp.
  • Implement a RAM-only IdP cache that can load a discovery document and JWKS from a static test fixture and verify a sample ID token.
  • Implement JwtVerifier over PublicKey primitives from the key proposal using a vetted Rust crate (jsonwebtoken, biscuit, or a purpose-built verifier on top of rsa / ed25519-dalek / p256).
  • Host tests: signature verification across RS256/ES256/EdDSA, issuer/audience/exp checks, clock skew, algorithm allow-list.

Phase 2 — OAuth client and device code

  • OAuthClient with clientCredentials, refresh, and deviceCode grants.
  • Outbound HTTPS via the networking and certificate stacks (requires those to be real).
  • Console OIDC login proof: QEMU serial starts startDeviceCode, an operator completes the flow out-of-band, pollDeviceCode returns a bundle, verifyIdToken succeeds, a manifest-seeded external admission rule selects policy/resource profiles, and a UserSession is minted.

Phase 3 — Authorization code + PKCE

  • Web text shell gateway redirects to the IdP and consumes the returned code.
  • startAuthCode / completeAuthCode integrated with the gateway’s HTTP listener.
  • Per-session nonce, state, and PKCE verifier all live in AuthCodeState.opaque.

Phase 4 — Resource server verification

  • TokenVerifier.verifyAccess with JWKS refresh and introspection-endpoint fallback for opaque tokens.
  • Policy enforcement: required scopes, audience binding, cnf confirmation (DPoP or mTLS).

Phase 5 — Workload identity federation

  • WorkloadIdentityFederation with subject sources for GCP and AWS.
  • Depends on InstanceIdentity from cloud-metadata and a working outbound TLS client.
  • CloudKmsKeySource gains a no-baked-credentials unlock path.

Phase 6 — Private key client auth and DPoP

  • ClientAuthMethod.privateKeyJwt using JwtSigner.
  • DpopSigner + ConfirmationKind.dpop in TokenVerifier.
  • RFC 9449 nonces when the resource server supports them.

Phase 7 — mTLS-bound tokens and extended federation

  • ClientAuthMethod.tlsClientAuth per RFC 8705.
  • Attestation-report-backed federation (SubjectSource.attestationReport) for confidential computing.
  • CIBA grant (RFC 9126 + OpenID CIBA) if a deployment needs it for step-up on mobile devices.

Phase 8 — Token exchange as a first-class broker input

  • AuthorityBroker accepts an AccessToken or IdToken plus scopes as policy input; decisions can return narrowed access tokens alongside narrower caps.
  • Account-store-backed ExternalIdentityBinding records replace manifest-only external admission for ordinary federated logins. Unknown external subjects are denied unless an explicit auto-creation rule names policy and resource profiles.
  • Per-user EncryptedNamespace unlock via OidcFederatedKeySource (defined in the key-management proposal) using the user’s current access token as unlock context.

Phase 9 — Local IdP (optional, deferred)

  • A LocalIdentityProvider cap that issues tokens to other capOS services on the same host or fleet, signed by a JwtSigner backed by a KeyVault-stored PrivateKey. Useful for air-gapped deployments and for bootstrapping workload federation between two capOS instances. Not in v1.

Relationship to Other Proposals

  • Cryptography and Key Management — supplies PrivateKey / PublicKey / KeyVault / SignatureScheme / AsymmetricAlgorithm, the KeySource family this proposal’s JwtSigner binds to, and the KeyPurpose.oauthClientAssertion (RFC 7523 private_key_jwt) and KeyPurpose.oidcIdToken (LocalIdentityProvider mint path) values that constrain how those keys may be used. That proposal also defines KeySourceKind.oidcFederated and the corresponding OidcFederatedKeySource (Phase 6b there), plus SealPolicy.tokenExchange, which together let EncryptedNamespace and other sealed payloads unlock against an AccessToken minted here instead of a baked credential.
  • Certificates and TLS — supplies the TlsClientConfig consumed by OIDC discovery, JWKS, token, introspection, revocation, and IdP admin endpoints, and the PinSet composed in per-IdP trust records so issuer roots and SPKI hashes stay isolated from the ambient WebPKI store. The two proposals meet at three X.509 corners: ClientAuthMethod.privateKeyJwt carries an X.509 Certificate cap when the IdP requires a cert-bound assertion; ClientAuthMethod.tlsClientAuth (RFC 8705) consumes a PKI-rooted client Certificate plus PrivateKey; and ClientAuthMethod.selfSignedTlsClientAuth (RFC 8705 §2.2) uses a self-signed Certificate published in OAuthClientMetadata. The certificates proposal handles X.509 verification; this proposal owns the resulting token-typed capabilities.
  • Boot to Shell — device code and authorization code grants are SessionManager.login methods. CredentialStore stores IdP trust records (issuer URL, JWKS, allowed audiences) alongside password verifiers and passkey public credentials.
  • Shell — the authority broker consumes access tokens as ABAC input; the agent shell holds ApprovalGrant wrappers, not raw tokens.
  • User Identity and Policy — owns the canonical external subject key subjectHash = hash(providerKind, issuer, tenant, subject), the ExternalIdentityBinding record (mapped from subjectHash to a local or pseudonymous PrincipalInfo.id plus named policy and resource profiles), and the three admission sources SessionManager consults after verifyIdToken succeeds: manifest seed admission, local account-store bindings, and explicit pseudonymous auto-creation rules. OAuth scopes and OIDC claims (acr, amr, groups, tenant) are normalized ABAC attributes fed to AuthorityBroker / PolicyEngine, never authority on their own; AuthStrength derives from acr/amr through the deployment-configured mapping in OAuthClientMetadata.
  • Volume Encryption — OIDC-gated KMS unlock replaces baked IAM credentials; per-user EncryptedNamespace unlock uses OidcFederatedKeySource.
  • Cloud MetadataInstanceIdentity is the primary subject token source for workload identity federation. The current proposal’s own SubjectSource.instanceIdentityJwt is implemented by that cap.
  • Networking — outbound OAuth calls use a userspace HTTP/TLS client built over the networking stack. Service-to-service OAuth coexists with mTLS as two delegation patterns rather than competing ones.
  • System Monitoring — every verifyIdToken, token issuance, refresh, exchange, and verifyAccess flows through the audit cap. Redaction rules from the boot-to-shell proposal apply: claim summaries and token references, never raw tokens.
  • Security and Verification — JWT/JWS/JWE parsers are classic fuzz targets; PKCE and device-code state machines are Loom candidates; token-exchange policy evaluation is a Kani candidate.
  • Live Upgrade — the OAuth service holds sensitive live state (refresh tokens, DPoP private keys, PKCE verifiers). Live upgrade needs a state-transfer path that does not leak tokens through shared memory.

Open Questions

  1. Do we ship our own OIDC RP implementation or wrap an existing Rust crate? openidconnect-rs, oauth2-rs, and biscuit are candidates. The schema boundary is independent; the implementation choice affects TCB size and audit surface.
  2. Opaque access token handling. Some IdPs issue opaque tokens validated only by introspection (RFC 7662). Latency and load on the introspection endpoint are operational concerns; caching introspection responses is fiddly (when is the cache allowed to serve stale “active”?). Probably: support introspection with short cache TTL and per-policy opt-in.
  3. PKCE-less legacy clients. A deployment against an old IdP that cannot do PKCE. Do we allow a config escape hatch, or do we refuse to boot? Leaning “refuse” given OAuth 2.1 guidance.
  4. DPoP nonce plumbing. Server-issued nonces (RFC 9449 §8) require the caller to retry after the first 401 with the returned nonce. Fits naturally in a wrapper cap around AccessToken, but the retry policy on non-idempotent methods needs a clear rule.
  5. Device code on air-gapped consoles. Device code presumes the user has another device with a browser. Pure-air-gapped hosts must fall back to password + passkey; what about console-only OIDC without internet? Probably: no-op; offline OIDC is an oxymoron, use local auth.
  6. How do tokens transfer across capability boundaries? Per- consumer down-scoped issuance is the default. Should AccessToken.attenuate be a kernel-level badge, a userspace wrapper cap, or both depending on whether attenuation is server-side (token exchange) or client-side (scope subset)?
  7. Logout semantics. OIDC end-session endpoints are optional and frequently inconsistent across IdPs. When UserSession.logout fires, what is the best-effort expectation: local session drop + IdP revoke + RP-initiated logout redirect? Document a clear failure mode for each step.
  8. Default audiences for AuthorityBroker decisions. When the broker down-scopes an access token, what audience does the narrower token target — always the resource server the broker just returned a cap for? Or a list, for broker decisions that return a compound bundle? Probably: one audience per CapRequest, bundles emitted as multiple broker responses.
  9. External auto-creation policy. Which OIDC providers may create pseudonymous local accounts, which policy/resource profiles may they name, and what rollback/recovery record proves the mapping was not replayed from stale account-store state?
  10. Support for JAR / PAR / JARM. Pushed Authorization Requests and JWT-Secured Authorization Response Mode are increasingly expected by enterprise IdPs. Phase 3 should support PAR; JAR and JARM can follow.
  11. Clock source. OIDC verification depends on a reliable clock. Before the Timer capability and a cloud attested-time source exist, verifyIdToken must either fail closed or consume a bootstrap clock from the manifest. Document the first-boot behavior.
  12. Key binding for user sessions. Should a UserSession be bound to a DPoP key by default (so a leaked session ID is useless without the key), or is that overkill for console sessions? Probably: yes for web gateway sessions; no for direct local console sessions where session state never leaves the host.
  13. GOST / jurisdictional OIDC. Some deployments mandate GOST-signed JWTs (GOST R 34.10-2012 on the JWT signature). Adding the algorithms to SignatureScheme is schema-level; validating a GOST-signed discovery document requires matching trust-store support in the certificates proposal. Track, do not block.

Proposal: Volume Encryption

Encrypting system and user volumes in a capability OS where storage is already a stack of typed capabilities and keys can be first-class capability objects.

Problem

capOS currently has no persistent storage, no crypto, no TPM driver, and no block-device drivers. That is the right moment to decide what encryption-at-rest looks like, before storage interfaces and service graphs harden around plaintext assumptions.

Traditional OSes bolt encryption on as a kernel subsystem (dm-crypt/LUKS, BitLocker, FileVault, fscrypt). That choice follows from those kernels’ architecture: the kernel owns block I/O, the filesystem, the keyring, and the trust domain between processes, so encryption logically lives there too. capOS has made the opposite bet — the kernel is a capability router, block I/O lives in userspace services, filesystems are userspace services, and there is no ambient keyring because there is no ambient anything.

Putting crypto in the kernel would contradict Design Principle 5 (“the kernel is becoming a capnp-rpc router”) and Principle 7 (“pragmatic reuse” — let userspace crates do what they already do well). Putting it nowhere leaves the system unable to protect data at rest. The proposal below places encryption in userspace services expressed as capabilities, with no new kernel mechanism.

Threat Model

Four attackers worth distinguishing up front, because the defenses differ:

  1. Offline disk theft. Attacker has the storage medium, no live system, no running key service, possibly no hardware attestation. Ciphertext must reveal nothing about plaintext beyond length and block boundaries.
  2. Ciphertext tampering at rest. Attacker can write to the medium and hopes to flip ciphertext bits to produce attacker-chosen plaintext changes (classic XTS malleability). Modification must be detected, not merely scrambled.
  3. Peer userspace service holding the raw BlockDevice cap. The virtio-blk driver, a backup agent, a telemetry exporter, or any service that is on the physical I/O path. They hold authority to read sectors but must not see plaintext for volumes whose key they do not hold.
  4. Compromised session with a live key cap. Once an attacker is inside a user’s session and holds the user’s SymmetricKey cap, that user’s data is lost. The goal is lateral containment: no cross-user leverage, no escalation to the system volume, no access to other sessions’ keys.

Out of scope for a first pass:

  • Cold-boot RAM attacks and side channels (mitigation: use TPM-bound keys when available, but physical memory reads against a running host are not defended).
  • Evil-maid attacks on the unencrypted portion of the boot image (addressed separately by secure boot / measured boot — see Storage and Naming Open Question #5).
  • Traffic analysis against encrypted backups or encrypted replication.
  • Key escrow for legal recovery. capOS takes no position; a deployment can add an escrow KeySource without changing the model.

Keys Are Capabilities

Key material never crosses cap boundaries. Callers hold SymmetricKey or PrivateKey capabilities whose methods run inside the service that holds the key; the holder gets encrypt/decrypt/sign authority, not the bytes. Attenuation (decrypt-only, AAD-pinned, purpose-bound) is wrapper CapObjects, the same mechanism that builds read-only Files.

This proposal does not define those interfaces. They belong to Cryptography and Key Management, which covers SymmetricKey, PrivateKey/PublicKey, KeySource, KeyVault, algorithm and purpose enums, seal policies, and the set of concrete key sources (manifest-embedded, passphrase, passkey PRF, TPM 2.0, cloud KMS, attestation, network, software-stored). Volume encryption is one consumer among many.

Layer Placement

Two layers exist, and a first-class design uses both.

Layer A — EncryptedBlockDevice (LUKS analog)

A userspace service holds two caps — BlockDevice (raw) and SymmetricKey — and exports a new BlockDevice cap that looks identical to its input but encrypts writes and decrypts reads transparently. Everything above the wrapper (filesystems, the Store service, content-addressed backends) is oblivious.

Raw block device
  → virtio-blk / NVMe driver → BlockDevice cap (ciphertext)
    → EncryptedBlockDevice service holds [BlockDevice + SymmetricKey]
      → BlockDevice cap (plaintext-view)
        → FAT / ext4 / Store service
          → File / Directory / Namespace caps
            → App

Properties:

  • One key per volume (or per-range, see “Key hierarchy” below).
  • Granularity is a sector/block. Metadata in the filesystem layer is encrypted along with data — the shape of the directory tree is invisible to threat #3.
  • Incompatible with zero-copy device DMA into user pages (see “SharedBuffer” below).

Layer A defends against threats #1, #2, and #3.

Layer B — per-user Namespace / Directory encryption (fscrypt analog)

Layered above a filesystem or Store, Layer B encrypts object contents and, optionally, object names, using a per-user key. The underlying block device may or may not also be encrypted.

BlockDevice (ciphertext or plaintext)
  → Store service → Store/Namespace caps (ciphertext objects)
    → EncryptedNamespace service holds [Namespace + UserKey]
      → Namespace cap (plaintext-view)
        → User's session services

Properties:

  • One key per user (or per session, per device, per tenant).
  • Metadata at the filesystem/Store layer is visible to threat #3 unless Layer A is also in place.
  • Cap boundaries are naturally per-user — revocation is “drop the cap,” no filesystem rekeying.
  • Compatible with shared filesystems across users (per-entry encryption).

Layer B defends primarily against #4-lateral (a compromise of user Bob’s session does not reveal user Alice’s data) and against a compromised shared filesystem service when the underlying block layer is unencrypted.

Recommendation

Use both. Layer A for the system volume and for the per-tenant block substrate in multi-tenant deployments; Layer B for per-user data on top of a shared filesystem or store. Users who run single-tenant desktops can skip B. Cloud VMs that rely on provider-side encryption of block storage (see “Cloud integration”) can skip A and keep B. The proposal does not mandate either layer; it standardizes the interface so both compose.

Volume-Specific Schemas

SymmetricKey, KeySource, KeyAlgorithm, KeyPurpose, and SealPolicy are defined in Cryptography and Key Management. This proposal adds only the wrapper-factory and on-disk-format schemas.

EncryptedBlockDevice

Exposes nothing new — it implements the existing BlockDevice interface. The distinction is where it sits in the cap graph. A factory cap creates it:

interface EncryptedBlockDeviceFactory {
    open @0 (raw :BlockDevice, key :SymmetricKey, format :VolumeFormat)
         -> (plain :BlockDevice);
    format @1 (raw :BlockDevice, key :SymmetricKey, params :FormatParams)
           -> (plain :BlockDevice);
}

struct VolumeFormat {
    superblock     @0 :Data;  # read from raw device during open()
    algorithm      @1 :SymmetricAlgorithm;  # defined in key-management proposal
    sectorSize     @2 :UInt32;
    tagAreaLayout  @3 :TagAreaLayout;
}

Cryptographic Construction

Two separate questions — block layer and object layer — with different answers.

Block layer (Layer A)

Requirement: authenticate every block. XTS alone is not enough; it defends against #1 but not #2.

Shortlist:

  • AES-256-GCM-SIV with LBA-derived nonce + separate tag area. The nonce is HMAC(K_nonce, LBA) (deterministic, no extra storage). The tag (128 bits) is stored in a reserved tag area, either a sidecar journal (dm-integrity style) or a reserved footer per block group. Cost: ~3% storage overhead for the tag, one extra read/write to the tag area per I/O (usually absorbed by sector grouping). Defends against #1 and #2.
  • XChaCha20-Poly1305 with random nonce + tag. Same tag-storage problem as GCM-SIV; XChaCha’s 192-bit nonce removes nonce-reuse concerns entirely. Slower than AES on hardware that has AES-NI, faster on hardware that doesn’t (e.g. low-end ARM).
  • AES-256-XTS alone. The LUKS1/LUKS2 default. Reject this as the sole defense; it fails #2. May still be useful as a building block under an external MAC (dm-integrity + dm-crypt in Linux).
  • Wide-block constructions (HCTR2, Adiantum). Length-preserving, no MAC. Better diffusion than XTS but still fail #2. Useful only when storage overhead for tags is unacceptable and tamper-detection is being provided elsewhere.

Recommendation: AES-256-GCM-SIV with LBA-derived nonce and a dedicated tag area, fallback to XChaCha20-Poly1305 on hardware without AES-NI. Document the tag-area layout in VolumeFormat; don’t invent a scheme per deployment.

Object layer (Layer B)

Requirement: per-object authentication; compatibility with content-addressed storage where possible.

Options, with the honest tradeoffs:

  • Per-tenant keys, hash(ciphertext) as address. Each user’s Store encrypts with their key. Dedup works within a volume, not across. Metadata (object size, access patterns) is visible to a peer holding the backing BlockDevice. This is the recommended default.
  • Per-tenant keys, HMAC(K, plaintext) as address. Address derived deterministically from plaintext allows a user to look up their own objects by plaintext hash without scanning. Same cross-tenant properties as above.
  • Convergent encryption (key = hash(plaintext)). Global dedup across users, but leaks equality: “user X holds the same file as user Y.” Rejected as a default; too much leakage for a capability-based OS that treats ambient authority as a bug.

All three use an AEAD (GCM-SIV or XChaCha20-Poly1305) per object with a random nonce stored with the object.

System Volume Flow

  1. Boot firmware loads Limine, which loads the kernel + init + boot services from an unencrypted boot partition.
  2. Kernel spawns init. Init spawns a minimal service graph: block device driver, console service, KeySource service (one of passphrase / TPM / cloud KMS / manifest-embedded), and the EncryptedBlockDeviceFactory service.
  3. Init obtains the unlock context. For interactive boot: read a passphrase via the console login flow in Boot to Shell. For unattended boot: invoke TPM unseal, KMS decrypt, or an attestation protocol. Contexts that require networking (cloud KMS, Tang) come up after the network stack.
  4. Init hands (BlockDevice, SymmetricKey) to EncryptedBlockDeviceFactory.open and receives a plaintext-view BlockDevice.
  5. Init hands that BlockDevice to the filesystem or Store service, which becomes the system storage root.
  6. Init pivots to the services graph baked in the now-readable system volume. Services that do not need direct I/O never see a raw BlockDevice and therefore never see ciphertext.

Analogous to Linux’s initramfs pattern, but with capabilities instead of /dev paths.

User Volume Flow

  1. User authenticates through the login flow in Boot to Shell. Success yields a session and a CredentialStore response.
  2. SessionManager invokes the user’s KeySource — passkey PRF, password-derived, or cloud-held — yielding a user SymmetricKey.
  3. SessionManager hands (UserNamespace, UserKey) to an EncryptedNamespaceFactory.open and receives a plaintext-view Namespace.
  4. The plaintext Namespace is installed in the session’s CapSet. Services in the session see only the user’s decrypted view.
  5. On logout, the session is torn down; the user SymmetricKey cap is released; the key service’s in-process material is zeroized. EncryptedNamespace stops decrypting. Ciphertext remains intact on disk.

Revocation is a cap-drop, not a filesystem rekey.

SharedBuffer and DMA

SharedBuffer (docs/roadmap.md Stage 6 / MemoryObject) exists so devices can DMA directly into app pages. Software block encryption is inherently incompatible with that: the device writes ciphertext; the app expects plaintext.

Three honest answers:

  1. Extra copy. Driver DMAs into a scratch page held by the EncryptedBlockDevice service, which decrypts into the app’s SharedBuffer. One extra copy per I/O. Simple; correct; first implementation. Cost is dominated by the crypto itself, not the copy, for typical I/O sizes.
  2. Decrypt in place. Device DMAs ciphertext into the app’s SharedBuffer; the service decrypts it in-place before completion is posted. Saves a copy, keeps CPU crypto on the hot path, and complicates reuse of the buffer (the app sees ciphertext briefly, then plaintext). Viable once the buffer lifetime is well-specified.
  3. Hardware inline crypto. NVMe OPAL, SED drives, Intel CSE, AES-XTS block engines on some ARM SoCs. Device sees the key; DMA paths see plaintext; software sees an unencrypted-looking device. Different trust model — the device is now in the TCB — and different key-provisioning story (IEEE 1667 / TCG Opal PSID). Note for future work; not a first-implementation target.

First implementation: #1. Revisit #2 when I/O performance matters. Treat #3 as a separate capability shape (SelfEncryptingBlockDevice) rather than a flag on the main interface.

Boot Order and the Unencrypted Boot Partition

By construction there must be an unencrypted partition containing at least: Limine, kernel, init, the block device driver, the key-source service(s), the encrypted block device factory, and — if the key source requires it — a minimal networking stack.

This partition is the trust root for the whole system. It does not need to be encrypted, because its contents are either integrity-protected by a measured-boot chain or considered public anyway (the capOS binaries are open source). It does need to be integrity-protected, which is secure boot / measured boot — addressed in Storage and Naming Open Question #5 and not duplicated here.

Relationship to that question: a TPM-sealed KeySource requires measured boot to be useful. Without measurement, a tampered boot partition can unseal the key under attacker-controlled code. A passphrase KeySource does not require measured boot, only the expectation that the user will notice if the boot UI looks wrong. A cloud KMS KeySource relies on cloud-provider instance identity, which is a parallel trust story (see below).

Cloud Integration

Cloud environments change every part of this picture: the block device is virtual, the key store is a network service, instance identity is provider-signed, object storage exists as a first-class primitive, and backups are a product, not a script. capOS should treat each of these as a capability and reuse them.

Cloud block storage (EBS, GCP Persistent Disk, Azure Disk)

These volumes are already encrypted at rest by the provider. The question is whose key performs the encryption:

ModelProvider sees plaintext?Customer controls key?Customer does crypto?
Provider-managed (default)Yes (plaintext in volume)NoNo
Customer-managed (CMEK)Yes (plaintext in volume)Yes (via KMS)No
Customer-supplied (CSEK)Briefly, during requestYesNo
Client-side (Layer A)NoYesYes

capOS’s BlockDevice cap is indifferent to which of the first three the provider is doing. For the fourth — client-side encryption — capOS wraps the provider’s BlockDevice cap in its own EncryptedBlockDevice. The provider sees only ciphertext and cannot read the volume even with a compelled-disclosure order.

Deployment guidance:

  • Untrusted provider / compliance-driven: Layer A over cloud block storage. Provider-side encryption becomes a belt-and-braces redundancy.
  • Trusted provider / operational simplicity: rely on CMEK, skip Layer A. Capability model still contains peer services — a compromised capOS service does not get raw block I/O unless it holds the cap.
  • Confidential-computing VMs (SEV-SNP / TDX / Nitro): use Layer A with an attestation-gated KeySource. The attestation report proves the VM is genuine and running approved code; KMS releases the DEK only against a valid report.

Cloud KMS (AWS KMS, GCP KMS, Azure Key Vault, Vault, …)

Envelope encryption is the universal pattern: the cloud KMS holds a key-encrypting key (KEK) with tight IAM-bound access; the actual data-encrypting key (DEK) is generated by capOS, wrapped by the KEK, stored alongside the ciphertext, and unwrapped by KMS at unlock time.

Map to capabilities:

  • A CloudKmsKeySource service implements KeySource. unlock(blob) sends the wrapped DEK to KMS for Decrypt, receives the plaintext DEK, constructs a local SymmetricKey cap around it, and returns it.
  • The service authenticates to KMS using the VM’s instance identity, obtained from a CloudMetadata-derived InstanceIdentity cap (see Cloud Metadata). No long-lived credentials are baked into the image.
  • seal(key, KmsPolicy{kmsKeyId, grant}) calls KMS Encrypt to wrap the key under the named KEK and returns the opaque blob.
  • KMS audit logs record every unwrap. This is a free observability win capOS inherits by delegation; nothing in the OS needs to log key usage separately.

Benefits of envelope encryption that capOS gets by following the pattern:

  • Free KEK rotation. Rotating the KEK requires only re-wrapping the DEK (fast, metadata-only). The DEK itself stays; the volume is not rewritten. A rewrap method on KeySource makes this explicit.
  • Revocation. Disable the KMS key or revoke the IAM grant; the next unlock fails. Running instances with a cached DEK continue until reboot — matches Linux behavior.
  • Cross-region / cross-account access. KMS grants move ciphertext-readable capability between accounts without handing over the key material. capOS reads that as “the receiving account holds a KeySource cap whose policy the grant satisfies.”

Non-AWS KMS providers (Vault, HSM clusters, KMIP devices) fit the same interface. The CloudKmsKeySource service name is a placeholder; production likely wants one service per provider, or one generic service with a provider-selection parameter.

Instance identity and attestation

Cloud VMs authenticate to KMS without baked-in credentials because the hypervisor signs identity tokens. AWS IMDSv2, GCP metadata identity tokens, and Azure IMDS all produce short-lived signed JWTs. Confidential-computing platforms extend this with hardware attestation reports (SEV-SNP, TDX, Nitro).

An InstanceIdentity capability — carved out of Cloud Metadata — exposes these token and attestation paths. Key-source services consume that cap instead of pulling from an ambient metadata endpoint. Revoking a service’s access to the metadata service becomes a cap-graph edit: no firewall rules, no iptables on 169.254.169.254.

OIDC-gated volume unlock (workload identity federation)

InstanceIdentity is the raw material. Modern clouds consume it through OIDC token exchange (RFC 8693) rather than a provider- specific identity API. That pattern is defined in OIDC and OAuth2 as WorkloadIdentityFederation; volume encryption consumes it through OidcFederatedKeySource (see Cryptography and Key Management).

System-volume flow:

  1. Boot the key-less image. init starts the block driver, the metadata service, and the OAuth service, but never holds raw cloud credentials.
  2. CloudMetadata returns an InstanceIdentity cap (a signed JWT from the hypervisor).
  3. WorkloadIdentityFederation.exchange posts that JWT to the cloud STS with grant_type = urn:ietf:params:oauth:grant-type:token-exchange and subject_token_type = urn:ietf:params:oauth:token-type:jwt. It receives a short-lived cloud access token bound to the instance’s identity.
  4. OidcFederatedKeySource uses that access token to authenticate a Decrypt call on the wrapped DEK at the cloud KMS. The plaintext DEK returns as a SymmetricKey cap.
  5. EncryptedBlockDeviceFactory.open composes that key with the raw BlockDevice and returns a plaintext-view BlockDevice.

Per-user volume flow (Layer B):

  1. Alice authenticates through console or web shell OIDC; the IdP issues an ID token and an access token.
  2. SessionManager mints her UserSession; her AccessToken cap is handed to OidcFederatedKeySource wrapped inside the broker- returned session bundle — never as a bearer string.
  3. The key service enforces SealPolicy.tokenExchange { issuer, audience, subjectPattern, requiredClaims, minAuthStrength }. It verifies the access token (or an ID token it exchanges for) against its pinned IdP trust record and only then releases Alice’s DEK.
  4. EncryptedNamespaceFactory.open yields Alice’s plaintext namespace. Logout drops the cap; the in-process key material zeroizes.

Properties this adds on top of plain CloudKmsKeySource:

  • No long-lived IAM credentials anywhere in the image. The historical instance-role access-key pair is gone; what remains is a short-lived access token tied to the live workload.
  • Audit keyed on principal. Cloud KMS logs the OIDC sub of every Decrypt, so “Alice’s laptop unlocked her volume at 09:14” is observable without extra audit glue.
  • Step-up authentication on the unlock path. TokenExchangePolicy.minAuthStrength maps to X.1254 LoA. A volume requiring loa3 cannot be unlocked by a passwords-only session.
  • Revocation through IdP or KMS. Disable Alice at the IdP or revoke the IAM grant and the next unlock fails. Cached DEKs in running instances survive until reboot — identical to today’s cloud KMS semantics but explicit.

Token TTL vs. cached DEK

OIDC access tokens typically expire in minutes; DEKs typically live for as long as a volume is mounted. OidcFederatedKeySource.unlock is called once per mount; the DEK cap is held by the encrypted block/namespace service until mount ends. Token expiry after unlock does not re-lock the volume. This matches every other KMS-unwrap pattern (CloudKmsKeySource, Tpm2KeySource), but it is worth saying aloud: short-lived tokens give short-lived authorization freshness, not short-lived key availability. Deployments that want stricter revocation can:

  • require periodic re-unlock (re-mount) via broker policy,
  • keep the volume mounted read-only by default and require a fresh token for each write window,
  • or use a confidential-computing + attestation-gated KEK that the hardware refuses to re-release on policy change.

No baked credentials policy

The capOS ISO must contain neither a long-lived cloud IAM credential nor a long-lived bearer token. ManifestEmbeddedKeySource remains dev/CI only. Production builds pass through one of: Tpm2KeySource, AttestationKeySource, CloudKmsKeySource (instance-identity flow), or OidcFederatedKeySource (workload-federation flow). The manifest validator should refuse a production-profile image that embeds a symmetric volume key or a long-lived cloud credential.

Object storage (S3, GCS, Azure Blob)

Object storage is a natural backend for the capability-native Store. The Store service holds an S3Bucket cap, serializes capnp messages as S3 objects keyed by their content hash, and exports Store / Namespace caps to clients.

Encryption trust tiers mirror block storage:

ModelProvider sees plaintext?Customer key?Customer does crypto?
SSE-S3YesNoNo
SSE-KMSYesYes (KMS)No
SSE-CBrieflyYesNo
Client-side (Layer B in Store)NoYesYes

Client-side is the interesting case for capOS. The content-addressed Store can encrypt each blob with a per-tenant DEK before upload, keying objects by hash(ciphertext) or HMAC(K, plaintext). The DEK is wrapped by cloud KMS; the bucket can be world-readable without leaking plaintext. This is a deployment where “the provider stores our data” and “the provider cannot read our data” coexist.

Nonce management across objects becomes the main design question. Either:

  • random 192-bit nonce per object (XChaCha), stored as an object header; or
  • derived nonce from object identity (HMAC(K_n, object_id)), requires that the same plaintext object is never uploaded twice under the same key, which is consistent with content-addressing semantics.

Backups

Backups are where encryption choices pay off or hurt:

  • Block-level snapshot / cross-region replication. The provider handles it. A snapshot of a Layer-A-encrypted EBS volume is ciphertext; restoring requires the KMS key. Cross-region replication requires the key to be grant-accessible in the target region. Free; handled by the provider.
  • Application-level backup service. A backup service holds a Store or Directory cap, reads objects, writes them to an object-storage bucket, and records the backup manifest. If Layer B is in place, the backup bytes are already encrypted — no re-encryption needed, and the backup destination does not need the user’s key. If only Layer A is in place, the backup service sees plaintext because Layer A wraps below the Directory; the backup service must re-encrypt for the destination.
  • Restore to a different account / region / capOS install. The key must be reachable in the target environment. For KMS-wrapped DEKs: cross-account grants, multi-region KMS keys, or replicated key material. For TPM-sealed DEKs: explicit re-seal to the target TPM before restore. capOS does not need to implement this directly; it needs the KeySource abstraction to not hide the provider-specific primitives that enable it.

A backup KeyPolicy worth documenting: “this key is usable in regions A, B, and C, wrapped under KMS keys k_a, k_b, k_c, all granting access to the instance identity role backup-reader.” This is routine on AWS and routinely surprising to people who expect Linux dm-crypt semantics.

Keys never in the image

The capOS ISO must never contain production keys. The ManifestEmbeddedKeySource (key-management proposal) exists for development and CI only; the manifest validator should refuse to boot from an image that embeds a non-development key on a production-profile manifest. The production flow is always: boot from a key-less image, obtain identity from the cloud, fetch the wrapping policy from the cloud, unwrap a DEK via KMS, mount the volume. Same property as AWS’s “EBS with KMS requires no bootstrap secrets on the instance.”

Confidential computing

SEV-SNP, TDX, and AWS Nitro Enclaves produce attestation reports that include measurements of the VM image. A KMS policy can require a matching attestation before releasing the wrapping key. In capOS:

  • AttestationService exposes attestation(nonce) -> report (the report includes the image measurement, firmware version, and VM metadata signed by the hardware root of trust).
  • KeySource of kind attestation collects the report and submits it as part of the KMS Decrypt request; KMS enforces the policy server-side.
  • The trust story becomes: “this capOS image, unmodified, running on genuine SEV-SNP / TDX / Nitro hardware, is the only thing that can unlock this volume.” That is materially stronger than instance-identity alone.

This composes cleanly with Layer A: the confidential VM reads ciphertext from a cloud disk, unwraps the DEK via attestation-gated KMS, and decrypts locally. The cloud provider never sees plaintext and a stolen snapshot cannot be decrypted outside the attested VM.

Phases

No implementation exists. Phases here cover only the volume-specific work; the underlying key abstractions, key sources, and KMS integration are phased in Cryptography and Key Management. Volume encryption tracks, but does not duplicate, that sequence.

Phase V1 — EncryptedBlockDevice over RAM block device

  • Add EncryptedBlockDeviceFactory, VolumeFormat, TagAreaLayout, and FormatParams to schema/capos.capnp.
  • Wire the service between a RAM-backed BlockDevice and the Store or a toy FAT reader. Key source is ManifestEmbeddedKeySource from the key-management proposal’s Phase 1.
  • Implement AES-256-GCM-SIV with a reserved tag area; document the on-disk format (superblock, tag area layout, block size).
  • Measurement: demonstrate a Store survives a ciphertext read of the raw RAM disk and fails decrypt after a flipped bit.

Phase V2 — EncryptedNamespace and user-volume path

  • Add EncryptedNamespaceFactory schema.
  • Layer B over a RAM-backed Store. Depends on PassphraseKeySource (key-management Phase 4) and PasskeyPrfKeySource once passkey infrastructure lands.
  • Revocation tests: dropping a session’s key cap renders the namespace unreadable without rebooting.

Phase V3 — Persistent storage integration

  • Promote Phase V1 from RAM disk to virtio-blk.
  • System volume unlock in the normal boot path. Default dev build uses a manifest-embedded key; production build requires passphrase/TPM/KMS.
  • QEMU smoke: system volume encrypted with a passphrase, reboot survives, wrong passphrase fails closed.

Phase V4 — TPM-backed system volume

  • Depends on Tpm2KeySource from key-management Phase 5.
  • Measured-boot chain: firmware, bootloader, kernel, init, key service. PCR composition for a sealed system volume documented.

Phase V5 — Cloud deployment

  • Depends on CloudKmsKeySource from key-management Phase 6.
  • Client-side encrypted block volume over cloud block storage.
  • Optional: client-side encrypted Store backend over object storage.

Phase V5b — OIDC-federated unlock

  • Depends on OidcFederatedKeySource from key-management Phase 6b and on WorkloadIdentityFederation from OIDC and OAuth2 Phase 5.
  • System volume unlocks through token-exchange against the cloud STS; no long-lived IAM credentials in the image.
  • Per-user EncryptedNamespace unlocks from a user AccessToken under SealPolicy.tokenExchange.
  • QEMU smoke against a local fake STS (e.g. dex) proves the flow end-to-end before targeting a real cloud.

Phase V6 — Confidential computing

  • Depends on AttestationKeySource from key-management Phase 7.
  • Attestation-gated system volume unlock on SEV-SNP / TDX / Nitro.
  • QEMU SEV-SNP smoke (where toolchain supports it).

Relationship to Other Proposals

  • cryptography-and-key-management-proposal.md — primary dependency. Volume encryption consumes the SymmetricKey, KeySource, KeyVault, KeyAlgorithm, KeyPurpose, and SealPolicy primitives defined there (see that proposal’s “Schemas” → “Symmetric keys” and “Key lifecycle — the KeyVault” sections) along with the concrete ManifestEmbeddedKeySource, PassphraseKeySource, PasskeyPrfKeySource, Tpm2KeySource, CloudKmsKeySource, OidcFederatedKeySource, and AttestationKeySource implementations under “Concrete Key Sources”. This proposal adds only the volume-specific wrapper factories (EncryptedBlockDeviceFactory, EncryptedNamespaceFactory), the on-disk VolumeFormat / TagAreaLayout, and the block- and object-layer cryptographic constructions; it does not redefine any key, algorithm, purpose, seal-policy, or key-source shape.
  • storage-and-naming-proposal.md — Open Question #5 (manifest trust and secure boot) is a prerequisite for a TPM-sealed KeySource to be meaningful. This proposal extends the storage stack with EncryptedBlockDevice and EncryptedNamespace as optional wrapper services; the BlockDevice, File, Directory, Store, and Namespace interfaces are unchanged.
  • boot-to-shell-proposal.md — the passphrase / passkey unlock path at the console and in the web gateway feeds KeySource implementations. CredentialStore, SessionManager, and AuthorityBroker already think about missing credentials not implying an unlocked system; this proposal extends that to “missing key source implies missing system volume, not zero-fill.”
  • user-identity-and-policy-proposal.md — user-volume keys are bound to session identity. The cap chain that yields “you are Alice” also yields Alice’s KEK.
  • cloud-metadata-proposal.mdCloudMetadata and the InstanceIdentity cap carved out of it are what the cloud KeySource implementations consume to authenticate to KMS without baked-in credentials.
  • oidc-and-oauth2-proposal.md — the WorkloadIdentityFederation and token-exchange primitives behind OidcFederatedKeySource. Also the source of the AccessToken / IdToken cap shape used in per-user volume unlock and the policy inputs consumed by SealPolicy.tokenExchange.
  • cloud-deployment-proposal.md — owns the cloud KMS reasoning this proposal builds on. Its “Managed Application Services” and “GCP Cloud KMS And IAM Notes For Adventure Saves” sections describe the envelope-encryption pattern (KMS holds the KEK; capOS generates the DEK and wraps it; KMS Encrypt/Decrypt unwraps on demand) and the IAM/grant model that CloudKmsKeySource and OidcFederatedKeySource plug into. Its NVMe phase (“Phase 5: NVMe Driver”) and SED/Opal notes set the ground for a future SelfEncryptingBlockDevice capability with hardware inline crypto, distinct from this proposal’s software-crypto Layer A and with a different TCB story (the device is in the TCB).
  • security-and-verification-proposal.md — the encrypted block format is a good target for the tiered tooling plan: fuzz corrupted ciphertext at the block boundary, proptest round-trips through the wrapper, Loom-model the volume unlock state machine, Kani-prove LBA-nonce uniqueness invariants. General crypto-side invariants are tracked in the key-management proposal.
  • system-monitoring-proposal.md — volume unlock, decrypt failure, and format-params events are audit-worthy. The EncryptedBlockDevice service emits them through the audit cap. Generic key events are emitted by the key-management services.
  • live-upgrade-proposal.md — replacing the EncryptedBlockDevice service must preserve in-flight I/O and the DEK. The service holds sensitive state (the key material); live upgrade needs a state-transfer path that does not touch the disk and does not leak the key through shared memory.
  • ../design-risks-register.md — the register currently carries no dedicated R-entry for volume encryption or encryption-at-rest; that is intentional, because no implementation exists yet. The closest tracked entry is Q11 (“Capability persistence model”), which already lists this proposal alongside storage-and-naming-proposal.md as a tracker for the sealed/stored capability and key-material persistence path. Open a dedicated R-entry once Phase V1 lands a real on-disk format, since at that point the tag-area layout, LBA-nonce derivation, and revocation semantics become long-horizon design surfaces in their own right.

Open Questions

  1. Tag area layout. Sidecar journal (dm-integrity style, separate device or partition) vs. reserved footer per block group vs. derived-nonce-only-plus-separate-MAC-area. Affects write amplification, recovery, and fsync semantics. A small measurement study under QEMU would settle it.
  2. Key rotation at scale. Rewrap-only (KEK rotation) is cheap. Rekeying a DEK on a live volume means re-encrypting every block. Online rekey is a research problem; for capOS a controlled offline rekey service reading old-key and writing new-key is the honest first answer.
  3. Metadata leakage in Layer B. fscrypt-style filename encryption is fiddly (deterministic encryption to preserve directory lookups vs. randomized encryption that breaks them). Decide whether Layer B encrypts names as well as contents, and how lookups work if names are randomized.
  4. Backup re-encryption. A backup crossing trust boundaries needs either shared key material at both ends or an explicit re-encrypt step. Who does the re-encryption — the backup service, a dedicated re-encryption service, or a KMS-side primitive? Policy question, not a mechanism question, but worth documenting defaults.
  5. Hardware inline crypto as a separate capability. NVMe OPAL and SED drives do not fit the software-AEAD model. Define SelfEncryptingBlockDevice with its own open/lock/unlock methods and a separate trust story (the device is in the TCB).
  6. Swap / paging. No swap yet. When added, encrypted swap with a per-boot ephemeral key is standard. The memory-pressure policy, page-eligibility rules, and swap lifecycle now live in OOM Handling and Swap.
  7. Firmware and boot-partition integrity. This proposal assumes secure boot / measured boot is available when TPM-sealed keys are in use. The actual secure-boot work is owned by storage-and-naming-proposal.md Open Question #5 and is prerequisite, not in scope here.

Algorithm enum scope, side-channel hardening, post-quantum migration, GOST support, and audit granularity are answered in Cryptography and Key Management’s open-questions section rather than duplicated here.

Proposal: Cloud Instance Bootstrap

Picking up instance-specific configuration — SSH keys, hostname, network config, user-supplied payload — from cloud provider metadata sources, without porting the Canonical cloud-init stack.

Problem

A capOS ISO built once has to boot on any cloud VM and adapt to its environment: different instance IDs, different public IPs, different operator-supplied SSH keys, different user-data payloads. Without this, every instance needs a custom-baked ISO — and the content-addressed-boot story (“same hash boots identically on N machines”) devalues itself at the point where it would actually matter for operations.

The Linux convention is cloud-init: a Python daemon that reads metadata from provider-specific sources and applies it by writing files under /etc, invoking systemctl, creating users, and running shell scripts. Porting it is a non-starter:

  • Python, POSIX, systemd-dependent.
  • Runs as root with ambient authority: parses untrusted user-data as shell scripts, mutates arbitrary system state.
  • ~100k lines covering hundreds of rarely-used modules (chef, puppet, seed_random, phone_home).
  • Assumes a package manager and init system that do not exist on capOS.

capOS needs the pattern — consume provider metadata, use it to bootstrap the instance — reshaped to the capability model.

Metadata Sources

All major clouds expose instance metadata through one or more of:

  • HTTP IMDS. 169.254.169.254. AWS IMDSv2 requires a PUT token-exchange handshake; GCP and Azure accept direct GET. Paths differ per provider. Needs a running network stack.
  • ConfigDrive. An ISO9660 filesystem attached as a block device, containing meta_data.json (or equivalent) and optional user-data file. OpenStack, older Azure. Needs a block driver and filesystem reader, no network.
  • SMBIOS / DMI. Vendor, product, serial-number, UUID fields populated by the hypervisor. Good for provider detection before networking comes up.
  • NoCloud. Seed files baked into the image or on an attached FAT disk. Useful for development and bare-metal.

The bootstrap service should read from whichever source is present rather than hardcoding one. Provider detection via SMBIOS runs first (no dependencies), then the appropriate transport is initialized.

CloudMetadata Capability

A single capnp interface; one or more implementations:

interface CloudMetadata {
    # Instance identity
    instanceId    @0 () -> (id :Text);
    instanceType  @1 () -> (type :Text);
    hostname      @2 () -> (name :Text);
    region        @3 () -> (region :Text);

    # Network configuration (primary interface addresses, gateway, DNS)
    networkConfig @4 () -> (config :NetworkConfig);

    # Authentication material
    sshKeys       @5 () -> (keys :List(Text));

    # User-supplied payload. Opaque to the metadata provider.
    userData      @6 () -> (data :Data, contentType :Text);

    # Vendor-supplied payload. Separate from userData so the
    # bootstrap policy can trust them differently.
    vendorData    @7 () -> (data :Data, contentType :Text);
}

struct NetworkConfig {
    interfaces @0 :List(Interface);

    struct Interface {
        macAddress @0 :Text;
        ipv4       @1 :List(IpAddress);
        ipv6       @2 :List(IpAddress);
        gateway    @3 :Text;
        dnsServers @4 :List(Text);
        mtu        @5 :UInt16;
    }
}

Implementations:

  • HttpMetadata — fetches from 169.254.169.254; one variant per provider because paths and auth handshakes differ (AWS IMDSv2 token, GCP Metadata-Flavor: Google, Azure API version).
  • ConfigDriveMetadata — reads an ISO9660 seed disk.
  • NoCloudMetadata — reads a seed blob from the initial manifest.

Detection lives in a small probe service that inspects SMBIOS (System Manufacturer: Google, Amazon EC2, Microsoft Corporation, …) and grants the cloud-bootstrap service the appropriate CloudMetadata implementation as part of a manifest delta.

Bootstrap Service

A single service — cloud-bootstrap — runs once per boot:

cloud-bootstrap:
  caps:
    - metadata: CloudMetadata        # from probe service
    - manifest: ManifestUpdater      # narrow authority to extend the graph
    - network:  NetworkConfigurator  # apply interface addresses
    - ssh_keys: KeyStore             # target store for authorized keys
  user_data_handlers:
    - application/x-capos-manifest: ManifestDeltaHandler
    # operator-installed handlers for other content types

Sequence:

  1. Gather identity and declarative config (instanceId, hostname, networkConfig, sshKeys), apply through the narrow caps above.
  2. (data, ct) = metadata.userData() — dispatch by content type. If no handler is registered, log and skip.
  3. Exit.

The service never holds ProcessSpawner directly. It holds ManifestUpdater, a wrapper that accepts capnp-encoded ManifestDelta messages and applies them through the existing init spawn path. The decoder and apply path are shared with the build-time pipeline (same capos-config crate, same spawn loop). The precise shape of ManifestDelta is an open question — see “Open Questions” below — but at minimum it covers hostname, network config, SSH keys, and authorized application-level service additions:

struct ManifestDelta {
    addServices      @0 :List(ServiceEntry);
    addBinaries      @1 :List(NamedBlob);
    setHostname      @2 :Text;
    setNetworkConfig @3 :NetworkConfig;
}

Relationship to the Build-Time Manifest Pipeline

The existing build-time pipeline (system.cuetools/mkmanifestmanifest.bin → Limine boot module → capos-config decoder → init spawn loop) and the cloud-metadata bootstrap path are not two parallel systems. They are the same pipeline with different transports and different trust scopes. See docs/proposals/system-configuration-proposal.md for the authoring side — layered package capos CUE, the cue/defaults/defaults.cue baseline, operator-supplied system.local.cue overlays, @tag(user) host-user injection, and the slice-4 mkmanifest cue-to-capnp host tool that turns arbitrary schema-aware CUE into capnp bytes without expanding the boot-manifest ABI. The same authoring tool, decoder, and merge contract back the cloud path; the only delta on the cloud side is who hands the capnp bytes to the parser and which cap applies them.

StageBuild-time (baked ISO)Runtime (cloud metadata)
Authoringsystem.cue in the repouser-data.cue on the operator’s host
Compilemkmanifest (CUE → capnp)same tool, same output
TransportLimine boot moduleHTTP IMDS / ConfigDrive / NoCloud disk
Wire formatcapnp-encoded SystemManifestcapnp-encoded ManifestDelta
Decodercapos-configcapos-config
Applyinit spawn loopsame spawn loop, invoked via ManifestUpdater

Three practical consequences:

  • CUE is a host-side authoring convenience, not an on-wire format. Neither kernel nor init evaluates CUE. An operator supplying user-data writes user-data.cue, runs `mkmanifest user-data.cue

    user-data.binon their host, and ships the capnp bytes (base64 into–metadata [email protected]` for GCP/AWS, or as a file on a ConfigDrive ISO).

  • NoCloud is a Limine boot module by another name. A NoCloud seed blob is the same bytes as a baked-in manifest.bin, attached via a disk or bundled into the ISO instead of handed over by the bootloader. The only difference is who hands the bytes to the parser.
  • No new schema surface. ManifestDelta is defined alongside SystemManifest in schema/capos.capnp, and sharing the decoder means ManifestUpdater’s apply path is a thin merge-and-spawn on top of code that already boots the base system.

The trust model stays clean precisely because ManifestDelta is not SystemManifest. The base manifest is inside the content-addressed ISO hash (fully trusted, reproducible). The runtime delta is applied by a narrowly-permitted service whose caps define what fields of the delta can actually take effect — the content-addressed-boot story is preserved because cloud metadata augments the base graph, it cannot replace it.

User-Data Model

User-data on the wire is a capnp blob, not a shell script. Content type application/x-capos-manifest identifies the canonical case: the payload is a ManifestDelta message produced by mkmanifest on the operator’s host and consumed directly by the bootstrap service.

For cross-cloud-vendor compatibility, operators can install user-data dispatcher services for other content types (YAML, other capnp schemas, signed manifests, etc.). The bootstrap service holds a handler cap per content type; unknown types are logged and ignored, not executed.

Shell-script user-data — the Linux default — has nowhere to run on capOS because there is no shell and no ambient-authority process to execute it under. An operator who insists on this can install a shell service and a handler that routes text/x-shellscript to it, but that is a deliberate choice, not a default fallback.

Trust Model

The capability angle earns its keep here.

  • The metadata endpoint is assumed as trustworthy as the hypervisor running the VM — the same assumption Linux cloud-init makes.
  • The bootstrap service holds narrow caps (ManifestUpdater, NetworkConfigurator, KeyStore), not ambient root. A bug or a malicious metadata response can at most spawn services the ManifestUpdater accepts, set network config the NetworkConfigurator accepts, and drop keys into the KeyStore. It cannot reach for arbitrary system state.
  • vendorData and userData are separated on the wire. A policy that trusts the cloud provider but not the operator (e.g., apply vendorData as-is, route userData through a signature check) is expressible by granting different handler caps to each.
  • User-data content-type dispatch is capability-mediated: the bootstrap service cannot execute a content type it wasn’t given a handler for. There is no fallback “try to run it as shell.”

Phased Implementation

Most of the manifest-handling machinery already exists from the build-time pipeline (capos-config, mkmanifest, init’s spawn loop). The new work is transports, provider detection, and the ManifestDelta merge semantics. The transport and platform prerequisites — SMBIOS decode beyond the bounded diagnostics snapshot, ISO9660/block stack for ConfigDrive, userspace networking for HTTP IMDS, and cloud-vendor disk-image bring-up — all land through docs/proposals/cloud-deployment-proposal.md, which already owns the imported-image boot proof and the userspace-driver authority gate this proposal depends on.

  1. ManifestDelta schema and ManifestUpdater cap. Add the delta type to schema/capos.capnp alongside SystemManifest, extend capos-config with a merge routine (SystemManifest + ManifestDelta → new services to spawn), and expose ManifestUpdater as a cap in init. NoCloudMetadata seeded from a test fixture is enough to demo the apply path end-to-end without any cloud dependency.
  2. Provider detection via SMBIOS. Kernel-side primitive or capability that reads SMBIOS DMI tables and exposes manufacturer / product strings. No network required.
  3. ConfigDrive support. ISO9660 reader plus ConfigDriveMetadata. Gives a working real-transport metadata source with no dependency on userspace networking. QEMU can attach one via -drive file=configdrive.iso,if=virtio for local testing.
  4. HttpMetadata per provider. Requires the userspace network stack (Stage 6+). GCP first (simplest auth), then AWS (IMDSv2 token flow), then Azure.
  5. Cross-provider Cloud Metadata demo. Same ISO hash boots under QEMU, GCP, AWS, and Azure; the only difference is the SMBIOS manufacturer string, which the probe service uses to pick the right HttpMetadata variant. This is the Cloud Metadata observable milestone.

Open Questions

Which fields of system.cue are runtime-modifiable?

system.cue today is a handful of service entries with kernel Console cap grants encoded as structured source variants. That will grow. Plausible additions as capOS matures: driver process definitions (virtio-net, virtio-blk, NVMe) with device MMIO, interrupt, and frame allocator grants; scheduler tuning (priority, budget, CPU pinning); filesystem driver services; memory-policy hooks; ACPI/SMBIOS consumers.

Most of those are either fragile (kernel-adjacent; a bad value bricks the instance), sensitive (granting kernel:frame_allocator to a user-data-declared service is effectively root), or both. A ManifestDelta with full SystemManifest equivalence hands every such knob to whoever controls user-data.

The narrowing has to happen somewhere, but there are several places it could live:

  1. Different schema. ManifestDelta is not structurally a subset of SystemManifest — it omits driver entries, scheduler config, and kernel cap sources entirely. Schema-level guarantee; rigid but unambiguous.
  2. Shared schema, policy-narrowing cap. ManifestUpdater accepts a full delta but validates at apply time: kernel source variants are rejected unless explicitly allow-listed by the cap’s parameters; additions that touch driver-level service entries fail. Flexible, but the narrowing logic is code that has to be audited, not a schema that is self-documenting.
  3. Tiered deltas. PrivilegedDelta (drivers, scheduler) and ApplicationDelta (hostname, SSH keys, app services), minted by different caps. An operator supervisor holds PrivilegedManifestUpdater; cloud-bootstrap holds only ApplicationManifestUpdater. Compositional; matches the capability-model grain but doubles the schema surface.
  4. Tag-based field permissions. Fields in ServiceEntry carry a privilege tag; ManifestUpdater is parameterized with a permitted-tag set. One schema, orthogonal policy.

Picking one prematurely would either over-constrain the cloud path (option 1 before we know what apps legitimately need) or under-constrain it (option 2 without clarity on what to check against). This proposal commits only to the shared pipeline (decoder, spawn loop, authoring tool). The shape of the public type(s) the cap accepts is deferred until system.cue has grown enough that the privileged vs. application split is visible in concrete form.

Related open question: whether kernel cap sources should be expressible in system.cue at all, or whether the build-time manifest should also declare them through a narrower mechanism so that the same discipline that protects cloud user-data also protects the baked-in manifest from accidental over-grants. If they remain expressible, they should be structured enum/union variants, not free-form strings; the associated interface TYPE_ID is only a schema compatibility check and does not identify the authority being granted.

Non-Goals

  • cloud-init compatibility. No parsing of #cloud-config YAML, no #!/bin/bash execution, no include-url, no MIME multipart handling. Operators who need these install their own dispatcher services; the base system does not.
  • Runtime package installation. The capOS equivalent of “install nginx on boot” is “include nginx in the manifest.” User-data can add services to the manifest; it cannot install packages (there is no package manager to install into).
  • Re-running on every boot. cloud-init distinguishes per-boot, per-instance, and per-once modules. The capOS bootstrap service runs once per boot; the manifest it produces is cached under the instance ID, and subsequent boots read the cache and skip the metadata round-trip. A full mode matrix is future work.
  • IPv6-only bring-up in the first iteration. Many clouds expose both; the schema supports both; the first implementations do whichever is easier per provider (typically IPv4).
  • Automatic secret rotation. Metadata often exposes short-lived credentials (IAM role tokens on AWS, service-account tokens on GCP). Refresh logic belongs to the service that consumes the credential, not to cloud-bootstrap.
  • docs/proposals/cloud-deployment-proposal.md owns the hardware/disk/network surface this proposal sits on: PCIe config-space access, MSI/MSI-X, ACPI/SMBIOS, virtio-net/virtio-blk, cloud-vendor disk-image bring-up, and the userspace-driver authority gate (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog). The probe service’s SMBIOS read, the ConfigDrive block path, and the HTTP IMDS network path all wait on primitives tracked there.
  • docs/proposals/system-configuration-proposal.md owns the authoring side: package capos layering, cue/defaults/defaults.cue baseline, system.local.cue overlay, host-user @tag(user) injection, the per-user ~/.capos-tools cache, and the slice-4 mkmanifest cue-to-capnp host tool. The cloud bootstrap service reuses the same decoder, the same merge contract, and the same authoring conventions; only the transport and the apply cap differ.
  • docs/proposals/service-architecture-proposal.md defines the init spawn loop and ProcessSpawner boundary the ManifestUpdater cap narrows.
  • docs/proposals/cryptography-and-key-management-proposal.md and docs/proposals/certificates-and-tls-proposal.md own the trust anchors that signed-manifest user-data handlers will need once they exist.
  • cloud-init (Canonical). The Linux reference. Huge scope, shell-script-centric, assumes root and POSIX. The capOS design intentionally takes the pattern and drops everything that depends on ambient authority.
  • ignition (CoreOS/Flatcar). Runs once in initramfs, consumes a JSON spec, fails-fast if the spec can’t be applied. Closer in spirit to the capOS design — small, single-pass, declarative. Worth studying for its rollback and error-handling approach.
  • AWS IMDSv2. The token-exchange handshake is the one thing the HTTP client needs to handle that is not plain GETs. Designing the HttpMetadata interface without accounting for it up front leads to a rewrite later.

Proposal: Hardware Abstraction and Cloud Deployment

How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.

Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for timer history), Stage 7 / SMP proposal Phase C (for LAPIC timer and IPI).

Complements: Networking proposal (extends virtio-net toward cloud NICs), Storage proposal (extends local block-device work toward virtio-scsi and NVMe), SMP proposal (LAPIC timer/IPI infrastructure shared, with x2APIC tracked as a later backend).


Current State

The kernel boots via Limine UEFI, outputs to COM1 serial, has QEMU legacy PCI enumeration for the virtio-net smoke path, and has LAPIC timer/IPI groundwork from the SMP track. It also has an initial bounded, read-only ACPI diagnostic parser for Limine RSDP, RSDT/XSDT table inventory, MADT summaries, and MCFG presence/allocation summaries, plus a Q35 smoke that proves the reusable PCI config backend can enumerate a capped PCIe ECAM function inventory from MCFG. The x86 path exports bounded MADT I/O APIC/source-override records, maps the I/O APIC, and programs masked legacy IRQ routes to LAPIC vectors while honoring source overrides. PCI drivers can validate and map memory BAR subregions through a shared kernel helper; the virtio-net modern transport uses that helper for its common, notify, ISR, and device configuration regions. The PCI capability walk also reports MSI/MSI-X metadata for the virtio-net function, and the QEMU net smoke uses that metadata for a bounded kernel-owned virtio-net MSI-X dispatch/unmask and lifecycle proof through the device MSI vector pool; the remaining run-net fixture also covers queue setup, descriptor guards, ARP, and ICMP. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates after the kernel L4 owner is retired.

The cloudboot image/harness slice landed in commit 02635421 (2026-05-05 06:51 UTC): make capos-cloudboot-image builds the importable raw disk tarball and make cloudboot-test drives the GCE upload/import/temporary-instance/serial-log loop with teardown. The first GCP imported-image serial-console boot proof is run 1778230874-715a (2026-05-08 09:06 UTC) against source commit 3951e275 (2026-05-08 08:50 UTC), reaching the capos kernel starting serial landmark on a temporary no-public-IP, no-service-account/scopes e2-small instance before teardown.

It still lacks public L4/SSH/WebShell ingress, AWS/Azure boot proofs and provider drivers, broader storage variants, high-throughput/multiqueue NIC readiness, direct-remapping DMA, production cloud-image release paths, and a cloud-ready clocksource/clockevent closeout. The GCP-first provider rollup has live serial-console operator access, selected NIC raw-frame reachability, selected NVMe Persistent Disk I/O, and gVNIC portability evidence.

The GCP-first usable cloud-instance provider rollup is closed by docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md. Do not cite the cloudboot harness or the first GCP serial-console boot alone as evidence for provider NIC/storage readiness; the closeout depends on separate live NIC, storage, operator-access, and gVNIC evidence records. AWS/Azure, public ingress, and production cloud-image release gates remain separate.

Cloud deployment depends on the same trusted-build-inputs inventory that covers local builds. The consolidated supply-chain risk view – floating Rust nightly, observed-not-pinned xorriso / qemu-system-x86_64 / OVMF, CI publication and comparison of build-provenance records, and pinned production runner identity – is tracked as R13 in docs/design-risks-register.md; the detailed inventory, dependency policy, vendored-snapshot table, and the build-provenance retention/comparison policy live in docs/trusted-build-inputs.md. This proposal is recorded as a secondary owner of R13 because cloud-image release paths and provider-driver bring-up both depend on those reproducibility gates.

The implication for cloud bring-up is concrete: imported cloud images must travel with the corresponding make build-provenance record (source commit, toolchain identity, embedded-binary hashes, OVMF identity or explicit absence) before any provider serial-console run is cited as production evidence. Until the R13 gates close, cloud images remain local/CI proof artifacts rather than third-party reproducible boot images.

What Cloud VMs Provide

GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:

ResourceCloud interfacecapOS status
Boot firmwareUEFI (all three)Limine UEFI works
Serial consoleCOM1 0x3F8Works (serial.rs)
Boot mediaHybrid BIOS+UEFI raw disk image, packaged per provider import rulesPartial (make capos-cloudboot-image builds a GCE-importable raw disk tarball; production release packaging and non-GCP provider packaging remain future)
Storagevirtio-scsi or NVMe (GCP Persistent Disk), NVMe/EBS (AWS Nitro), managed disksPartial (GCP NVMe Persistent Disk brokered READ proof landed; GCP virtio-scsi, Local SSD, AWS/Azure storage, and broader filesystem-backed cloud storage remain future)
NICvirtio-net or gVNIC (GCP), ENA (AWS), MANA (Azure)Partial (GCP legacy virtio-net raw-frame provider-nic-bound and gVNIC raw-frame / typed-Nic proofs landed; public ingress, high-throughput/multiqueue, ENA, and MANA remain future)
Virtio NICQEMU, GCP where selectable, some bare-metalPartial (QEMU smoke; reusable/cloud path planned)
TimerLAPIC timer, TSC, HPETPartial (LAPIC timer groundwork; cloud clocksource work missing)
Interrupt deliveryI/O APIC, MSI/MSI-XPartial (masked MADT-backed I/O APIC routes, MSI/MSI-X capability metadata, and bounded kernel-owned virtio-net MSI-X dispatch/lifecycle proof; I/O APIC ownership and userspace interrupt authority missing)
Device discoveryACPI + PCI/PCIePartial (QEMU legacy PCI smoke, bounded ACPI diagnostics/routing state, reusable legacy/ECAM PCI config access, kernel BAR/MMIO validation, MSI/MSI-X metadata discovery, and bounded virtio-net MSI-X dispatch proof; broader driver authority still missing)
DisplayNone (headless)N/A

Cloud NIC And Storage Portability Notes

The Device Driver Foundation is not complete just because QEMU virtio-net works. Cloud bring-up has provider-specific NIC and storage surfaces, and the first implementation slices must keep those differences visible while still deferring the actual provider drivers.

Provider pathExpected device surfacecapOS dependencyCurrent state
QEMU / constrained GCP virtio-netVirtio PCI transport, virtqueues, MSI-X where availableShared virtio transport helpers, DMAPool, DeviceMmio, Interrupt, and queue lifecycle proofsQEMU virtio-net proofs and the live GCE legacy virtio-net raw-frame provider-nic-bound proof landed. This does not claim public L4 ingress, high-throughput/multiqueue readiness, or device-autonomous MSI-X completion delivery
GCP gVNICgVNIC as the modern Compute Engine NIC, replacing virtio-net on newer machine generations and required for some featuresPCI BAR/MMIO binding, MSI-X routing, per-queue ring setup, image metadata declaring GVNIC, and fallback choice between virtio-net and gVNIC by machine familyGrounding plus bounded live proofs landed: the GCE gVNIC provenance map records the spec basis and authority mapping, the GCE harness can request GVNIC image/instance posture and inventory the 1ae0:0042 PCI function, the admin-queue/register proof maps BAR0 and issues one DESCRIBE_DEVICE, the raw-frame proof configures one GQI/QPL TX/RX queue pair, and the typed Nic adaptation proof exercises inline-frame Nic.transmit / Nic.receive over live gVNIC. No QEMU gVNIC model exists. This remains a separate GCE portability lane, not a blocker for the first public Web UI proof on a virtio-compatible machine type
AWS Nitro ENA + EBSENA enhanced networking plus Nitro NVMe storageENA queue/MSI-X driver, NVMe controller/storage path, IOMMU or bounce-buffer policy, and image import with ENA/NVMe expectationsPlanned; no ENA, NVMe EBS, or AWS boot proof
Azure Accelerated NetworkingAccelerated Networking exposes SR-IOV hardware families, with MANA as the newer Azure NIC and Mellanox mlx4/mlx5 still relevant on some hostsSynthetic-interface fallback awareness, VF binding/revocation handling, MANA/Mellanox driver binding, MSI-X routing, and reset/revoke paths that survive VF removalPlanned; no MANA, Mellanox VF, or Azure boot proof

These rows are planning gates, not implementation evidence. Each provider NIC has its own queue layout, feature negotiation, MSI-X/vector conventions, reset behavior, and driver-binding rules. Azure’s accelerated-networking path also requires the OS and applications to tolerate dynamic SR-IOV VF revocation by falling back to the synthetic network interface. Provider storage follows the same rule: AWS Nitro uses NVMe for EBS, GCP can require NVMe on newer or Confidential VM paths while retaining virtio-scsi on older paths, and Azure uses SCSI on many older families while Azure Boost and newer NVMe-capable VM families expose managed disks through NVMe. The shared foundation therefore needs ACPI/PCIe discovery, BAR validation, interrupt ownership, DMAPool accounting, IOMMU/bounce-buffer policy, and lifecycle teardown before any cloud NIC or storage driver is treated as portable.

What Already Works

  • UEFI boot – Limine ISO includes BOOTX64.EFI. The boot path itself is cloud-compatible.
  • Serial output – all three clouds expose COM1. gcloud compute instances get-serial-port-output, aws ec2 get-console-output, and Azure serial console all read from it.
  • x86_64 long mode – cloud VMs are KVM-based x86_64. Architecture matches.

Managed Application Services

Booting capOS on a cloud VM and using managed cloud services are separate tracks. The VM path proves hardware, disk, network, and serial behavior. Managed services can be useful earlier for application persistence, especially game profile/world state, as long as they sit behind narrow capOS service capabilities.

For a GCP-backed adventure persistence bridge:

  • Cloud Run hosts a small bridge endpoint. It translates capOS save/load/append requests into provider calls and enforces request bounds before touching cloud APIs.
  • Cloud KMS owns the key-encrypting keys (KEKs) for each game-world instance or shard. The bridge or game-world service gets narrow authority to wrap or unwrap data-encrypting keys (DEKs) through Cloud KMS envelope encryption. Ordinary browser clients do not receive DEKs, game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority; provider storage objects contain ciphertext, wrapped DEKs, and metadata only.
  • Firestore Native mode stores mutable profile summaries, indexes, and compare-and-set version records.
  • Cloud Storage stores larger immutable snapshots, evidence blobs, exports, and content-addressed records. Object versioning and lifecycle policy are required before using it for durable game data.
  • Secret Manager stores bridge-side provider credentials and rotation material. Those secrets are never granted to ordinary capOS game clients.

This does not change the storage proposal’s rule: persistence is still application-level serialization of bounded Cap’n Proto records. The cloud bridge is just one backing implementation for Store, Namespace, or an app-specific AdventureSaveStore/CloudGameStore capability. Local fake-cloud tests must enforce stale-write rejection, wrong-profile rejection, append-only ledger behavior, and size bounds before a real GCP deployment is trusted.

A separate browser-mediated path can serve user-owned private backups. In that model, the browser or web terminal host authenticates the user to Google, stores encrypted save capsules in Drive appDataFolder or Firebase user documents, and returns only opaque provider handles and encrypted capsule bytes through explicit restore flows. DEK unwrap and plaintext validation happen in the local capOS key domain or in the game-world service with KMS/IAM authority, not in browser JavaScript. This is appropriate for user profile backup, private expedition checkpoints, and settings sync. It is not appropriate for authoritative public world state, reward witness records, market receipts, or multiplayer outcomes. The user’s browser holds provider tokens; capOS game services do not. For GCP-backed game worlds, the browser transports envelope-encrypted capsules with wrapped DEKs but does not hold game-world key capabilities, KMS decrypt/unwrap grants, DEKs, or plaintext authority.

Firebase user-document capsule paths must make the auth binding visible in the path template, not just in policy metadata. Use a narrow shape such as users/{request.auth.uid}/saveCapsules/{capsule_id} so Firestore rules can bind the user wildcard to request.auth.uid; literal profile names such as users/alice/... are not accepted by the capOS policy model. Firestore rules remain access control for opaque encrypted capsules only. They must not be treated as validation for decrypted adventure semantics, and path segments must respect Firestore ID constraints such as no ., no .., no __.*__, and the 1,500-byte collection/document ID limit.

GCP Cloud KMS And IAM Notes For Adventure Saves

GCP-backed adventure save capsules follow the same envelope-encryption model as CloudKmsKeySource and the volume-encryption proposal: Cloud KMS holds a key-encrypting key (KEK), the game-world service owns the capsule data-encrypting key (DEK), and KMS Encrypt/Decrypt wraps or unwraps that DEK rather than bulk-encrypting capsule bytes. Provision one Cloud KMS key ring and one symmetric CryptoKey KEK per game-world instance or shard. The key ring is an administrative grouping boundary; ordinary runtime authority should be granted on the CryptoKey resource where possible, not at the project or key-ring level. Do not claim key-version-scoped IAM as a design primitive for this path: predefined Cloud KMS crypto roles have CryptoKey as their lowest grantable resource.

Service accounts are split by operation:

  • Writers that only create new ciphertext receive roles/cloudkms.cryptoKeyEncrypter on the configured game-world CryptoKey so they can wrap a freshly generated DEK.
  • Restore, validation, and migration workers that must read protected capsules receive roles/cloudkms.cryptoKeyDecrypter on that CryptoKey so they can unwrap an existing DEK.
  • The narrow game-world service account receives roles/cloudkms.cryptoKeyEncrypterDecrypter only when the same service must both wrap and unwrap DEKs. Avoid roles/cloudkms.cryptoOperator, project-wide grants, owner/editor roles, browser OAuth identities, and service-agent roles for ordinary adventure runtime access.

The browser-vault boundary does not change. Browser JavaScript may carry ciphertext, wrapped DEKs, capsule metadata, and opaque Drive/Firebase provider handles. It must not receive plaintext DEKs, capOS SymmetricKey or KeySource capabilities, Cloud KMS decrypt/unwrap grants, service account credentials, or provider-independent plaintext. The game-world service may use the unwrapped DEK internally as service authority, modeled as a SymmetricKey capability, but that authority does not cross into browser JavaScript. Possession of a Drive file id or Firebase document path is only transport authority over opaque encrypted bytes.

Rotation creates a new primary KEK version for future DEK wrapping. It does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old key versions automatically. Capsule re-encryption or rewrapping is a managed game-world service operation: unwrap the old DEK while its KEK version remains enabled and authorized, decrypt and validate the capsule inside the service, then write a new capsule using a new DEK or a DEK rewrapped by the current primary KEK version. The service verifies content hashes and ledger/profile bindings before replacing capsule metadata. Old KEK versions should only be disabled or scheduled for destruction after inventory proves no accepted wrapped DEK still depends on them.

Retiring a game-world first removes IAM decrypt authority from the world service and migration workers. If the retirement is meant to make existing capsules inaccessible, disable the relevant key versions and record the expected outage and recovery procedure before doing it. Destruction is delayed by Cloud KMS’ scheduled destruction period and is irreversible once completed, so destroy key versions only after audit retention, export, and break-glass recovery decisions are recorded. Disabling or destroying a key version can make all capsules that depend on it unreadable; this is a revocation tool, not cleanup.


Phase 1: Bootable Disk Image And Serial Diagnostics

Goal: Produce a raw hybrid BIOS+UEFI disk image that can boot locally and can be packaged for cloud import, alongside the existing ISO for QEMU. The first cloud-visible proof is serial-console boot to init/diagnostics, not network shell access.

The Problem

Cloud VMs boot from disk images, not ISOs. Each cloud has provider-specific format and boot-mode rules:

CloudImage formatImport method
GCPdisk.raw in gzip .tar.gz using old GNU tar; raw size in 1 GiB incrementsgcloud compute images create --source-uri=gs://...
AWSraw, VMDK, VHD/VHDX, or OVAaws ec2 import-image with explicit boot-mode notes
AzureVHD (fixed size)az image create --source

GCP’s manual import path documents a functional MBR partition table or a hybrid GPT+MBR bootloader configuration for imported boot disks, plus ACPI support. AWS VM Import/Export supports both UEFI and legacy BIOS boot modes, but UEFI imports need a fallback EFI binary at /EFI/BOOT/BOOTX64.EFI; Nitro instances generally expect NVMe storage and ENA networking for useful operation. Therefore the first capOS image target should be a hybrid BIOS+UEFI raw disk: an ESP for UEFI fallback boot and a BIOS/MBR-compatible Limine path for import paths that still validate MBR bootability.

Disk Layout

Hybrid raw disk image (1 GiB-aligned for cloud packaging)
  Protective/hybrid MBR + GPT
  Partition 1: EFI System Partition (FAT32, ~32 MB)
    /EFI/BOOT/BOOTX64.EFI     (Limine UEFI loader)
    /limine.conf               (bootloader config)
    /boot/kernel               (capOS kernel ELF)
    /boot/init                 (init process ELF)
  Partition 2: (reserved for future use -- persistent store backing)

Build Tooling

New Makefile target make image using standard tools:

IMAGE := capos.img
IMAGE_SIZE := 1024  # MB, keeps GCP raw image packaging simple

image: kernel init $(LIMINE_DIR)
	# Create raw disk image
	dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
	# Partition with GPT + ESP; keep room for hybrid/MBR boot metadata.
	sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
	# Format ESP as FAT32, copy files
	# (mtools or loop mount + mkfs.fat)
	mformat -i $(IMAGE)@@1M -F -T 65536 ::
	mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
	mcopy -i $(IMAGE)@@1M limine.conf ::/
	mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
	mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
	# Install Limine BIOS path as well as UEFI fallback files.
	$(LIMINE_DIR)/limine bios-install $(IMAGE)

New QEMU target to test disk boot locally:

run-disk: $(IMAGE)
	qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
		-bios /usr/share/edk2/x64/OVMF.4m.fd \
		-display none $(QEMU_COMMON); \
	test $$? -eq 1

Cloud upload helpers (scripts, not Makefile targets):

# GCP
cp capos.img disk.raw
tar --format=oldgnu -Sczf capos.tar.gz disk.raw
gcloud storage cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz

# AWS
aws ec2 import-image --disk-containers \
  "Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}" \
  --boot-mode uefi

Serial diagnostics are part of Phase 1 rather than a later convenience. The cloud bring-up loop should be:

  1. make run-disk proves the hybrid image under local QEMU/OVMF.
  2. a local BIOS-mode disk run proves the MBR/Limine path if provider import requires it;
  3. a serial diagnostics prompt is reachable on COM1 in QEMU;
  4. GCP/AWS imported instances reach the same prompt through provider serial console output.

The serial diagnostics prompt should expose bounded read-only commands for status, cpu, mem, acpi, pci, irq, timers, devices, and logs, plus reboot/halt. It is the early remote debugging path for cloud driver bring-up before NICs or disks are reliable. It should not be required to upload large binaries, replace kernels in place, or stream high-volume tracing through cloud serial consoles.

Dependencies

  • sgdisk (gdisk package) – GPT partitioning
  • mtools (mformat, mcopy) – FAT32 manipulation without root/loop mount

Scope

Makefile/helper script work for the image plus a narrow diagnostics-mode surface. Kernel changes are limited to serial diagnostics and any boot path adjustments needed for disk images; network and block drivers remain later phases.

Phase 0 closeout: GCE harness landed (2026-05-05 06:51 UTC)

Commit 02635421 (2026-05-05 06:51 UTC) records this harness closeout.

The first build-and-boot leg of Phase 1 landed as the cloud-boot harness. make capos-cloudboot-image produces a 10 GiB GPT-partitioned target/disk.raw with a 128 MiB FAT32 EFI System Partition holding the Limine UEFI loader, limine.conf, the kernel ELF, and the manifest, plus the Limine BIOS stage 2 embedded in the GPT for legacy SeaBIOS boot. The disk is repackaged as target/capos-disk.tar.gz using tar --format=oldgnu -czf, the exact form GCE’s manual import path expects. Disk size is enforced as an exact multiple of 1 GiB.

tools/cloudboot/run-test.sh (also wired as make cloudboot-test) drives the end-to-end loop on a sandbox GCE project: an idempotent orphan sweep on a configured project-pinned label, a staging tarball upload, image creation, instance creation with no public IP, no service account, no API scopes, the same project-pinned label set, and the configured sandbox subnet, then serial-port polling for the capos kernel starting landmark with a hard wall-clock budget. Serial output is captured under target/cloudboot-evidence/run-<id>/serial.log BEFORE teardown, and a bash trap on EXIT INT TERM always deletes the instance, image, and staged tarball even on signal or partial failure. The harness hard-fails if the active project name does not match the configured sandbox.

Sandbox project name, subnet, staging bucket, and the IAM custom roles the harness assumes are operational details that depend on the host environment; they belong in tools/cloudboot/README.md and operator-local configuration, not in this proposal.

This is the harness only. The recurring portability gate that records cloud boot evidence on every reviewed cloud-relevant change remains open as docs/backlog/hardware-boot-storage.md Task 6, and the userspace driver authority gate remains open under DDF Task 5.

First GCP serial-console boot proof (2026-05-08 09:06 UTC)

The first imported-image GCP serial-console proof reached capos kernel starting as run 1778230874-715a at 2026-05-08 09:06 UTC, against source commit 3951e275 from 2026-05-08 08:50 UTC. The run used the cloudboot harness to import the staged disk image, create a temporary e2-small instance with no public IP and no service account/scopes, poll serial output for the kernel-start landmark, save the serial log under the run evidence directory, and tear down the temporary instance/image/staging objects.

This proves imported-image firmware/bootloader/kernel serial reachability on one GCP sandbox run only. It does not prove a usable cloud instance, provider NIC or storage drivers, cloud clocking, persistence, SSH/network shell access, AWS/Azure import, or production cloud readiness.

Private Web UI Reachability Evidence Contract

The first self-hosted Web UI provider proof is private GCE reachability, not operator browser exposure. The behavior task cloud-gce-private-self-hosted-webui-proof extends tools/cloudboot/run-test.sh with --require-web-ui-proof only after the local Web UI L4 proof, DHCP/IPv4 configuration, and Web UI hardening tasks are closed. This proposal defines the evidence contract for that later behavior slice; it does not authorize a billable GCE run, a public endpoint, broad firewall changes, TLS certificate provisioning, service-account broadening, or a production release.

The proof must keep the current cloudboot posture unless the behavior task is explicitly amended: no public IP on the capOS VM, no service account, no API scopes, no public firewall rule, and teardown through the existing orphan-sweep and EXIT INT TERM trap discipline. The reachability probe must cross the live GCE virtual network boundary. Acceptable shapes include a same-VPC probe instance, a provider-supported internal probe path, or another reviewed private path that sends packets through the capOS VM’s GCE NIC and private endpoint.

Evidence classes stay separate:

Evidence classWhat it can proveWhat it cannot prove
Cloudboot-onlyThe image imports, boots, emits serial markers, and tears down provider resourcesWeb UI reachability over the provider network
Provider-privateA private probe reaches remote-session-web-ui through the live GCE NIC and Phase C L4 pathPublic operator access, TLS readiness, DNS readiness, or browser production posture
Operator-exposureA separately authorized public or browser-mediated path reaches the Web UI under the selected ingress policyThe private proof by itself; it must depend on the private proof instead

The private Web UI proof records, before teardown, at least:

FieldRequirement
Run identityCloudboot run id plus source commit or image provenance used for the imported image
Machine shapeGCE machine family/type, NIC selection posture, and zone
Private posturepublic_ip=false or equivalent, service-account/scopes posture, and no public firewall rule
Private endpointInternal IP or provider-private endpoint, UI port, and probe source identity
Probe pathSame-VPC probe, provider-supported internal probe, or other reviewed private path that crosses the GCE virtual network boundary
Web UI markerA run-unique Web UI response marker, header, or body token observed by the private probe
Phase C L4 markerThe remote-session-web-ui Phase C L4 evidence marker, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, tied to the same source commit/image
Private proof markerA final structured marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, emitted only after the private probe succeeds
TeardownInstance, image, staged object, probe resources, and any private firewall or route resources created by the run were deleted or reported as a failed run

Private Proof Runbook Checklist

The future --require-web-ui-proof harness gate closes provider-private Web UI reachability only when the run records these steps in order:

  1. Preflight confirms the local Web UI L4 proof, DHCP/IPv4 proof, session hardening, and connection-bound prerequisites are closed, and confirms that the run has current authorization for billable private GCE execution.
  2. Image/source provenance records the cloudboot run id, source commit, imported image or staged object identity, and the local artifact set used for the VM.
  3. Launch posture records the zone, machine type, NIC posture, no public IP, no service account or API scopes, and no public firewall rule.
  4. Probe setup records the private endpoint, UI port, probe source identity, and same-VPC or provider-supported private path that crosses the GCE virtual network boundary.
  5. The private probe fetches the Web UI over that provider-private path and records a run-unique response marker, header, or body token.
  6. The serial or harness evidence ties the same run to the Phase C L4 marker for remote-session-web-ui, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, from the same source commit/image.
  7. The harness emits the private proof marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, only after the provider-private probe and L4-marker correlation both succeed.
  8. Teardown removes the VM, imported image, staged object, probe resources, and any private firewall or route resources created by the run, using the normal orphan-sweep and trap discipline.
  9. Failed-run reporting preserves the run id, failure class, last observed private posture, teardown result, and whether any loopback, same-guest, or serial-only diagnostics passed without treating those diagnostics as a provider-private proof.

No-Spend Preflight (Step 1, Landed as a Local Gate)

Step 1 of the checklist is implemented and testable today without any provider mutation: tools/cloudboot/run-test.sh --require-web-ui-proof --preflight-only runs the local no-spend preflight and exits before the harness access probe, orphan sweep, upload, image import, instance launch, firewall mutation, or any probe resource. It validates that the local prerequisite proofs are done (cloud-prod-remote-session-web-ui-l4-local-proof, remote-session-web-ui-session-hardening, remote-session-web-ui-connection-bounds, and the legacy-datapath serving prerequisite cloud-gce-legacy-virtio-webui-serving-local-proof), that an operator supplied a firewall-IAM attestation (the documented live blocker), and that a current per-run billable authorization is present, emitting one structured cloudboot-webui-preflight: line per check naming the failure class without printing credentials or attestation values. make cloudboot-gce-private-webui-preflight-check is the fixture gate proving the safe failure paths and that no provider CLI is invoked on any preflight path (tools/cloudboot/README.md documents the inputs and failure classes). A preflight pass is cloudboot-only evidence – the output labels itself evidence-class=cloudboot-local-preflight – and is neither the provider-private proof nor authorization for a billable run. The live --require-web-ui-proof gate remains unimplemented and fails closed without --preflight-only.

Evidence-Grammar Fixture (Local Gate)

The closeout evidence grammar for the table above is also locally testable without any provider mutation: tools/cloudboot/validate-private-webui-evidence.sh validates a harness-rendered evidence report for field completeness, marker ordering (the private proof marker only after the recorded private-probe pass and the correlated remote-session-web-ui-l4 marker), run/source identity agreement, private posture, and teardown result, and rejects loopback-only, serial-only, same-guest, public-IP, public-firewall, and missing-teardown evidence with structured failure classes. make cloudboot-gce-private-webui-evidence-fixture-check is the fixture gate (tools/cloudboot/README.md documents the report grammar and failure classes). A pass is evidence-class=cloudboot-local-private-webui-evidence-fixture with an explicit provider-private-reachability=not-proven label: it proves only that a future successful run’s evidence will be parsed, ordered, and classified correctly, not that any provider-private probe has run.

Loopback-only checks (127.0.0.1, guest-local localhost, or an in-guest HTTP health request) are supplemental service-health evidence. They may help diagnose a failed run, but they do not close cloud-gce-private-self-hosted-webui-proof because they do not prove the provider NIC, VPC routing, private endpoint, or probe-to-VM packet path. Serial-only markers are likewise insufficient for the private Web UI proof unless the private probe also succeeds and the harness records the required provider-private fields.

The public ingress policy below remains a later authorization boundary. Closing the private proof does not permit a public IP, load balancer, DNS name, TLS certificate, Identity-Aware Proxy, operator browser exposure, or widened service account. Public browser-facing exposure must reference the private proof as an input and then satisfy the separate public-ingress policy and on-hold approval gate.


Public Web UI Ingress Policy (First Operator-Access Proof)

The cloudboot harness intentionally launches with no public IP, no service account, and no API scopes. Exposing the self-served capOS Web UI (remote-session-web-ui, see Remote Session CapSet Client Gate 1B) to an operator browser is therefore a separate, reviewed exposure decision, not a follow-on of the private reachability proof. This section is the selected policy that the first public-ingress behavior task (cloud-gce-public-self-hosted-webui-ingress-tls) builds against, decided by cloud-gce-public-webui-ingress-tls-policy-design.

Selected Ingress Shape: Provider-Terminated HTTPS Load Balancer

The first public proof uses a GCP external Application Load Balancer that terminates HTTPS at the Google front end. capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS virtual IP and hostname. TLS is terminated by Google’s front end against a managed certificate; capOS never holds the TLS private key and never parses hostile TLS bytes in this proof.

graph LR
    B[Operator browser] -- HTTPS --> LB[GCP external HTTPS<br/>Application Load Balancer<br/>Google-managed cert]
    LB -- HTTP, health-check-scoped firewall --> NEG[Zonal NEG / backend service]
    NEG --> VM[capOS VM<br/>remote-session-web-ui :8080<br/>plain HTTP/1.1, no public IP]
    style LB fill:#2d5,stroke:#333
    style VM fill:#2d5,stroke:#333

Why this shape is the first proof rather than direct capOS TLS termination:

  • No capOS TLS termination stack exists yet. The Phase-1 certificate verifier has landed, but the capability-native TLS termination model (TlsServerConfig, ACME issuance, OCSP stapling, and private-key custody) is not landed in Certificates and TLS, and the userspace L4 network stack has not yet completed full TcpSocket relocation. The ACME/Let’s Encrypt successor path is decomposed, but it still depends on minimal PrivateKey / KeyVault / KeySource custody, server-side TLS, the RFC 8555 client, the scoped http-01 solver, and CertificateStore.watch renewal. A direct external IP would put capOS’s nascent userspace HTTP parser at the first byte of hostile internet traffic with no TLS and no reviewed key custody.
  • Least privilege and reversibility. Provider-terminated TLS keeps the VM with no public IP, no inbound 0.0.0.0/0, and no private-key custody in either capOS or the harness. Teardown is the deletion of a bounded set of provider resources, not the rotation of an exposed key.
  • Clean successor path. When the capability-native TLS stack and an ACME flow ship, the direct-external-IP / capOS-terminated shape becomes available as a second, separately reviewed ingress. This proof does not foreclose it; it is the bootstrap step before it. The interim posture is recorded as “Bootstrap TLS for the First Public GCE Web UI” in Certificates and TLS, and the public GCE successor task is cloud-gce-public-webui-letsencrypt-direct-termination-proof. That successor requires a controlled public DNS name plus explicit billable/public-ingress authorization, and any Let’s Encrypt production call requires explicit CA authorization.

Raw public HTTP is not acceptable closeout evidence. If port 80 is published at all, it exists only as an HTTP-to-HTTPS 301 redirect at the load balancer and never reaches capOS. The closeout evidence must be the HTTPS path.

An optional hardening for the first proof is to enable Identity-Aware Proxy (IAP) on the backend service so the public door is gated by Google IAM before any request reaches the capOS backend. IAP here is not a separate ingress shape: it rides on the same external HTTPS load balancer and gates that backend service, so the ALB is still the only public entry point. IAP composes with, and does not replace, the capOS SessionManager/AuthorityBroker login boundary: IAP authenticates the human to Google; capOS still mints its own UserSession and projects only browser-safe view models. The browser never receives raw capOS caps.

Certificate and Key Custody

ConcernFirst proofSuccessor (deferred)
TLS terminatorGoogle front end (load balancer)capOS userspace TLS service
Certificate sourceGoogle-managed certificate (Certificate Manager or classic managed cert), or an operator-supplied cert resource on the load balancerACME (AcmeClient + http-01/tls-alpn-01 solver) from Certificates and TLS
Private-key custodyGoogle-held; never in capOS or the harnesscapOS PrivateKey cap sealed under a KeySource
Min TLS version / cipher policyLoad balancer SSL policy (TLS 1.2+ minimum; prefer the GCP MODERN/RESTRICTED profile)capOS CipherPolicy (modern)

The first proof must not write a private key into the disk image, the manifest, the cloudboot evidence directory, or any harness-staged object. A managed certificate keeps key material entirely on the provider side.

The successor must preserve the same no-export rule on the capOS side: the ACME account key and TLS private key remain behind PrivateKey / KeyVault authority and are not copied into cloudboot images, manifests, logs, or evidence directories. Local ACME proofs use a local directory; public GCE/Let’s Encrypt proofs require explicit run authorization, DNS-name control, public-ingress teardown evidence, and staging-vs-production CA labeling.

Browser Session and Origin Policy

The self-served Web UI keeps the Gate 1B boundary: remote-session-web-ui is the trusted backend that holds remote-session/CapSet state server-side, and browser JavaScript receives only browser-safe view models. Public exposure adds the following reviewed browser rules:

  • Single public origin. UI assets and the same-origin JSON API are served from the one HTTPS origin (the load balancer hostname). No second origin, no wildcard CORS, no cross-origin credentialed requests. The service-side policy is implemented in remote-session-web-ui as a boot-manifest input: one public_origin.<host> marker cap (an inert Endpoint, granted after the service caps) fixes the accepted https://<host> origin at boot, validated fail-closed (second marker, malformed, loopback-named, or IP-literal-shaped host, or any unrecognized extra grant fails the boot), and consulted by the Host/Origin/Referer gates only for requests on the trusted forwarded-scheme HTTPS path, so a direct client can never claim the public origin. Browser-supplied principal/source hint headers (IAP assertions, authenticated-user hints) are rejected on the public-origin path before any backend-held capability dispatch, no CORS headers are emitted, and login ingress extends to the recorded GFE ranges only when a public origin is configured. Locally proven by make run-cloud-prod-remote-session-web-ui-l4 (in-process trusted-forwarder fixture positive plus cross-origin, mixed-scheme, wildcard, missing-origin, hostile-Referer, principal-hint, and real-ingress direct-client forged negatives); the proof claims no DNS name, load balancer, TLS endpoint, or live public exposure.
  • Forwarded-scheme trust is firewall-bounded. Because the backend hop is plain HTTP, capOS derives the external scheme from the load balancer’s X-Forwarded-Proto/forwarding headers. It must trust those headers only from the Google front-end source ranges (enforced by the firewall below), and treat any such header from an unexpected source as absent (default to “not HTTPS”, fail closed on secure-context assumptions). The service-side trust gate is implemented in remote-session-web-ui (forwarded_scheme_peer_trusted / external_scheme_is_https, pinned to 130.211.0.0/22 and 35.191.0.0/16, fail-closed on unknown peer formats) and locally proven by make run-cloud-prod-remote-session-web-ui-l4: a real ingress client forging X-Forwarded-Proto: https keeps the non-Secure cookie posture, and a fixture simulating the recorded ranges is the only path that flips the session cookie to Secure. The local proof remains plaintext-loopback and claims no live load balancer or TLS endpoint.
  • Session cookies. The session cookie is Secure, HttpOnly, and SameSite. The SameSite value is picked deterministically rather than mid-slice: Strict when no IAP front door is used, and Lax when IAP is enabled (the IAP sign-in redirect is a cross-site top-level navigation that would drop a Strict cookie on return). Secure is honored because the browser only ever sees the cookie over the load balancer’s HTTPS origin. The switch is implemented in remote-session-web-ui as a boot-manifest policy input: an IAP-fronted deployment manifest grants the inert iap_fronted_ingress marker cap (last in the web-ui grant list) to select Lax; without it the service emits Strict, and SameSite=None is never emitted. The posture applies uniformly to the session, CSRF, and logout/expiry clear-cookie headers, stays independent of the forwarded-scheme-derived Secure attribute, and is fixed at boot so no request header, cookie, or body field can select the weaker branch. Because a Lax cookie attaches on cross-site top-level GET navigations, the Lax posture additionally rejects authenticated GET views whose Fetch Metadata provenance (Sec-Fetch-Site) is cross-site – and cookie-bearing GETs with no Fetch Metadata at all, covering legacy browsers and webviews that attach Lax cookies without stating provenance – before any session state is touched; the gate is inert under Strict, where the cookie never attaches cross-site. make run-cloud-prod-remote-session-web-ui-l4 proves the default Strict posture end to end (including a real-ingress login forging IAP-shaped headers and body fields) and the Lax branch through the service’s in-process policy fixture; the live IAP-fronted deployment is future work.
  • HSTS and redirect. The HTTPS edge sets Strict-Transport-Security with a conservative max-age (no preload, no includeSubDomains commitment for the first proof). Any port-80 listener is a 301 to HTTPS only.
  • CSRF. State-changing JSON routes require a per-session anti-CSRF token and an Origin/Referer check against the known public origin; cross-origin or origin-absent state changes are rejected.
  • Session lifetime and logout. Sessions carry a bounded idle timeout and an absolute lifetime. Logout drops the server-side session and clears the cookie; the existing self-served stale-session / logout failure-closed boundary (proven in the Gate 1B implementation gate) extends unchanged to the public endpoint. A stale or expired cookie yields no authority.

Firewall and Source-Range Policy

The instance keeps no public IP. Ingress to the capOS UI backend port is allowed only from Google’s load-balancer and health-check ranges, never from 0.0.0.0/0:

Allowed sourcePurpose
130.211.0.0/22, 35.191.0.0/16Google Front Ends and load-balancer health checks reaching the backend port
35.235.240.0/20Identity-Aware Proxy (only if IAP fronting or IAP-tunneled SSH/diagnostics is used)

No other ingress rule is created. The proof does not broaden the service account, add API scopes beyond the LB/health-check need, open SSH to the public internet, or attach a broad firewall tag. Egress stays default-deny-friendly: the LB-terminated path needs no capOS outbound, and the future ACME path (which would require egress 443 to the ACME directory) is explicitly out of scope here.

Backend Health-Check Contract (Local Proof Landed)

The backend port is reachable only from the GFE/health-check ranges above, so the load balancer’s health checker is the route’s only intended public caller. The backend health contract, proven locally by make run-cloud-prod-remote-session-web-ui-l4:

  • Route: GET /healthz on the Web UI backend port, served by demos/remote-session-web-ui (HEALTH_BODY). The exact bounded response body is {"ok":true,"service":"remote-session-web-ui"} with Content-Type: application/json and Cache-Control: no-store; it carries no cap ids, session ids, user/profile names, endpoint handles, provider resource ids, host paths, or secret material.
  • No authority: the route is unauthenticated and never creates, rotates, refreshes, or consumes a browser session; it never emits Set-Cookie, and a presented (even forged) session cookie changes nothing. The local proof drives a /healthz probe with live session cookies against an idle-expired session and asserts the next authenticated call still fails closed. It is the only unauthenticated public-ingress liveness exception; the Host/Origin/CSRF/session gates on authority-bearing routes are unchanged. (/api/health remains the bundled operator app’s same-origin page-load ping with the same no-authority posture; the provider health check never probes it.)
  • Host-gate exemption: the health checker probes the backend by IP, so /healthz deliberately does not require the loopback/public-host Host allowlist that authority-bearing routes enforce.
  • Fail-closed variants: non-GET methods and path variants (POST /healthz, /healthz/extra, /HEALTHZ) return 404 without reaching any authority-bearing handler.
  • Availability under abuse: the slow-client phases of the L4 smoke prove a concurrent /healthz keeps completing while idle, partial-request, and drip-feed clients are held open, and after they are abandoned.

This is local backend readiness for the selected policy (evidence-class=local-qemu), not a live GCE health check: no health-check resource, load balancer, firewall rule, or public endpoint exists, and a passing local contract proof authorizes none of them.

Audit and Evidence Fields

The public proof records, before teardown, at least:

  • selected ingress shape (https-load-balancer) and whether IAP was enabled;
  • public endpoint (hostname and HTTPS virtual IP);
  • TLS posture: terminator (google-frontend), certificate type (google-managed or operator-supplied), and the load balancer SSL-policy minimum TLS version;
  • authentication method exercised (capOS SessionManager login, and Google IAM identity if IAP is enabled);
  • firewall/forwarding scope: the named source ranges, backend port, and the URL-map/forwarding-rule chain created;
  • HTTP-to-HTTPS redirect and HSTS header observation;
  • teardown result for every resource the proof created.

Teardown Checklist

The existing harness deletes the instance, image, and staging tarball in an EXIT INT TERM trap. The public proof extends that trap to delete, in dependency order, every ingress resource it creates:

  • global forwarding rule and target HTTPS proxy;
  • URL map and any HTTP-to-HTTPS redirect URL map / target HTTP proxy;
  • backend service and health check;
  • zonal/serverless NEG or managed instance group backing the backend;
  • managed certificate / certificate-map entry / SSL policy created for the run;
  • the LB-scoped and (if used) IAP-scoped firewall rules;
  • the reserved external IP address, if one was allocated for the LB;
  • the instance, image, and staged tarball (existing harness behavior).

Teardown must be idempotent and must run on signal or partial failure, matching the existing orphan-sweep discipline. A run that cannot confirm deletion of an ingress resource is a failed run, not a passed one.

Local Plan Gate (Landed)

The resource graph above is locally reviewable before any billable work: tools/cloudboot/plan-public-webui-ingress.sh renders and validates the selected plan shape with zero provider interaction, and make cloudboot-public-webui-ingress-plan-check is the fixture gate proving each rejected hazard (raw public HTTP to capOS, instance public IP, 0.0.0.0/0 backend ingress, missing /healthz health check, broad service account/scopes, staged private-key material, non-provider certificate custody) fails closed by structured class before any provider CLI could be invoked. Output is stamped evidence-class=cloudboot-local-plan with operator-exposure=not-proven; a plan pass is not public reachability, TLS readiness, or authorization for the on-hold public proof. The command contract and failure classes are documented in tools/cloudboot/README.md (“Public Web UI ingress plan gate”).

Local Teardown Fixture Gate (Landed)

The teardown checklist above is locally proven before any billable work: tools/cloudboot/teardown-public-webui-ingress.sh is the dependency-ordered, idempotent, deletion-confirming teardown engine over a per-run created-resources journal, and make cloudboot-public-webui-teardown-fixture-check exercises it against recording stub provider CLIs across complete, partial-create, command-failure, delete-claims-success-but-persists, unreadable-state, signal-trap, and orphan-sweep paths. Every checklist resource class is modeled and the engine’s class list must equal the plan gate’s rendered teardown-order= line (the fixture fails on drift), so a class added to the selected plan cannot go missing from the cleanup graph. An unconfirmed deletion is a blocking structured failure (undeleted-<class> / resource-state-unknown), matching the failed-run policy above. All public-ingress resource names must carry the capos-test- sweepable marker; a journal naming anything else is rejected before any provider call, and the orphan sweep enforces the marker client-side so out-of-scope resources are never deleted. Output is stamped evidence-class=cloudboot-local-teardown-fixture live-teardown=not-proven; a fixture pass is local harness evidence only, never live provider teardown evidence, and authorizes no public ingress. The journal grammar, sweep contract, and failure classes are documented in tools/cloudboot/README.md (“Public Web UI ingress teardown fixture gate”).

Local Evidence Fixture Gate (Landed)

The “Audit and Evidence Fields” contract above is locally proven before any billable work: tools/cloudboot/validate-public-webui-evidence.sh validates a harness-rendered public-proof closeout report against the selected evidence grammar, and make cloudboot-public-webui-proof-evidence-fixture-check is the fixture gate proving accepted and rejected reports over stub inputs with zero provider CLI invocations. Acceptance requires the recorded ingress shape, public HTTPS hostname/VIP, provider TLS terminator and managed or operator-supplied certificate resource, minimum TLS policy, IAP posture, no-key-custody statement, no-public-IP instance posture, GFE/health-check firewall scope, health-check, HTTP-to-HTTPS redirect and HSTS observations, capOS SessionManager login observation, a public HTTPS probe record, the correlated gce-public-self-hosted-webui-ingress-tls proof marker, and a per-resource teardown record pinned to the plan gate’s teardown-order= class list (the fixture fails on drift). Raw public HTTP, a direct instance public IP, wildcard backend ingress, a missing health check, missing HSTS/redirect observation, capOS or harness private-key custody, stale/missing/incomplete teardown, a non-provider TLS terminator, and private-proof-only evidence (a same-VPC or provider-internal probe path, or a proof marker without a recorded HTTPS probe) each fail closed by structured class. The tls terminator= label structurally separates this provider-terminated evidence contract from the later capOS-terminated TLS successor, so successor evidence can never pass through the first-proof grammar. Output names field names, classes, and line numbers only; input values are never echoed. Every pass is stamped evidence-class=cloudboot-local-public-webui-evidence-fixture with operator-exposure=not-proven: a fixture pass is local evidence-grammar validation only, never public reachability or operator-access evidence, and it does not authorize public exposure or move the live proof out of cloud-gce-public-self-hosted-webui-ingress-tls. The report grammar and failure classes are documented in tools/cloudboot/README.md (“Public Web UI evidence-grammar fixture gate”).

Local Provider Command Allowlist Gate (Landed)

The provider command boundary the future public proof may use is locally proven before any billable work: tools/cloudboot/check-public-webui-provider-commands.sh validates a recorded provider-command transcript against the selected resource graph, and make cloudboot-public-webui-provider-command-allowlist-check is the fixture gate proving both directions over recording stub gcloud/gsutil with zero live provider invocations. The allowlist permits only the resource families the plan and teardown checklist name – forwarding rules, target HTTPS/HTTP proxies, URL maps, backend services, health checks, zonal NEGs, scoped firewall rules, managed-certificate resources, SSL policies, reserved addresses, instance/image creation, and staged tarball upload/delete – and requires the capos-test- marker on every created resource, journal-pinned deletion (a delete must name a resource the created-resources journal recorded), GFE/IAP-only firewall source ranges, the capos-test filter on every listing, marker discipline on create-wired references, per-surface create flags and parameters pinned to the selected graph shape, an explicit pin of the documented sandbox project on every command, and explicit --global/--zone scope on deletes (ambient Cloud SDK project/region defaults are never trusted). Drift toward broader provider authority fails closed by structured class: IAM mutation, service-account/scopes changes, DNS mutation, private-key upload, 0.0.0.0/0 backend ingress, unmarked resources, deletion outside the journal (zone-pinned), project-wide or filter-restating sweeps, ambient credential flags, project/network/region scope overrides beyond the pinned sandbox forms, --flags-file indirection, non-selected create parameters, shell/environment inspection, and provider CLI resolution from an unexpected path. Rejected command content is reported by class and line number only; credentials, principals, key paths, and rejected names are never echoed. Output is stamped evidence-class=cloudboot-local-provider-command-allowlist with provider-mutation=none: a pass narrows what the future live proof may execute, it is not live provider evidence and does not authorize the on-hold public proof. The transcript grammar and failure classes are documented in tools/cloudboot/README.md (“Public Web UI provider-command allowlist gate”).


Phase 2: ACPI and Device Discovery

Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.

Why ACPI

On QEMU with default settings, you can hardcode PCI config space at 0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:

  • PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
  • Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
  • CPU topology comes from ACPI MADT (LAPIC entries)
  • Timer info comes from ACPI HPET/PMTIMER tables

Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.

Required Tables

TablePurposePriority
MADTLAPIC and I/O APIC addresses, CPU enumerationHigh (Phase 2)
MCFGPCIe Enhanced Configuration Access Mechanism baseHigh (Phase 2)
HPETHigh Precision Event Timer addressMedium (fallback timer)
FADTPM timer, shutdown/reset methodsLow (future)

Landed Discovery Slice

The first landed slices are bounded diagnostics plus reusable config access. The ACPI parser requests Limine’s RSDP, validates RSDP/RSDT/XSDT/static-table lengths and checksums within fixed caps, emits serial summaries for RSDT/XSDT table count and MADT/MCFG presence, reports MADT LAPIC/I/O APIC/interrupt-source-override inputs, and reports MCFG ECAM allocation records when firmware provides the table. The PCI layer now keeps the existing legacy I/O-port backend and adds an ECAM backend selected from MCFG allocations; devices retain their discovery backend so config reads, writes, capability walking, and BAR sizing use the same access path. The PCI layer also exposes a shared memory-BAR subregion validator/mapper, and the virtio-net transport uses it for modern capability regions. It also reports MSI/MSI-X capability metadata for the virtio-net function and uses kernel-owned config/RX/TX source records with a bounded first-fit LAPIC device MSI vector pool plus lock-free dispatch slots for QEMU virtio-net MSI-X table programming, virtio vector assignment, driver-owned route unmask, claimed-route lifecycle/reassignment proof, and TX delivery proof. The x86 setup maps MADT I/O APICs and programs masked legacy IRQ routes from MADT source overrides before higher-level drivers can depend on interrupt routing. The Q35 smoke asserts both the ECAM inventory lines, a pci: config backend=ecam enumerated ... proof line, and representative masked I/O APIC route lines; the net smoke asserts virtio-net BAR, capability, MSI-X metadata, source-route records, route unmask records, vector programming, queue assignment, descriptor guards, ARP, and ICMP fixture lines before MMIO transport mapping completes. This path does not interpret AML, provide userspace driver authorities, or provide full unbounded bus discovery yet.

Implementation

#![allow(unused)]
fn main() {
// kernel/src/acpi.rs

/// Minimal ACPI table parser.
/// Walks RSDP -> XSDT -> individual tables.
/// Does NOT implement AML interpretation -- static tables only.

pub struct AcpiInfo {
    pub lapics: Vec<LapicEntry>,
    pub io_apics: Vec<IoApicEntry>,
    pub iso_overrides: Vec<InterruptSourceOverride>,
    pub mcfg_base: Option<u64>,  // PCIe ECAM base address
    pub hpet_base: Option<u64>,
}

pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }
}

For the fuller static-table subsystem, prefer the acpi crate (or an equivalent maintained no_std parser) rather than expanding the diagnostic parser into a general hand-written ACPI stack. The landed parser is a boot-time inventory proof for RSDP/RSDT/MADT/MCFG summaries; it can be retired or narrowed once the crate-backed table model fits capOS mapping and table lifetime constraints.

Limine RSDP

#![allow(unused)]
fn main() {
use limine::request::RsdpRequest;

static RSDP: RsdpRequest = RsdpRequest::new();

// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);
}

Crate Dependencies

CratePurposeno_std
acpiPlanned fuller/static ACPI table parsing (MADT, MCFG, HPET, FADT, etc.)yes

Scope

The landed diagnostic slice is kernel-local bounded read-only parsing for serial inventory. Fuller handling should be mostly glue around a maintained static-table parser plus capOS mapping, lifetime, and authority types.


Phase 3: Interrupt Infrastructure

Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.

I/O APIC

The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPIC entries (CPUs). Its address and configuration come from the ACPI MADT (Phase 2).

#![allow(unused)]
fn main() {
// kernel/src/arch/x86_64/ioapic.rs

pub struct IoApic {
    base: *mut u32,  // MMIO registers via HHDM
}

impl IoApic {
    /// Route an IRQ to a specific LAPIC/vector.
    pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }

    /// Mask/unmask an IRQ line.
    pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}
}

The current x86 implementation maps MADT I/O APIC MMIO, reads each controller’s ID/version/redirection count, and programs legacy IRQ 0-15 routes to LAPIC vectors while keeping the redirection entries masked. It respects Interrupt Source Override entries from MADT (for example, Q35 remaps IRQ 0 to GSI 2). Driver-owned unmask policy, dispatch, and EOI handling remain planned.

MSI/MSI-X

Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. MSI/MSI-X writes directly to the LAPIC’s interrupt command register, bypassing the I/O APIC entirely.

This is critical for cloud deployment because:

  • NVMe controllers require MSI or MSI-X (no legacy IRQ fallback on many controllers)
  • Cloud NICs (ENA, gVNIC) use MSI-X exclusively
  • MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability
#![allow(unused)]
fn main() {
// kernel/src/pci/msi.rs

/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }

/// Configure MSI-X for a PCI device.
pub fn enable_msix(
    device: &PciDevice,
    table_bar: u8,
    entries: &[(u16, u8, u8)],  // (index, vector, lapic_id)
) { ... }
}

MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal). The current PCI path reports MSI/MSI-X capability metadata for virtio-net so diagnostics can see the advertised table and pending-bit-array layout. The virtio-net QEMU smoke now records kernel-owned config/RX/TX MSI-X sources, publishes them into the device interrupt dispatch table, allocates LAPIC vectors from the bounded device MSI vector pool to program their table entries and virtio vector registers, lets the in-kernel virtio-net owner unmask only those routes, then proves TX delivery by observing that source’s dispatch counter advance after maskable interrupts are live. The same smoke uses an unused masked MSI-X table entry to prove claimed-route reassignment, stale old-route rejection, old-vector unregistered delivery, reassigned-vector masked delivery, unsupported-vector delivery, and release. Broader driver dispatch and userspace interrupt authority remain planned.

Integration with SMP

LAPIC initialization is shared with the SMP proposal. The active x86 path uses xAPIC MMIO for the immediate QEMU/KVM timer and IPI foundation, with PIT/PIC fallback. This cloud phase consumes that architectural LAPIC path for local interrupt delivery and now adds masked ACPI MADT I/O APIC routing plus MSI/MSI-X capability metadata discovery and a bounded virtio-net MSI-X dispatch/lifecycle proof; userspace device interrupts remain planned.

KVM/QEMU paravirtual features such as PV EOI, PV IPI, and PV TLB flush are host-specific accelerations. They are useful later for cloud performance, but cloud boot correctness should use the architectural LAPIC path first. x2APIC is a later backend for newer/high-core systems and firmware states where xAPIC is unavailable or undesirable; it is not a blocker for the current LAPIC path.

Scope

~300-400 lines total:

  • I/O APIC driver: ~150 lines
  • MSI/MSI-X setup: ~100-150 lines
  • Integration/routing logic: ~50-100 lines

Phase 4: PCI/PCIe Infrastructure

Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).

The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.

PCI Configuration Access

Two mechanisms, determined by ACPI:

  1. Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
  2. PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for MSI-X capability parsing and NVMe BAR discovery on real hardware.

Legacy I/O and Q35 ECAM config access exist today behind the same early PCI backend abstraction. The PCI layer also validates memory BAR subregions with checked offset/length/alignment bounds and maps selected subregions through the kernel MMIO window for in-kernel drivers, and it records non-programming MSI/MSI-X metadata for the current virtio-net path by walking the standard PCI capability list. The virtio-net path now selects a usable MSI-X capability and programs config/RX/TX table entries through the typed PCI MSI-X table helper using the kernel-owned source records and bounded first-fit LAPIC device MSI vectors. The QEMU net smoke lets the in-kernel virtio-net owner claim and unmask those routes, assigns the virtio common and queue MSI-X vector registers, and proves TX delivery by observing that source’s dispatch counter advance after the TX completion path has run and maskable interrupts are live. It also proves claimed-route reassignment and release with an unused masked MSI-X table entry. The next steps are using that path for full bus discovery, userspace DeviceMmio authority, broader driver dispatch, and driver binding.

Device Enumeration

#![allow(unused)]
fn main() {
// kernel/src/pci.rs

pub struct PciDevice {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    pub vendor_id: u16,
    pub device_id: u16,
    pub class: u8,
    pub subclass: u8,
    pub bars: [Option<Bar>; 6],
    pub interrupt_pin: u8,
    pub interrupt_line: u8,
}

pub enum Bar {
    Memory {
        base: u64,
        size: u64,
        prefetchable: bool,
        width: MemoryBarWidth,
    },
    Io { base: u32, size: u32 },
}

/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }

/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }

/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }
}

BAR Mapping

Device drivers need MMIO access to BAR regions. The kernel now maps validated memory-BAR subregions into its bounded MMIO virtual window for in-kernel drivers. A future DeviceMmio capability will carry equivalent authority to userspace drivers as described in the networking proposal.

PCI Device IDs for Cloud Hardware

DeviceVendor:DeviceCloud
virtio-net1AF4:1000 (transitional) or 1AF4:1041 (modern)QEMU, supported first/second-generation GCP machine families
virtio-blk1AF4:1001 (transitional) or 1AF4:1042 (modern)QEMU
NVMe8086:various, 144D:various, etc.All clouds (EBS, PD, Managed Disk)
AWS ENA1D0F:EC20 / 1D0F:EC21AWS
GCP gVNIC1AE0:0042GCP
Azure MANA1414:00BAAzure

Scope

~400-500 lines:

  • Config space access (I/O + ECAM): ~100 lines
  • Bus enumeration: ~150 lines
  • BAR parsing and mapping: ~100 lines
  • Capability list walking: ~50-100 lines

Phase 5: NVMe Driver

Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.

Why NVMe Over virtio-blk

The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:

  • AWS EBS – NVMe interface (even for gp3/io2 volumes)
  • GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
  • Azure Managed Disks – SCSI on many older VM families such as D/Ev5 or Fv2 and older; NVMe on Azure Boost and newer NVMe-capable families such as Ebsv5 and Da/Ea/Fav6 and newer

virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all cloud platforms where the selected VM shape exposes NVMe. For QEMU testing, QEMU also emulates NVMe well: -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.

NVMe Architecture

NVMe is a register-level standard with well-defined queue-pair semantics:

Application
    |
    v
Submission Queue (SQ) -- ring buffer of 64-byte command entries
    |
    | doorbell write (MMIO)
    v
NVMe Controller (hardware)
    |
    | DMA completion
    v
Completion Queue (CQ) -- ring buffer of 16-byte completion entries
    |
    | MSI-X interrupt
    v
Driver processes completions

Minimum viable driver needs:

  1. Admin Queue Pair (for identify, create I/O queues)
  2. One I/O Queue Pair (for read/write commands)
  3. MSI-X for completion notification (or polling)

Implementation Sketch

#![allow(unused)]
fn main() {
// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)

pub struct NvmeController {
    bar0: *mut u8,          // MMIO registers
    admin_sq: SubmissionQueue,
    admin_cq: CompletionQueue,
    io_sq: SubmissionQueue,
    io_cq: CompletionQueue,
    namespace_id: u32,
    block_size: u32,
    block_count: u64,
}

impl NvmeController {
    pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
    pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
    pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
    pub fn identify(&self) -> NvmeIdentify { ... }
}
}

DMA Considerations

NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:

  • Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
  • Physical addresses must be provided (not virtual)
  • Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)

The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.

Crate Dependencies

CratePurposeno_std
(none)NVMe register-level protocol is simple enough to implement directlyN/A

The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.

Integration with Storage Proposal

The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as the backing device. This can be generalized to a BlockDevice trait:

#![allow(unused)]
fn main() {
trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}
}

Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.

Scope

~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).


Phase 6: Cloud NIC Strategy

Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.

The Landscape

CloudPrimary NICVirtio NIC available?Open-source driver?
GCPgVNIC (1AE0:0042)Yes on supported first/second-generation machine familiesYes (Linux, ~3000 LoC)
AWSENA (1D0F:EC20)No (Nitro only)Yes (Linux, ~8000 LoC)
AzureMANA (1414:00BA)No (accelerated networking)Yes (Linux, ~6000 LoC)

Short term: constrained virtio-net on GCP

GCP can expose VIRTIO_NET on supported first/second-generation machine families. After the shared image, ACPI/PCIe, interrupt, DMA/MMIO, and virtio foundation exists, that gives a constrained early cloud-network proof without writing a provider-specific NIC driver. It is not the general GCP target: third-generation-and-later machine families, Tau T2A, Confidential VM, and some higher-bandwidth paths require gVNIC.

gcloud compute instances create capos-test \
    --image=capos \
    --machine-type=e2-micro \
    --network-interface=nic-type=VIRTIO_NET

Medium term: gVNIC driver

gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.

gVNIC is worth prioritizing because:

  • GCP’s constrained virtio-net path can de-risk cloud networking before a provider-specific NIC driver exists
  • Graduating from virtio-net to gVNIC on the same cloud is the required path for newer, Tau T2A, Confidential VM, and higher-bandwidth GCP instances
  • The gVNIC register interface is documented in the Linux driver source

Long term: ENA and MANA

ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).

At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.

Alternative: Paravirt Abstraction Layer

Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:

Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]

Where [driver] is one of:

  • virtio-net (QEMU, supported first/second-generation GCP machine families)
  • gvnic (GCP)
  • ena (AWS)
  • mana (Azure)

All drivers implement the same Nic capability interface from the networking proposal. The network stack and applications are driver-agnostic.

This is already the architecture described in the networking proposal. The only addition is recognizing that multiple driver implementations will exist behind the same Nic interface.


Phase Summary and Dependencies

graph TD
    P1[Phase 1: Disk Image + Serial Diagnostics] --> BOOT[Boots on Cloud VM]
    P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
    P2 --> P4[Phase 4: PCI/PCIe]
    P3 --> P5[Phase 5: NVMe Driver]
    P4 --> P5
    P4 --> NET[Networking Smoke Test<br>virtio-net driver]
    P3 --> NET
    P4 --> P6[Phase 6: Cloud NIC Drivers]
    P3 --> P6
    NET --> P6

    S5[Stage 5: Scheduling] --> P3
    SMP_C[SMP Phase C: LAPIC timer/IPI] --> P3

    style P1 fill:#2d5,stroke:#333
    style BOOT fill:#2d5,stroke:#333
PhaseDepends onEstimated scopeEnables
1: Disk image + diagnosticsNothingimage tooling plus bounded diagnostics modeCloud serial boot
2: ACPINothing (kernel code)~200-300 linesPhases 3, 4
3: InterruptsPhase 2, LAPIC (SMP Phase C)~300-400 linesNVMe, cloud NICs
4: PCI/PCIePhase 2~400-500 linesAll device drivers
5: NVMePhases 3, 4~500-700 linesCloud storage
6: Cloud NICsPhases 3, 4, networking smoke test~800-1200 lines eachCloud networking

Minimum Path to “Boots on Cloud VM, Prints Hello”

Raw serial output and UEFI boot support already exist, so the smallest “prints hello” experiment is mostly Phase 1 image packaging plus any boot-path adjustments needed to reach the same COM1 output from an imported disk image. That experiment is a precursor, not the full Phase 1 closeout.

Phase 1 closeout also includes a bounded serial diagnostics prompt so cloud driver bring-up can inspect CPU, memory, ACPI, PCI, IRQ, timer, device, and log state before cloud NICs or storage drivers are reliable. That diagnostics surface is kernel/userspace behavior, not just build-system work.

Minimum Path to “Useful on Cloud VM”

Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). On a supported first/second-generation GCP machine family, networking can use the existing virtio-net proposal without a provider-specific gVNIC/ENA/MANA driver on that constrained target.


QEMU Testing

All phases can be tested in QEMU before deploying to cloud:

PhaseQEMU flags
Disk image-drive file=capos.img,format=raw -bios OVMF.4m.fd
ACPIDefault QEMU provides ACPI tables (MADT, MCFG, etc.)
I/O APICDefault QEMU emulates I/O APIC
PCI/PCIe-device ... adds PCI devices; QEMU has PCIe root complex
NVMe-drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0
MSI-XSupported by QEMU’s NVMe and virtio-net-pci emulation; current net smoke asserts metadata selection, kernel-owned source-route records, route unmask, vector programming, virtio queue assignment, descriptor guards, ARP, and ICMP fixture evidence. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates.
Multi-CPU-smp 4 (already works with Limine SMP)
x2APIC backendfuture explicit QEMU CPU feature such as -cpu qemu64,+smep,+smap,+rdrand,+x2apic

aarch64 and ARM Cloud Instances

This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:

CloudARM offeringInstance types
AWSGraviton2/3/4m7g, c7g, r7g, etc.
GCPTau T2A (Ampere Altra)t2a-standard-*
AzureCobalt 100 (Arm Neoverse)Dpsv6, Dplsv6

ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:

  • Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
  • Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
  • Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
  • Serial: PL011 UART instead of 16550 COM1. Different register interface.
  • NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
  • NVMe: Same NVMe register interface – PCIe is architecture-neutral.

The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).

The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:

  1. Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
  2. The acpi crate handles both x86_64 and aarch64 ACPI tables
  3. Limine already targets aarch64
  4. AWS Graviton instances are often cheaper than x86_64 equivalents

The main aarch64 kernel work is: exception handling (EL0/EL1 instead of Ring 0/3), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).


Open Questions

  1. ACPI scope. The landed diagnostic parser covers bounded read-only RSDP/RSDT/MADT/MCFG summaries only. The acpi crate can parse fuller static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially.

  2. PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.

  3. NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.

  4. Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.

  5. GCP vs AWS as first cloud target. The first cloud proof should be imported-image serial-console boot on both providers when practical, because that validates image format, firmware, bootloader, and early ACPI without depending on cloud NICs. For the later usable-networked-instance milestone, a constrained first/second-generation GCP virtio-net target is the easiest first network proof; broader GCP coverage needs gVNIC, and AWS follows once the NVMe/ENA path or an explicit workaround is ready.


References

Specifications

Crates

  • acpi – no_std ACPI table parser
  • virtio-drivers – no_std virtio (already in networking proposal)

Prior Art

Cloud Documentation

  • docs/design-risks-register.md – R13 (trusted build inputs are partly pinned) consolidates the long-horizon supply-chain risk view that gates cloud-image release paths; this proposal is recorded as a secondary owner.
  • docs/trusted-build-inputs.md – the actual inventory of pinned and observed-not-pinned build inputs, dependency policy, vendored upstream snapshots, and the build-provenance retention/comparison policy that cloud proofs must satisfy before they are cited as production evidence.
  • docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md – the completed GCP-first usable-instance provider rollup covering provider NIC/storage authority, DMA backend selection, cloud teardown, and serial-console operator access.
  • docs/dma-isolation-design.md – DMA isolation backend selection (kernel-owned bounce buffers vs IOMMU/remapping) that cloud provider drivers must commit to before claiming usable-instance status.
  • docs/backlog/hardware-boot-storage.md – DDF Tasks 5 (userspace driver authority) and 6 (recurring cloud-portability gate) referenced from Phase 1 closeout above.

Proposal: Live Upgrade

Replacing a running service with a new binary, without dropping outstanding capability references or losing in-flight work. The kernel-side primitive (CapRetarget) is owned by this proposal; the surrounding orchestration (supervisors, manifest sources, fault containment) is owned by service-architecture-proposal.md and consumes the primitive defined here.

Problem

In a Linux-like system, “upgrading a service” is one of:

  • Restart: stop the old process, start the new one. Clients holding file descriptors, sockets, or pipes to the old process receive ECONNRESET or EPIPE and must reconnect. Session state is lost unless clients serialize it themselves.
  • Graceful restart (nginx -s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions.
  • Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.

None of these compose with a capability OS. A CapId held by a client points at a specific process; if that process exits, the cap is dead. There is no “the service” abstraction the kernel could re-bind — the point of capabilities is that they identify a specific reference, not a name that could be redirected after the fact.

But capOS has a kernel-side primitive the Linux model lacks: the kernel already owns the authoritative table of every CapId and which process serves it. Rewriting “cap X is served by process v1” → “cap X is served by process v2” is a table update. The question is when it is safe, and how v2 inherits enough state to answer the next call.

Three Cases

Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.

Case 1: Stateless services

Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.

Upgrade is trivial: start v2, retarget every CapId from v1 to v2, exit v1. Clients may observe a small latency spike; no DISCONNECTED CQE fires. Only the kernel primitive is needed.

Case 2: State externalized into other caps

The service’s in-memory data is a cache or dispatch table; durable state lives behind caps the service holds (Store, SessionMap, Namespace). v1’s held caps are passed to v2 at spawn time (via the supervisor, per the manifest), kernel retargets client caps, v1 exits.

Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.

Case 3: Stateful services requiring migration

The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.

capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.

The contract extends the service’s capnp interface:

interface Upgradable {
    # Called on v1 by the supervisor. Returns a snapshot of service
    # state and stops accepting new calls. Calls already in flight
    # complete before the snapshot returns.
    quiesce @0 () -> (state :Data);

    # Called on v2 after spawn. Loads state from the snapshot. After
    # this returns, v2 is ready to serve calls.
    resume @1 (state :Data) -> ();
}

The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.

Kernel Primitive: CapRetarget

The kernel exposes the retarget as a capability method, not a syscall:

interface ProcessControl {
    # Atomically redirect every CapId currently served by `old` to
    # be served by `new`. Requires: `new` implements a schema
    # superset of `old` (schema-id compatibility), `new` is Ready,
    # `old` is Quiesced (graceful) or the caller has permission to
    # force.
    retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
                     mode :RetargetMode) -> ();
}

enum RetargetMode {
    graceful @0;  # old must be Quiesced; in-flight calls drain on old
    force    @1;  # caps redirect immediately; in-flight calls fail
}

Only a process holding a ProcessControl cap to both processes can perform this — typically the supervisor that spawned them. The kernel never initiates upgrades.

Atomicity is per-CapId. From a client’s perspective, the retarget is a single point in time: a CALL SQE submitted before retarget goes to v1; a CALL SQE submitted after goes to v2. A CALL already dispatched to v1 either completes there (graceful) or returns a DISCONNECTED CQE (force).

Supervisor-Level Upgrade Protocol

The primitives above compose into a protocol the supervisor runs:

1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
   Case 3:     state = v1.quiesce()
               v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()

If any step fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.

In-Flight Calls

The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:

  • Graceful mode. v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
  • Force mode. The in-flight CALL returns DISCONNECTED. Client retries against v2. Appropriate when v1 is wedged and quiesce won’t return.

In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.

Relationship to Fault Containment

Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:

  • Fault containment: v1 has crashed; kernel has already marked it dead and epoch-bumped its caps. Supervisor spawns v2, issues a graceful retarget (no quiesce — v1 is gone; in-flight CALLs already delivered DISCONNECTED). Clients reconnect to v2.
  • Live upgrade: v1 is healthy; supervisor initiates quiesce → state transfer → retarget, and no CQE ever reports DISCONNECTED to any caller.

The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.

Security and Trust

Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:

  • Only a holder of ProcessControl caps to both old and new can call retargetCaps. By construction this is the supervisor that spawned them.
  • The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
  • Schema compatibility (new is a superset of old) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.

Non-Goals

  • Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
  • Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
  • Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
  • System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.

Phased Implementation

  1. CapRetarget primitive. Kernel operation + ProcessControl cap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance).
  2. Upgradable interface. Schema, contract documentation, and a Rust helper in capos-rt that services derive.
  3. Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
  4. Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.
  • Erlang/OTP code_change/3 is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process.
  • Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
  • nginx -s reload is graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”
  • service-architecture-proposal.md — owns the supervisor surface that drives this proposal’s protocol. The “Supervisors” and “Supervision Tree” sections describe the principal that holds ProcessControl caps to both old and new and runs spawn → quiesceresumeretargetCaps → drain → exit. The “Service Taxonomy” entry Upgrade manager is the per-system orchestrator that consumes CapRetarget for live replacement, distinct from a per-subtree supervisor that uses the same primitive for fault containment (respawn after crash). Schema compatibility for new vs old is the same superset check the manifest executor and the boot package contract already require, not a new policy invented here.
  • cloud-deployment-proposal.md — owns the binary delivery story this proposal depends on. new must be obtained from the same content-addressed boot package / image-update pipeline the cloud deployment plan describes, not from an ad-hoc path. Cloud-managed services (KMS clients, metadata agents, log/metric shippers, the cloud-metadata agent itself) are exactly the Case 2 / Case 3 services where this proposal’s value shows up first: they hold long-lived caps to upstream cloud APIs, and a restart that drops those caps either re-runs IAM/JWT handshakes or, worse, drops audit/log shippers’ in-flight buffers. The bootable disk image / NVMe path defines what “update the binary” means on real hardware; until then the manifest-embedded BootPackage blobs are the only source of new.
  • storage-and-naming-proposal.md — owns the Case 2 holders (Store, SessionMap, Namespace) the idiomatic service factoring relies on, and the future sealed/stored capability path that lets state survive across reboot, not just across live upgrade. Case 3 state-transfer is the strictly weaker contract: same capnp wire format, but the snapshot only has to outlive a single retargetCaps call, not power loss.
  • system-monitoring-proposal.mdquiesce start, resume completion, retargetCaps mode (graceful vs force), drain duration, and rollback (kill new, resume old) are audit-worthy lifecycle events. The upgrade manager emits them through the audit cap so an operator can correlate a service binary change with downstream behavior. Graceful upgrades by definition emit zero DISCONNECTED CQEs; force-mode and fault-containment respawns do, and that distinction is what the audit record has to preserve.
  • security-and-verification-proposal.mdretargetCaps is a natural target for bounded modeling: per-CapId atomicity (no SQE submitted before retarget lands on new; no SQE submitted after lands on old), graceful-mode in-flight completion (old’s ring drains before exit), and schema-superset enforcement at the kernel before retarget. Force-mode DISCONNECTED delivery is the same epoch-revocation path the fault-containment story already needs, not a separate kernel surface.
  • ../design-risks-register.md — the register currently carries no dedicated R-entry for live upgrade, which is intentional: no implementation exists yet. The closest cross-cutting entries are R6 (CAP_OP_RELEASE is deferred), because graceful drain has to outlive the per-process release path before v1.exit() is safe; R12 (verification coverage is partial), because the per-CapId retarget atomicity and graceful-drain invariants belong in a bounded model before this lands; and Q7 (revocation strategy), because force-mode retarget shares the epoch path the open revocation decision will pick. Open a dedicated R-entry once CapRetarget lands in code, since at that point retarget atomicity, graceful-drain shutdown, and the supervisor-only authority constraint become long-horizon design surfaces in their own right.

Proposal: Capability-Oriented GPU/CUDA Integration

Purpose

Define a minimal, capability-safe path to integrate GPU-class accelerators (NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace driver service that is invoked through capability calls and that holds device-scoped bootstrap grants for its single managed device.

This proposal is a downstream consumer of:

  • LLM and agent proposal – defines the LanguageModel/Embedder/ImageModel capability surface that benefits from GPU-backed inference backends. The agent runtime treats a GPU-backed model process as just another LanguageModel capability holder; the GPU service proposed here is one of the substrate choices the model process may use.
  • Userspace binaries proposal – defines the native Rust over capos-rt userspace runtime, the x86_64-unknown-capos target, and the libcapos C-substrate path that any vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU service runs as one such userspace binary, not as a kernel module.

Positioning Against Current Project State

capOS currently provides infrastructure that is directly load-bearing for a future GPU service:

  • Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
  • A global and per-process capability table with CapObject dispatch.
  • Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes. cap_enter syscall for ordinary CALL dispatch and completion waits.
  • PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus reusable memory-BAR subregion validation and kernel MMIO mapping helpers for diagnostics and driver bring-up.
  • MSI/MSI-X capability metadata discovery and typed MSI-X table programming, proven end-to-end through the virtio-net make run-net smoke.
  • I/O APIC routing for masked legacy IRQ programming via MADT.
  • Kernel-owned device interrupt source records plus a bounded first-fit device MSI vector pool with lock-free dispatch slots and claimed-route reassignment/release.
  • Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and descriptor submission/completion counts for the current virtio-net path.
  • Bootstrap-grant authority hooks for DeviceMmio, DMAPool, Interrupt, and HardwareAuditLog capabilities, exercised by the make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.

What does not exist yet and gates real GPU work:

  • A userspace driver-authority gate. Today the kernel still owns virtio-net, the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant smokes prove the schema and grant plumbing for the typed device caps, but there is no userspace driver process that consumes those grants to run a real driver. GPU integration cannot land before that gate moves.
  • IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver is constrained by IOMMU domains, no production GPU stack can be granted bus-master DMA on a multi-tenant host.
  • A LanguageModel capability surface to consume the GPU service. The LLM proposal defines the schema target; the GPU service is one backend choice.

That means GPU integration must be staged. The early phases are capability schema and mock-service exercises that ride on the existing DDF bootstrap grants; real hardware backends arrive after the userspace-driver authority gate, IOMMU integration, and at least one consuming model surface exist.

Design Principles

  • Keep policy in kernel, execution in userspace. The kernel arbitrates device claims, MMIO mapping, MSI-X table programming, and DMA-pool accounting; the driver service implements vendor-specific command submission and queue management.
  • Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see only GpuSession/GpuBuffer/GpuFence capabilities, never DeviceMmio or Interrupt.
  • Make GPU access explicit through narrow capabilities. The interface is the permission; a client that should not launch kernels is given a session type that does not expose launchKernel.
  • Treat every stateful resource (session, buffer, queue, fence, command pool) as a capability with revocability and bounded lifetime.
  • Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code runs in the userspace driver service, linked through libcapos / libcapos-posix shims where vendor headers expect a POSIX-ish surface.
  • Charge GPU memory and submission depth through the existing ResourceLedger mechanism rather than inventing a parallel accounting surface.

Proposed Architecture

capOS kernel (minimal) exposes only resource and mediation capabilities.

gpu-device service (userspace) receives device-specific bootstrap grants (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) for exactly one GPU function and exposes a stable GPU capability surface to clients.

application (e.g. an LLM model server, a numeric workload, a robot brain inference loop) receives only GpuSession/GpuBuffer/GpuFence capabilities and never sees the device-scoped grants.

Kernel responsibilities

  • Discover GPUs from PCI/ACPI layers (already implemented for non-GPU functions; GPUs are the same discovery path with different class codes).
  • Map/register BAR windows and grant a scoped DeviceMmio capability bound to one decoded memory BAR.
  • Set up MSI/MSI-X routing and expose scoped Interrupt capability per vector with masked-route lifecycle semantics matching the current virtio-net proof.
  • Hand out a bounded DMAPool capability whose accounting ledger charges back to the driver process’s resource ledger and that participates in IOMMU-domain constraints once those exist.
  • Enforce revocation when sessions are closed: DeviceMmio/Interrupt/ DMAPool grants tear down through the bootstrap-grant manager.
  • Record device-manager actions through HardwareAuditLog snapshots (already proven for the DDF smokes).
  • Handle all faulting paths that would otherwise crash the kernel: a buggy driver service must crash the service, not the kernel.

Userspace GPU service responsibilities

  • Open and initialize one GPU device from its device-scoped bootstrap grants. One driver process per GPU function is the working assumption; multi-function boards may run one process per function.
  • Allocate and track GPU contexts, command queues, and DMA buffers backed by the granted DMAPool.
  • Implement command submission, buffer lifecycle, fence/completion signaling, and timeout enforcement.
  • Translate capability calls into vendor SDK operations (CUDA driver API, ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a WebGPU/wgpu-style abstraction).
  • Expose only narrow, capability-typed handles to callers and refuse any attempt to surface raw MMIO/IRQ/DMA to clients.

Consumer surfaces

Capability Contract (schema additions)

Add to schema/capos.capnp (interface-level sketch; final wire layout is fixed in the implementation slice):

  • GpuDeviceManager
    • listDevices() -> (devices: List(GpuDeviceInfo))
    • openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
  • GpuSession
    • createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
    • destroyBuffer(buffer :UInt32) -> ()
    • launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
    • submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
    • submitFenceWait(fence :UInt32) -> ()
  • GpuBuffer
    • mapReadWrite() -> (addr :UInt64, len :UInt64)
    • unmap() -> ()
    • size() -> (bytes :UInt64)
    • close() -> ()
  • GpuFence
    • poll() -> (status :Text)
    • wait(timeoutNanos :UInt64) -> (ok :Bool)
    • close() -> ()

Sessions are the natural restriction point: a model-server session granted to an LLM process can omit launchKernel entirely and expose only memcpy plus an opaque runProgram(programCap, ...) if the model image is itself a separately-vetted capability. The interface is the permission; do not add parallel rights bitmasks.

Implementation Phases

Phase 0 (prerequisite, landed): kernel capability ring and DDF grants

The Cap’n Proto schema, capability ring, cap_enter dispatch, PCI/MSI-X discovery, and the DeviceMmio/DMAPool/Interrupt/HardwareAuditLog bootstrap-grant smokes already exist. No new kernel surface is required for this phase; the schema additions for Gpu* are pure userspace work once a driver service is permitted.

Phase 1: Userspace driver-authority gate (cross-track prerequisite)

GPU work cannot land before the userspace driver-authority gate. Required pieces, tracked by the device-manager refactor and DMA-isolation design:

  • Move virtio-net or another known-good driver out of the kernel and into a userspace driver process consuming the DDF bootstrap grants end-to-end.
  • Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA granted to a driver process is constrained to its registered DMA pages.
  • Add a device-manager userspace service that owns ManagerGrantSource-class capabilities and is the only process that hands DeviceMmio/DMAPool/Interrupt/HardwareAuditLog grants to driver services.

This phase is owned by the device-manager and DMA-isolation tracks; the GPU proposal consumes it.

Phase 2: Mock GPU service

  • Add the Gpu* schema in schema/capos.capnp.
  • Implement a gpu-mock userspace service with the full Gpu* interface, no real driver, and synthetic fences and buffers backed by ordinary anonymous memory.
  • Prove end-to-end:
    • device-manager spawns the mock driver and grants it a fake-device bootstrap grant set.
    • a client process opens a session, allocates and maps a buffer, submits a synthetic job, and waits on a fence.
  • Add a focused QEMU smoke (make run-gpu-mock) that asserts the round-trip and demonstrates revocation on session close.

Phase 3: Real backend integration on one vendor

  • Pick one concrete GPU backend available in CI environment (likely NVIDIA on a workstation host with -device vfio-pci passthrough into QEMU, or a virtio-gpu / venus virtualized path as a first stand-in).
  • Vendor SDK code lives in the userspace driver process. Where the SDK expects a POSIX-ish surface, route it through libcapos-posix rather than expanding the kernel.
  • Add queue lifecycle, fence lifecycle, DMA registration/validation, command execution path, interrupt completion plumbing back to clients through fences.
  • Keep backend replacement possible via a trait-like abstraction inside the driver process so a second vendor backend (AMD ROCm, Intel oneAPI) can be added later without rewriting the service.

Phase 4: Security and reliability hardening

  • Per-session limits for mapped pages, in-flight submissions, and queue depth, charged through ResourceLedger.
  • Bounded wait timeouts and explicit fence cancellation semantics so a hung GPU does not pin a client’s cap_enter.
  • Revocation propagation:
    • GpuSession close => all child GpuBuffer/GpuFence caps revoked.
    • driver crash / device reset => all active caps fail closed with a typed exception.
  • Audit hooks for launchKernel/submitMemcpy recorded through HardwareAuditLog-style snapshots scoped to the GPU service.
  • Coordination with the live-upgrade proposal so the GPU driver service can be replaced without dropping client GpuSession caps.

Phase 5: Multi-tenant and multi-device

  • Multiple driver processes (one per GPU function) under a single device-manager.
  • Cross-device buffer sharing only through explicit capability transfer; no implicit peer mappings.
  • Workload isolation: distinct tenants on a single GPU receive distinct sessions with their own queue, memory budget, and audit stream.

Security Model

The kernel does not grant any user process direct MMIO, MSI, or bus-master DMA access. All such authority is mediated through the device-manager.

Application processes only receive:

  • GpuSession / GpuBuffer / GpuFence capabilities with the methods the session policy chose to expose.

The GPU driver service process receives:

  • DeviceMmio bound to the function’s decoded BARs.
  • Interrupt capabilities for the function’s claimed MSI vectors.
  • DMAPool bounded to the function’s IOMMU domain.
  • HardwareAuditLog for snapshotting device-manager actions.

This ensures:

  • No userland process can program BAR registers.
  • No userland process can claim untrusted memory for DMA.
  • No userland process can observe or reset another session’s state.
  • A buggy or compromised driver crashes the driver process, not the kernel; the device-manager observes the crash, fails outstanding capabilities closed, and re-spawns the driver on the next session request.

Dependencies and Alignment

This proposal depends on:

  • Device-manager refactor proposal for the userspace device-manager that owns the bootstrap-grant sources.
  • DMA-isolation design and IOMMU integration so DMA grants are enforceable in a multi-tenant context.
  • Userspace-binaries proposal for the driver-process runtime, libcapos / libcapos-posix surface for vendor SDK consumption, and the x86_64-unknown-capos target.
  • LLM and agent proposal for the primary consumer surface (LanguageModel, Embedder) and the agent runtime that exercises GPU-backed inference end-to-end.
  • Resource-accounting proposal for per-session memory and submission budgets.
  • Live-upgrade proposal for driver-service replacement without dropping GpuSession capabilities.

It complements:

  • Service-architecture and authority-broker proposals.
  • Storage/service manifest execution flow for shipping GPU service binaries and their bootstrap grants.
  • In-process threading work for future queue completion callbacks and worker pools inside the driver service.

Minimal acceptance criteria

  • make run-gpu-mock boots and prints GPU service lifecycle messages.
  • The device-manager spawns the GPU service and grants only device-scoped bootstrap grants for a single mock function.
  • A sample userspace client (Rust over capos-rt; C smoke later through libcapos) can create a session, allocate and map a GPU buffer, submit a synthetic job, and wait on a fence with a typed completion result.
  • Attempts to submit unsupported or malformed operations return explicit capnp CapException results, not driver crashes.
  • Removing the session capability invalidates descendant buffer and fence caps without kernel restart.
  • A subsequent slice points an LLM model server at the GPU service and proves a LanguageModel.generate(...) round-trip backed by the GPU session, satisfying the LLM proposal’s GPU-backend integration point.

Risks

  • Real NVIDIA closed stack integration may require vendor-specific adaptation that is hostile to a capability shim; the AMD ROCm or vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
  • Buffer mapping semantics become complex with paging, fragmentation, and IOMMU domains. Pinned physical-memory-only buffers are the conservative starting point.
  • Interrupt-heavy completion paths require the scheduler evolution work (per-CPU run queues, fairness) before client-visible completion guarantees scale beyond a single workload.
  • Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface has to grow enough to host them without leaking ambient authority.
  • A GPU driver process is privileged from the application’s point of view. Compromise of a single driver process must remain bounded to one GPU function and one tenant set; the device-manager and IOMMU are the load-bearing controls there.

Open Questions

  • Is CUDA mandatory from first integration, or is the initial surface command-focused (opaque “program” bytes interpreted by the driver) with CUDA runtime-specific support added later?
  • Should memory registration support pinned physical memory only at first, or attempt to expose unified-virtual-memory semantics through the client’s VirtualMemory capability?
  • Which isolation level is needed for multi-tenant versus single-tenant in the first real-backend phase? Single-tenant per GPU function is the conservative default; MIG / SR-IOV-style partitioning is later work.
  • Does the GPU service expose model artifacts (weights, programs) as separate capability types so a model file can be granted to clients without the full session, or are programs always inline arguments?

Proposal: capOS As A Robot Brain

How capOS should grow into a capability-oriented robot brain for manufacturing robots, mobile robots, RC cars, drones, and autonomous-vehicle research without collapsing safety, realtime, perception, planning, and operator control into one trusted process.

Purpose

capOS has the right architectural ingredients for robotics: isolated processes, explicit capabilities, typed IPC, revocation, memory objects, service composition, audit direction, and future scheduling contexts. Robotics is a useful forcing function because it combines physical authority with mixed-criticality timing:

  • a camera pipeline can drop frames;
  • a local planner can miss a cycle and recover;
  • a wheel command must expire safely;
  • a robot arm must obey limits;
  • an e-stop must not depend on a model, network, shell, or log service.

The proposal is not “run every control loop in the kernel.” It is a staged robotics architecture where capOS owns authority routing, service isolation, telemetry, update, planning, and eventually admitted realtime islands, while the tightest safety loops remain on certified controllers or MCUs until capOS has evidence to replace them.

Goals

  • Define a capability-native robot service graph.
  • Separate safety, realtime control, perception, planning, operator UI, simulation, manufacturing integration, and agents.
  • Make actuator authority explicit, revocable, logged, and bounded by mode, safety state, command freshness, and limits.
  • Support compatibility bridges for ROS 2, micro-ROS, MAVLink, OPC UA, and simulation tooling without turning them into ambient authority tunnels.
  • Provide a path from simulation to small physical robots before industrial or vehicle safety claims.
  • Reuse MemoryObject rings, notification/futex paths, and future scheduling contexts for sensor streams and control loops.

Non-Goals

  • Replacing certified safety PLCs, flight controllers, servo drives, or vehicle safety controllers in the near term.
  • Claiming IEC 61508, ISO 13849, ISO 10218, or ISO 26262 compliance.
  • Putting model inference or natural-language agents in direct control of actuators.
  • Making ROS 2 an ambient compatibility layer with implicit access to every capOS service.
  • Copying large sensor frames through Cap’n Proto payloads in the data path.

Architecture

flowchart LR
    Operator[Operator UI / shell / teleop] --> Mission[Mission and behavior]
    Agent[Agent runner] --> Mission
    Mission --> Planner[Planner]
    Planner --> Controller[Realtime controller island]
    Controller --> Actuator[Actuator gateway]
    Actuator --> Hardware[MCU / PLC / drive / autopilot]

    SensorHW[Camera / lidar / IMU / encoders] --> SensorSvc[Sensor services]
    SensorSvc --> Perception[Perception]
    Perception --> World[World model]
    World --> Planner

    Safety[Safety monitor] --> Mission
    Safety --> Controller
    Safety --> Actuator

    Bridges[ROS 2 / MAVLink / OPC UA bridges] --> Mission
    Bridges --> SensorSvc
    Bridges --> Actuator

    Audit[Audit and telemetry] --- Mission
    Audit --- Controller
    Audit --- Actuator

Principal split:

  • Sensor services own device-facing capture authority and publish typed streams or snapshots.
  • Perception consumes sensor streams and emits world-model updates.
  • Mission and behavior chooses tasks, modes, and goals.
  • Planner computes paths, trajectories, or setpoints within policy.
  • Realtime controller island turns admitted inputs into cyclic commands.
  • Actuator gateway is the only holder of hardware command authority.
  • Safety monitor observes independent safety state and can force stop, neutral, disarm, or mode degradation.
  • Agent runner may propose or explain actions but does not hold actuator caps.
  • Compatibility bridges receive narrow imported/exported caps.

Core Rule

No process gets both broad interpretation authority and raw physical authority.

Examples:

  • A language model may emit a structured proposal; it does not receive ActuatorCommand.
  • A ROS bridge may publish odometry and accept a velocity command cap; it does not receive the whole capOS service graph.
  • A planner may receive a goal and produce a trajectory; it does not directly program motor registers.
  • An actuator gateway may command hardware; it does not fetch network content or run operator scripts.

Robot Capabilities

The first schema should stay small and control-plane oriented. Bulk sensor data uses MemoryObject rings.

interface RobotDescription {
  describe @0 () -> (description :RobotDescriptionSnapshot);
  readFrameTree @1 () -> (frames :FrameTreeSnapshot);
}

interface SensorStream {
  describe @0 () -> (info :SensorInfo);
  openRing @1 (config :StreamConfig) -> (ring :MemoryObject);
  readStatus @2 () -> (status :StreamStatus);
}

interface ActuatorCommand {
  describe @0 () -> (info :ActuatorInfo);
  submit @1 (frame :CommandFrame) -> (accepted :Bool);
  neutral @2 (reason :Text) -> ();
}

interface SafetyState {
  read @0 () -> (state :SafetySnapshot);
  subscribe @1 () -> (events :SensorStream);
}

interface ControlLoop {
  describe @0 () -> (info :LoopInfo);
  start @1 () -> ();
  stop @2 (reason :Text) -> ();
  readTelemetry @3 () -> (telemetry :LoopTelemetry);
}

CommandFrame must carry:

  • sequence number;
  • monotonic timestamp;
  • deadline;
  • command mode;
  • coordinate frame;
  • limit profile;
  • typed payload;
  • source identity;
  • optional safety-envelope revision.

Command freshness is mandatory. If the frame is stale, the actuator gateway rejects it or transitions to neutral/safe state according to policy.

Data Plane

Cap’n Proto is the control plane. Sensor and actuator streams need fixed-layout shared rings:

sequence
capture_time_ns
deadline_ns
frame_id
format
offset
length
flags
source_epoch

The ring can carry camera frames, lidar scans, IMU batches, encoder samples, audio-like streams, or command telemetry. Payload bytes live in MemoryObject backing storage. Producers and consumers coordinate through notification or futex-like wakeups. Slow consumers drop or skip according to policy; they do not backpressure a guaranteed control island.

Realtime Islands

The robot-control equivalent of the media graph’s guaranteed realtime island is an admitted control loop:

flowchart LR
    Sense[read sensors] --> Snapshot[input snapshot]
    Snapshot --> Update[controller update]
    Update --> Clamp[limit and safety clamp]
    Clamp --> Write[write actuator command]
    Write --> Telemetry[non-RT telemetry export]

Admission requires:

  • fixed period and deadline;
  • scheduling context with budget;
  • preallocated input, output, and telemetry buffers;
  • no allocation in the cycle;
  • no blocking endpoint calls in the cycle;
  • no credential checks, logging, service discovery, or model inference;
  • bounded data-age policy;
  • command-limit and clamp policy;
  • stale-command watchdog;
  • overrun behavior.

Failure behavior is part of the contract. An overrun, stale input, revoked cap, or failed write should produce a deterministic result: hold, neutral, stop, drop, degrade mode, or fault the island. It should not build an unbounded queue of late commands.

Compatibility Bridges

ROS 2 Bridge

The ROS 2 bridge should map selected topics, services, and actions to capOS capabilities. It must be configured from a manifest or broker policy:

  • which ROS topics can be imported;
  • which capOS sensor streams can be exported;
  • which commands can reach an actuator gateway;
  • freshness and rate limits;
  • whether messages are best-effort, reliable, latched, or deadline-bound;
  • how frames and transforms are mapped.

The bridge is not a general “ROS graph has all caps” adapter.

micro-ROS / MCU Bridge

For small robots, the MCU bridge is the first practical hardware path:

  • MCU closes motor PID, bumper debounce, watchdog, and current limits;
  • capOS sends bounded velocity/setpoint frames;
  • MCU publishes encoder, IMU, battery, bumper, and fault streams;
  • stale capOS commands force neutral behavior.

For drones and some rovers:

  • autopilot owns arming, stabilization, failsafe, and flight termination;
  • capOS consumes telemetry and sends high-level setpoints or missions;
  • bridge enforces geofence, mode, rate, and authority limits;
  • direct actuator override is absent or privileged behind stronger policy.

OPC UA / Manufacturing Bridge

For industrial cells:

  • OPC UA gateway imports cell, robot, fixture, and job state;
  • capOS exposes typed job/status/alarm caps;
  • robot program selection and start/stop are separate authorities;
  • safety state is read independently and cannot be overridden by job logic.

Product-Level Targets

Simulation Robot

The first milestone should be visible without hardware: boot capOS, launch a simulated differential-drive robot, publish fake lidar/odometry, run a behavior service, send bounded drive commands, and log telemetry. This proves the capability graph and stale-command behavior.

Vacuum / Indoor Mobile Robot

Next target: capOS on an SBC with an MCU base controller.

  • capOS runs mapping, local planning, cleaning behavior, docking, UI, and logs.
  • MCU runs wheel control, bumper/cliff protection, and motor watchdog.
  • BaseDrive accepts velocity commands with deadlines.
  • Loss of capOS or command authority stops motion.

RC Car / Rover

RC-car class demo:

  • camera/IMU/GPS sensor services;
  • teleop and autonomous mode caps;
  • steering/throttle gateway with watchdog;
  • geofence and speed envelope;
  • logs for every actuator-affecting command.

Manufacturing Cell Supervisor

Industrial demo:

  • OPC UA or mock PLC gateway;
  • robot program selection as a typed capability;
  • cell-state and alarm streams;
  • operator approval for mutating actions;
  • no attempt to replace certified safety functions.

Autonomous Vehicle Research Host

Autoware-like demo:

  • perception, localization, planning, control, and vehicle-interface services;
  • simulator or closed-course interface;
  • independent safety gateway;
  • command envelopes and audit.

This remains a research host, not a road-certified system.

Security Invariants

  • Actuator gateways are narrow and mode-limited.
  • Safety monitor authority is independent from planner and agent authority.
  • Model processes never receive actuator, safety, or raw device caps.
  • Operator UI receives consent and status caps, not raw hardware caps.
  • Bridges do not receive ambient service discovery authority.
  • Every actuator-affecting command is auditable by source, mode, limits, safety-state revision, timestamp, and result.
  • Revoking command authority causes stale handles and future commands to fail closed.
  • Device-facing services obey the DeviceMmio, DMAPool, and Interrupt authority model before userspace drivers touch physical hardware.

Scheduling Dependencies

This proposal depends on future scheduling work:

  • per-thread rings for full-SMP ownership;
  • notification objects for low-overhead wakeups;
  • scheduling contexts with period/budget/priority;
  • CPU affinity and isolation for admitted loops;
  • TLB shootdown and SMP-safe address-space migration;
  • timing telemetry and overrun events;
  • eventually WCET evidence for hard-realtime claims.

Until those exist, docs and demos must say “bounded soft realtime” or “supervised external controller”, not “hard realtime.”

Implementation Sequence

  1. Add simulation-only robot services and typed fake sensor/actuator caps.
  2. Add RobotDescription, SensorStream, ActuatorCommand, SafetyState, and ControlLoop draft schemas.
  3. Add a QEMU/host smoke that proves stale drive commands fail closed.
  4. Add a differential-drive MCU bridge design and host-side simulator.
  5. Add ROS 2 bridge proposal detail for selected topics/actions and transforms.
  6. Add control-loop telemetry counters: period, execution time, overrun, data age, command age, clamp, neutral, and safety fault.
  7. Bind a local controller to scheduling contexts once the scheduler supports budgeted realtime islands.
  8. Add manufacturing gateway design over OPC UA or a mock PLC protocol.
  9. Add hardware-in-loop criteria before any real actuator demo is treated as a milestone.

Open Questions

  • Should the first visible milestone be simulation-only or a small physical differential-drive base?
  • Should robot schemas live in schema/capos.capnp or a separate robotics schema compiled by the same build pipeline?
  • Which transform-tree representation fits capOS best: immutable snapshots, streaming deltas, or both?
  • How should command envelopes compose when operator, planner, safety monitor, and actuator gateway all impose limits?
  • What is the minimum useful ROS 2 bridge: topics only, or topics plus actions for Nav2-style navigation?
  • Does SensorStream generalize the media-ring design, or should robotics get a distinct stream ABI?

References

Proposal: Formal MAC/MIC Model and Proof Track

How capOS could move from pragmatic label checks to a formal mandatory access control and mandatory integrity control story suitable for a GOST-style claim.

Problem

Adding a label field to capabilities is not enough to claim formal mandatory access control. ГОСТ Р 59453.1-2021 frames access control through a formal model of an abstract automaton: the model describes states, subjects, objects, containers, rights, accesses, information flows, safety conditions, and proofs that unsafe accesses or flows cannot arise.

capOS should therefore separate two levels:

  • Pragmatic label policy. Userspace brokers and wrapper capabilities enforce labels at trusted grant paths and selected method calls. The user/session side of this level is tracked in User Identity and Policy; this proposal does not redefine the broker, session, or local-account surface, only the formal model that would sit underneath it.
  • Formal MAC/MIC. A documented abstract state machine, safety predicates, transition rules, proof obligations, and an implementation mapping. Only this second level can support a GOST-style claim. The verification tooling budget (TLA+/Alloy/Kani/Loom/Prusti/Creusot tracks) is owned by Security and Verification; this proposal feeds new obligations into that plan, it does not duplicate the tier definitions.

This proposal defines the path to the second level. It is not a claim that capOS currently satisfies it. The Design Risks and Open Questions entry Q13 – Formal properties to prove treats the current bounded-proof set (cap-table non-forgery, frame-bitmap invariants, transfer rollback, ring producer-consumer invariants) as the baseline that this proposal extends toward an abstract automaton – it is not a step toward seL4-style full functional refinement.

Scope

The first formal target should be narrow:

Confidentiality:
  No transition creates an unauthorized information flow from an object at a
  higher or incomparable confidentiality label to an object at a lower label,
  except through an explicit trusted declassifier transition.

Integrity:
  No low-integrity or incomparable subject can control a higher-integrity
  subject, and no low-integrity subject can write or transfer influence into a
  higher-integrity object, except through an explicit trusted upgrader or
  sanitizer transition.

The proof should cover capability authority creation and transfer before it covers every device, filesystem, or POSIX compatibility corner. For capOS, capability transfer is the dangerous boundary.

Terminology

The Russian GOST terms to keep straight:

  • мандатное управление доступом: mandatory access control for confidentiality.
  • мандатный контроль целостности: mandatory integrity control.
  • целостность: integrity.
  • уровень целостности: integrity level.
  • уровень конфиденциальности: confidentiality level.
  • субъект доступа: access subject.
  • объект доступа: access object.

The standards separate confidentiality MAC from integrity control. capOS should not merge them into one vague label field.

Abstract State

The formal model should be intentionally smaller than the implementation. It models only the security-relevant state.

SymbolMeaning
Uset of user accounts / principals
Sset of subjects: processes, sessions, services
Oset of objects: files, namespaces, endpoints, process handles, secrets
Cset of containers: namespaces, directories, stores, service subtrees
Eentities = O union C
Kkernel object identities
Capcapability handles / hold edges
Holdrelation S -> E with metadata
Ownsubject-control or ownership relation
Ctrlsubject-control relation
Flowobserved information-flow relation
Rightsabstract rights: read, write, execute, own, control, transfer
Accessrealized accesses: read, write, call, return, spawn, supervise

Hold is central. In capOS, authority is represented by capability table entries and transfer records, not by global paths. A formal model that does not model capability hold edges will miss the main authority channel.

Suggested hold-edge metadata:

HoldEdge {
  subject
  entity
  interface_id
  badge
  transfer_mode
  origin
  confidentiality_label
  integrity_label
}

Label Lattices

Use deployment-defined partial orders, not hardcoded government categories.

Example confidentiality lattice:

public < internal < confidential < secret
compartments = {project-a, project-b, ops, crypto}

dominates(a, b) means:

level(a) >= level(b)
and compartments(a) includes compartments(b)

Integrity should be separate:

untrusted < user < service < trusted
domains = {boot, storage, network, auth}

The model must specify how labels compose across containers:

  • contained entity confidentiality cannot exceed what the container policy permits unless the container explicitly supports mixed labels;
  • contained entity integrity cannot exceed the container’s integrity policy;
  • a subject-associated object such as a process ring, endpoint queue, or process handle needs labels derived from the subject it controls or exposes.

Capability Method Flow Classes

capOS cannot rely on syscall names such as read and write. Each interface method needs a flow class.

Initial categories:

ReadLike       data flows object -> subject
WriteLike      data flows subject -> object
Bidirectional  data flows both ways
ControlLike    subject controls another subject/object lifecycle
TransferLike   authority or future data path is transferred
ObserveLike    metadata/log/status observation
Declassify     trusted downgrade of confidentiality
Sanitize       trusted upgrade of integrity after validation
NoFlow         lifecycle release or local bookkeeping only

Examples:

File.read                 ReadLike
File.write                WriteLike
Namespace.bind            WriteLike + ControlLike
LogReader.read            ReadLike
ManifestUpdater.apply     WriteLike + ControlLike
ProcessSpawner.spawn      ControlLike + TransferLike
ProcessHandle.wait        ObserveLike
ServiceSupervisor.restart ControlLike
Endpoint.call             depends on endpoint declaration
Endpoint.return           depends on endpoint declaration
CAP_OP_RELEASE            NoFlow
CAP_OP_CALL transfers     TransferLike
CAP_OP_RETURN transfers   TransferLike

The flow table is part of the trusted model. Adding a new capability method without classifying its flow should fail review.

Transitions

The abstract automaton should include at least these transitions:

create_session(principal, profile)
spawn(parent, child, grants)
copy_cap(sender, receiver, cap)
move_cap(sender, receiver, cap)
insert_result_cap(sender, receiver, cap)
call(subject, endpoint, payload)
return(server, client, result, result_caps)
read(subject, object)
write(subject, object)
bind(subject, namespace, name, object)
supervise(controller, target, operation)
release(subject, cap)
revoke(authority, object)
declassify(trusted_subject, source, target)
sanitize(trusted_subject, source, target)
relabel(trusted_subject, object, new_label)

Each transition needs preconditions and effects. Example:

copy_cap(sender, receiver, cap):
  pre:
    Hold(sender, cap.entity)
    cap.transfer_mode allows copy
    confidentiality_flow_allowed(cap.entity, receiver)
    integrity_flow_allowed(sender, cap.entity, receiver)
    receiver quota has free cap slot
  effect:
    Hold(receiver, cap.entity) is added
    Flow(cap.entity, receiver, transfer) is recorded when relevant

Move is not a shortcut. It has different authority effects but can still create an information/control flow into the receiver.

Safety Predicates

Confidentiality:

read_allowed(s, e):
  clearance(s) dominates classification(e)

write_allowed(s, e):
  classification(e) dominates current_confidentiality(s)

flow_allowed(src, dst):
  classification(dst) dominates classification(src)

No write down follows from classification(dst) dominates classification(src).

Integrity:

integrity_write_allowed(s, e):
  integrity(s) >= integrity(e)

control_allowed(controller, target):
  integrity(controller) >= integrity(target)

integrity_flow_allowed(src, dst):
  integrity(src) >= integrity(dst)

The exact inequality direction must be validated against the chosen integrity semantics. The intent is that low-integrity subjects cannot modify or control high-integrity subjects or objects.

Subject control:

supervise_allowed(controller, target):
  confidentiality/control labels are compatible
  and integrity(controller) >= integrity(target)
  and Hold(controller, ServiceSupervisor(target)) exists

Authority graph:

all live authority is represented by Hold
every Hold edge has a live cap table slot or trusted kernel root
no transition creates Hold without passing transfer/spawn/broker preconditions

Proof Shape

The proof is an invariant proof over the abstract automaton:

Base:
  initial_state satisfies Safety

Step:
  for every transition T:
    if Safety(state) and Precondition(T, state),
    then Safety(apply(T, state))

The transition proof must explicitly cover:

  • spawn grants,
  • copy transfer,
  • move transfer,
  • result-cap insertion,
  • endpoint call and return,
  • namespace bind,
  • supervisor operations,
  • declassification,
  • sanitization,
  • relabel,
  • revocation and release preserving consistency.

The proof must also state what it does not cover:

  • physical side channels,
  • timing channels not modeled by Flow,
  • bugs below the abstraction boundary,
  • device DMA until DMAPool/IOMMU boundaries are modeled,
  • persistence/replay until persistent object identity is modeled.

Tooling Plan

Start with lightweight formal tools, then deepen only if the model stabilizes.

TLA+

Best first tool for capOS because capability transfer, spawn, endpoint delivery, and revocation are state transitions. Use TLA+ to model:

  • sets of subjects, objects, labels, and hold edges,
  • bounded transfer/spawn/call transitions,
  • invariants for confidentiality, integrity, and hold-edge consistency.

TLC can find counterexamples early. Apalache is worth evaluating later for symbolic checking if TLC state explosion becomes painful.

Alloy

Useful for relational counterexample search:

  • label lattice dominance,
  • container hierarchy invariants,
  • hold-edge graph consistency,
  • “can a path of transfers create forbidden flow?” queries.

Alloy complements TLA+; it does not replace transition modeling.

Coq, Isabelle, or Lean

Only after the model stops moving. These tools are appropriate for a durable machine-checked proof artifact. They are expensive if the policy surface is still changing.

Kani / Prusti / Creusot

Use these for implementation-level Rust obligations after the abstract model exists:

  • cap table generation/index invariants,
  • transfer transaction rollback,
  • label dominance helper correctness,
  • quota reservation/release balance,
  • wrapper cap narrowing properties.

They do not replace the abstract automaton proof.

ITU-T Z-series specification languages

ITU-T publishes a family of formal specification languages for protocols and behavioural systems. They are complements to TLA+/Alloy, not replacements; each targets a different part of the specification-to-code pipeline.

  • Z.100 SDLSpecification and Description Language. State machines with structured data, signals, and composition. SDL models communicating extended finite-state machines, which is a natural fit for the capability ring protocol, endpoint call/return, and supervisor quiesce/resume state. SDL-RT (SDL real-time) adds timers explicitly, which matters for cap_enter wait/timeout semantics.
  • Z.120 MSCMessage Sequence Charts. A UML-sequence-diagram predecessor with formal semantics (ITU-T Z.120 Annex B). MSC is useful for documenting what a correct capability-transfer sequence looks like — CALL issuing hold edge, server RECV, server RETURN with result caps, caller CQE — in a form that can be model-checked against the SDL state machine. tools/ccs-style session dumps already produce sequence-shaped records; converting a subset to MSC form would let invariants be checked as sequence-diagram properties (e.g. “no RETURN without a matching CALL hold edge”).
  • Z.151 URNUser Requirements Notation (Goal-oriented Requirement Language + Use Case Maps). Worth tracking for later capOS security-requirement traceability — linking proof obligations to threat-model goals — but overkill for the first formal artifact.

Relative to the TLA+/Alloy track:

ConcernTool in capOS
Global state transitions, invariantsTLA+ (primary)
Relational graph queries (hold edges, dominance)Alloy
Per-service protocol state machinesZ.100 SDL (optional)
Canonical call/return sequencesZ.120 MSC (optional)
Durable machine-checked proofCoq / Isabelle / Lean (later)
Implementation-level Rust obligationsKani / Prusti / Creusot

SDL/MSC should be considered for the protocol layer (capability transfer sequences, endpoint handshakes, supervisor lifecycle) where TLA+ specifications tend to become cluttered with message-passing boilerplate. They should not replace the abstract automaton that covers hold-edge safety invariants — that work stays in TLA+/Alloy.

Other ITU-T security frameworks

Relevant security frameworks from the X-series that this proposal cross-references rather than re-derives:

  • X.800 / X.805Security architecture for Open Systems Interconnection and Security architecture for systems providing end-to-end communications. Taxonomy of security services (authentication, access control, data confidentiality, data integrity, non-repudiation, availability, privacy) × layers. Used in security-and-verification-proposal.md as a completeness checklist.
  • X.810 — Overview of security frameworks.
  • X.811 — Authentication framework.
  • X.812 — Access control framework. Referenced from user-identity-and-policy-proposal.md for ADF/AEF decomposition.
  • X.813 — Non-repudiation framework. Relevant for signed audit records and signed manifest updates.
  • X.814 — Confidentiality framework.
  • X.815 — Integrity framework. Directly relevant to the MIC half of this proposal; X.815 terminology on integrity “verification”, “recovery”, and “protection” clarifies which obligations apply at which boundary.
  • X.816 — Security audit and alarms framework. The monitoring proposal adopts its audit taxonomy.

Implementation Mapping

The proof track must produce implementation obligations that code review and tests can check.

Required implementation hooks:

  • every kernel object that participates in policy has stable ObjectId;
  • every labeled object has MandatoryLabel;
  • every hold edge or capability entry records enough label metadata for transfer checks;
  • every capability method has a flow class;
  • every transfer path calls one shared label/flow checker;
  • every spawn grant uses the same checker as transfer;
  • every endpoint has declared flow policy;
  • every declassifier/sanitizer is an explicit capability and audited;
  • every relabel operation is explicit and audited;
  • every wrapper cap preserves or narrows authority and labels;
  • process exit and release remove hold edges without leaving ghost authority.

The current pragmatic userspace broker model is allowed as an earlier stage, but the implementation mapping must identify where it is bypassable. Any path that lets untrusted code transfer labeled authority without the broker must move into the kernel-visible checked path before a formal MAC/MIC claim.

Testing and Review Gates

Before implementing kernel-visible labels:

  • write the TLA+ or Alloy model;
  • include at least one counterexample-driven test showing a rejected unsafe transfer in the model;
  • document every transition that is intentionally out of scope.

Before claiming pragmatic MAC/MIC:

  • broker and wrapper caps enforce labels at grant paths;
  • audit records every grant, denial, and relabel/declassify operation;
  • QEMU demo shows a denied high-to-low transfer and a permitted trusted declassification.

Before claiming GOST-style MAC/MIC:

  • abstract automaton is written;
  • safety predicates are explicit;
  • all modeled transitions preserve safety;
  • implementation obligations are mapped to code paths;
  • transfer/spawn/result-cap insertion cannot bypass label checks;
  • limitations and non-modeled channels are documented.

Integration With Existing Plans

This proposal depends on:

  • authority graph and resource accounting (Authority Accounting);
  • user/session policy services (User Identity and Policy); the pragmatic broker, session metadata, local-account, and stale-cap enforcement work lives there. The formal model in this file treats the pragmatic level as the implementation surface that any abstract subject / hold-edge transition must be mapped back onto;
  • capability transfer and result-cap insertion (Capability Model);
  • DMA isolation before user drivers become part of the labeled model (DMA Isolation);
  • security verification tooling (Security and Verification); the TLA+/Alloy/Kani/Loom/Prusti/Creusot tier descriptions and obligation budget belong there. New obligations introduced here (label dominance helpers, transfer-time flow checks, declassifier/sanitizer audit) feed into that proposal’s tier tables rather than redefining them in this file;
  • the consolidated design-risks-register entry Q13 – Formal properties to prove (Design Risks and Open Questions) tracks this proposal as the route from the current bounded-proof baseline to a documented abstract automaton; R14 – User identity / policy is proposal-shaped records why the pragmatic level still cannot make a GOST-style claim today.

Consumers that carry additional proof obligations onto this track:

  • OIDC/OAuth2 federated authentication and token-typed capabilities (OIDC and OAuth2). That proposal enumerates a 10-item proof-obligation checklist and a tool-assignment table (TLA+/Alloy/SDL/MSC/Kani/Prusti) for OIDC-specific transitions. The obligations are additive to the ones here: they extend flow classes onto token caps, add session- creation and broker-outbound MAC/MIC predicates, and model verify_id_token as a trusted total function.

Non-Goals

  • No certification claim.
  • No claim that current capOS implements GOST-style MAC/MIC.
  • No attempt to model all side channels in the first version.
  • No kernel policy language interpreter.
  • No POSIX uid/gid authorization.
  • No label field without transition rules and proof obligations.

Open Questions

  • What is the smallest useful label lattice for the first demo?
  • Should labels live on objects, hold edges, or both?
  • Should endpoint flow policy be static per endpoint, per method, or per transferred cap?
  • How should declassifier and sanitizer capabilities be scoped and audited?
  • Which channels must be modeled as memory flows versus time flows?
  • Is TLA+ sufficient for the first formal artifact, or should the relational parts start in Alloy?
  • Which parts of ГОСТ Р 59453.1-2021 should be treated as direct goals versus inspiration for a capOS-native formal model?
  • How should OIDC/OAuth2 federation fit the first formal artifact? The proof-obligation checklist in OIDC and OAuth2 is already sized for the same TLA+/Alloy/SDL/MSC/Kani tool assignment used here, but the first MAC/MIC model may be cleaner if it lands before federated subjects are added. Decide whether OIDC joins the initial TLA+ module or follows as a second artifact that extends the subject-creation transition.

References

  • ГОСТ Р 59383-2021, access-control foundations: https://lepton.ru/GOST/Data/752/75200.pdf
  • ГОСТ Р 59453.1-2021, formal access-control model: https://meganorm.ru/Data/750/75046.pdf
  • ITU-T Rec. X.800 (03/91) — Security architecture for OSI.
  • ITU-T Rec. X.805 (10/03) — Security architecture for systems providing end-to-end communications.
  • ITU-T Rec. X.810 (11/95) — Security frameworks: Overview.
  • ITU-T Rec. X.811 (04/95) — Authentication framework.
  • ITU-T Rec. X.812 (11/95) — Access control framework.
  • ITU-T Rec. X.813 (10/96) — Non-repudiation framework.
  • ITU-T Rec. X.815 (11/95) — Integrity framework.
  • ITU-T Rec. X.816 (11/95) — Security audit and alarms framework.
  • ITU-T Rec. Z.100 (04/21) — Specification and Description Language overview.
  • ITU-T Rec. Z.120 (02/11) — Message Sequence Charts.
  • ITU-T Rec. Z.151 (10/18) — User Requirements Notation.

Proposal: Running capOS in the Browser (WebAssembly, Worker-per-Process)

How capOS goes from “boots in QEMU” to “boots in a browser tab,” with each capOS process executing in its own Web Worker and the kernel acting as the scheduler/dispatcher across them.

This proposal is the inverse of the Browser Capability and Agent Web Sessions direction: that one is about capOS exposing browsers to users and agents as capability-scoped services; this one is about running capOS itself inside a browser tab as a teaching and demo substrate. It is also adjacent to but distinct from the WASI Host Adapter: WASI hosts third-party wasm modules inside a capOS userspace process under explicit per-instance cap grants, while the browser port is capOS itself rebuilt for wasm32-unknown-unknown and run inside Workers. Both share the constraint that authority must be ABI-typed and per-instance, never ambient.

The goal is a teaching and demo target, not a production runtime. It should preserve the capability model — typed endpoints, ring-based IPC, no ambient authority — while replacing the hardware substrate (page tables, IDT, preemptive timer, privilege rings) with browser primitives (Worker boundaries, SharedArrayBuffer, Atomics.wait/notify).

Depends on: Stage 5 (Scheduling), Stage 6 (IPC) — the capability ring is the only kernel/user interface we want to port. Anything still sitting behind the transitional write/exit syscalls must migrate to ring opcodes first.

Complements: userspace-binaries-proposal.md and ../programming-languages.md (language/runtime story), service-architecture-proposal.md (process lifecycle). A browser port stresses both: the runtime must build for wasm32-unknown-unknown, and process spawn becomes “instantiate a Worker” rather than “map an ELF.”

Non-goals:

  • Running the existing x86_64 kernel unmodified in the browser. That’s a separate question (QEMU-WASM / v86) and is a simulator, not a port.
  • Emulating the MMU, IDT, or PIT in WASM. The whole point is to replace them with primitives the browser already gives us for free.
  • Any persistence, networking, or storage beyond what a hosted demo needs.

Current State

capOS is x86_64-only. Arch-specific code lives under kernel/src/arch/x86_64/ and relies on:

MechanismFileBrowser equivalent
Page tables, W^X, user/kernel splitmem/paging.rs, arch/x86_64/smap.rsWorker + linear-memory isolation (structural)
Preemptive timer (PIT @ 100 Hz)arch/x86_64/pit.rs, idt.rssetTimeout/MessageChannel + cooperative yield
Syscall entry (SYSCALL/SYSRET)arch/x86_64/syscall.rsDirect Atomics.notify on ring doorbell
Context switcharch/x86_64/context.rsNone — each process is its own Worker, OS schedules
ELF loadingelf.rs, main.rsWebAssembly.instantiate from module bytes
Frame allocatormem/frame.rsmemory.grow inside each instance
Capability ringcapos-config/src/ring.rs, cap/ring.rsReused unchanged — shared via SharedArrayBuffer
CapTable, CapObjectcapos-lib/src/cap_table.rsReused unchanged in kernel Worker

The capability-ring layer is the only stable interface that survives the port intact. Everything below cap/ring.rs is arch work; everything above is schema-driven capnp dispatch that doesn’t care about the substrate.


Architecture

flowchart LR
    subgraph Tab[Browser Tab / Origin]
        direction LR
        Main[Main thread<br/>xterm.js, UI, loader]
        subgraph KW[Kernel Worker]
            Kernel[capOS kernel<br/>CapTable, scheduler,<br/>ring dispatch]
        end
        subgraph P1[Process Worker #1<br/>init]
            RT1[capos-rt] --> App1[init binary]
        end
        subgraph P2[Process Worker #2<br/>service<br/>spawned by init]
            RT2[capos-rt] --> App2[service binary]
        end
        SAB1[(SharedArrayBuffer<br/>ring #1)]
        SAB2[(SharedArrayBuffer<br/>ring #2)]
        Main <-->|postMessage| KW
        KW <-->|SAB + Atomics| SAB1
        KW <-->|SAB + Atomics| SAB2
        P1 <-->|SAB + Atomics| SAB1
        P2 <-->|SAB + Atomics| SAB2
        P1 -.spawn.-> KW
        KW -.new Worker.-> P2
    end

One Worker per capOS process. Each process is a WASM instance in its own Worker, with its own linear memory. Cross-process access is structurally impossible — postMessage and shared ring buffers are the only channels.

Kernel in a dedicated Worker. Not on the main thread: the main thread is reserved for UI (terminal, loader, error display). The kernel Worker owns the CapTable, holds the Arc<dyn CapObject> registry, dispatches SQEs, and maintains one SharedArrayBuffer per process for that process’s ring. It directly spawns init; all further processes are created via the ProcessSpawner cap it serves.

Capability ring over SharedArrayBuffer. The existing CapRingHeader/CapSqe/CapCqe layout in capos-config/src/ring.rs already uses volatile access helpers for cross-agent visibility. Mapping it onto a SharedArrayBuffer is a change of backing store, not of protocol. Both sides see the same bytes; Atomics.load/Atomics.store replace the volatile reads on the host side; on the Rust/WASM side the existing read_volatile/ write_volatile lower to plain atomic loads/stores under wasm32-unknown-unknown with the atomics feature enabled.

cap_enter becomes Atomics.wait. The process Worker calls Atomics.wait on a doorbell word in the SAB after publishing SQEs. The kernel Worker (or its scheduler tick) calls Atomics.notify after producing completions. That is exactly the io_uring-inspired “syscall-free submit, blocking wait on completion” the ring was designed around — the browser happens to give us the primitive for free.

No preemption inside a process. A Worker runs to completion on its event loop turn; the kernel can’t interrupt it. This is fine: each process is single-threaded in its own isolate, and the scheduler only needs to wake the next process after Atomics.wait, not forcibly remove the running one. This is closer to a cooperative capnp-rpc vat model than to the current timer-preempted kernel, and matches what the capability ring already assumes.


Mapping capOS Concepts to WASM/Browser

Process isolation

The Worker boundary replaces the page table. Two capOS processes cannot observe each other’s linear memory, cannot jump into each other’s code (code is out-of-band in WASM — not addressable as data), and cannot share globals. The SharedArrayBuffer containing the ring is the only intentional shared region, and it is created by the kernel Worker and transferred to the process Worker at spawn time.

No W^X enforcement is needed within a Worker because WASM has no writable code region to begin with — WebAssembly.Module is validated and immutable. The MMU’s job is done by the WASM type system and validator.

Address space / memory

Each Worker’s WASM instance has one linear memory. capos-rt’s fixed heap initialization uses memory.grow instead of VirtualMemory::map. The VirtualMemory capability still exists in the schema, but its implementation in the browser port is a thin wrapper over memory.grow with bookkeeping for “logical unmap” (zeroing + tracking a free list — WASM doesn’t return pages to the host).

Protection flags (PROT_READ/PROT_WRITE/PROT_EXEC) become no-ops with a documented caveat in the proposal: the browser port does not enforce intra-process protection. Cross-process protection is structural and stronger than the native build.

Syscalls

The three transitional syscalls (write, exit, cap_enter) collapse to:

  • write — already slated for removal once init is cap-native. In the browser port, do not implement it at all. Force the port to drive the existing cap-native Console ring path, which forces the rest of the tree to be cap-native too. A forcing function, not a cost.
  • exitpostMessage({type: 'exit', code}) to the kernel Worker, which terminates the Worker via worker.terminate() and reaps the process entry.
  • cap_enterAtomics.wait on the ring doorbell after publishing SQEs, with a waitAsync variant for cooperative mode if we ever want to avoid blocking the Worker’s event loop.

Scheduler

Round-robin is gone; the browser scheduler is the OS scheduler. The kernel Worker’s “scheduler” is reduced to:

  1. A poll loop that drains each process’s SQ (the existing cap/ring.rs::process_sqes logic, called on every notify or on a setTimeout(0) tick).
  2. A completion-fanout step that pushes CQEs and Atomics.notifys the target Worker.

No context switch, no run queue, no per-process kernel stack. The code deleted here is exactly the code that smp-proposal.md says needs per-CPU structures — an orthogonal win: the browser port has no SMP problem because each process is structurally on its own agent.

Process spawning

The kernel Worker spawns exactly one process Worker directly — init — with a fixed cap bundle: Console, ProcessSpawner, FrameAllocator, VirtualMemory, BootPackage, and any host-backed caps (Fetch, etc.) granted to it.

// Kernel Worker bootstrap
const initMod = await WebAssembly.compileStreaming(fetch('/init.wasm'));
const initRing = new SharedArrayBuffer(RING_SIZE);
const initWorker = new Worker('process-worker.js', {type: 'module'});
kernel.registerProcess(initWorker, initRing, buildInitCapBundle());
initWorker.postMessage(
    {type: 'boot', mod: initMod, ring: initRing, capSet: initCapSet,
     bootPackage: manifestBytes},
    [/* transfer */]);

All further processes come from init invoking ProcessSpawner.spawn. ProcessSpawner is served by the kernel Worker; each invocation:

  1. Compiles the referenced binary bytes (WebAssembly.compile over the NamedBlob from BootPackage).
  2. Creates a new Worker and a SharedArrayBuffer for its ring.
  3. Builds the child’s CapTable from the ProcessSpec the caller passed, applying move/copy semantics to caps transferred from the caller’s table.
  4. Returns a ProcessHandle cap.

Init composes service caps in userspace: hold Fetch, attenuate to per-origin HttpEndpoint, hand each child only the caps its ProcessSpec names. Same shape as native after Stage 6.

Host-backed capability services

Some capabilities in the browser port are implemented by talking to the browser rather than to hardware. Fetch and HttpEndpoint — drafted in Service Architecture — are the canonical example. On native capOS they run over a userspace TCP/IP stack on virtio-net/ENA/gVNIC. In the browser port, the service process is replaced by a thin implementation living in the kernel Worker (or a dedicated “host bridge” Worker) that dispatches each capnp call by calling fetch / new WebSocket and returning the response as a CQE. The attenuation story is unchanged: Fetch can reach any URL, HttpEndpoint is bound to one origin at mint time, derived from Fetch by a policy process.

This is not a back door. The capability is granted through the manifest exactly as on native. Processes without the cap cannot reach the host’s network, cannot discover it, and cannot forge one. The only difference from native is the implementation of the service behind the CapObject trait — same schema, same TYPE_ID, same error model.

The same authority-boundary rule the trusted local Remote Session UI Security Proposal enforces between a loopback browser bridge and the upstream capOS gateway applies inside the browser port: browser JavaScript on the main thread is untrusted UI, the kernel Worker holds the CapTable, and the JS layer receives view models / call results, not raw CapIds. Any path that lets main-thread JS originate a SQE without going through the kernel Worker’s validated postMessage surface is the same class of bug the remote-session-ui bridge calls out — a loopback or in-tab listener inheriting operator authority because it skipped the typed boundary.

The same pattern applies to anything else the browser provides natively. Candidate future interfaces (no schema yet, mentioned so the port is considered when they are designed):

  • Clipboard over navigator.clipboard
  • LocalStorage / KvStore over IndexedDB (natural Store backend for the storage proposal in the browser)
  • Display / Canvas over an OffscreenCanvas posted back to the main thread
  • RandomSource over crypto.getRandomValues — trivial but needs a cap rather than a syscall

Other drafted network interfaces — TcpSocket, TcpListener, UdpSocket, NetworkManager from Networking — do not have a clean browser mapping. The browser exposes no raw-socket primitives, so these caps cannot be served in the browser port at all. Applications that need networking in the browser must go through Fetch/HttpEndpoint, and the POSIX compatibility adapter’s socket path must detect the absence of NetworkManager and route connect("http://...") through Fetch instead (or fail closed for other schemes). CloudMetadata from Cloud Metadata is simply not granted in the browser; there is no cloud instance to describe.

Each host-backed cap is opt-in per-process via the manifest; each has a native counterpart that the schema is already the contract for. This is a substantial point in favor of the port: host-provided services slot into the existing capability model without widening it.

CapSet bootstrap

The read-only CapSet page at CAPSET_VADDR is replaced by a structured-clone payload in the initial postMessage. capos-rt::capset::find still parses the same CapSetHeader/CapSetEntry layout, just out of a Uint8Array placed at a known offset in the process’s linear memory by the boot shim.


Binary Portability

Source-portable, not binary-portable. An ELF built for x86_64-unknown-capos does not run; the same source rebuilt for wasm32-unknown-unknown (with the atomics target feature) does, provided it stays inside the supported API surface.

Rust binaries on capos-rt

Port cleanly:

  • Any binary that uses only capos-rt’s public API — typed cap clients (ConsoleClient, future FileClient, etc.), ring submission/completion, CapSet::find, exit, cap_enter, alloc::*.
  • Pure computation, core/alloc containers, serde/capnp message building.

Do not port:

  1. Anything that uses core::arch::x86_64, inline asm!, or global_asm!.
  2. Binaries with a custom _start or a linker script baking in 0x200000. capos-rt owns the entry shape; the wasm entry is set by the host (WebAssembly.instantiate + an exported init), so the prologue differs.
  3. #[thread_local] relying on FS base until the wasm TLS story is decided (per-Worker globals, or the wasm threads proposal’s TLS).
  4. Code that assumes a fixed-size static heap region and reaches it with raw pointers. The wasm arch uses memory.grow; alloc::* hides this, unsafe { &mut HEAP[..] } does not.
  5. Anything that still calls the transitional write syscall shim — the browser build deliberately omits it.

Binaries mixing target features across the workspace produce silently- broken atomics. A single rustflags set for the browser build is required.

POSIX binaries (when the adapter lands)

The POSIX compatibility adapter described in Userspace Binaries Part 4 sits on top of capos-rt. If capos-rt builds for wasm, the adapter builds for wasm, and well-behaved POSIX code rebuilt for a wasm-targeted libcapos (clang --target=wasm32-unknown-unknown + our libc) ports too.

Ports cleanly:

  • Pure computation, string/number handling, data-structure libraries.
  • stdio over Console / future File caps.
  • malloc/free, C++ new/delete, static constructors.
  • select/poll/epoll implemented over the ring (ring CQEs are exactly the event source these APIs want).
  • posix_spawn over ProcessSpawner — spawning a new process becomes “instantiate a new Worker,” which is the native shape of the browser anyway.
  • Networking via Fetch/HttpEndpoint (drafted in Service Architecture) if the manifest grants the cap. The browser port serves these against the host’s fetch/WebSocket — not ambient authority, because only processes granted the cap can invoke it. Raw AF_INET/AF_INET6 sockets via the TcpSocket/NetworkManager interfaces in Networking are not available in the browser (no raw-socket primitive); POSIX networking code wants URLs in practice, and a libc shim can map getaddrinfo+connect+write over Fetch/HttpEndpoint for the HTTP(S) case, failing closed otherwise.

Does not port without new work, possibly ever:

  1. fork. Cannot clone a Worker’s linear memory into a new Worker and resume at the fork call site — there is no COW, no MMU, no way to duplicate an opaque WASM module’s mid-execution state. This is the same reason Emscripten/WASI don’t support fork. POSIX programs that fork-then-exec can be rewritten to posix_spawn; programs that fork-for-concurrency (Apache prefork, some Redis paths) cannot.
  2. Signals. No preemption inside a Worker means no asynchronous signal delivery. SIGALRM, SIGINT, SIGSEGV all need cooperative polling at best; kill(pid, SIGKILL) maps to worker.terminate() and nothing finer. setjmp/longjmp works within a function call tree; siglongjmp out of a signal handler does not exist.
  3. mmap of files with MAP_SHARED. WASM linear memory is not file-backed and cannot be. MAP_PRIVATE | MAP_ANONYMOUS works trivially (it’s just memory.grow + a free list). File-backed mappings require a userspace emulation that reads on fault and writes back on unmap — workable for small files, a lie for the memory- mapped-database case.
  4. Threads without the wasm threads proposal. pthreads over Workers sharing a memory is the only implementation strategy, and it requires the wasm atomics/bulk-memory/shared-memory feature set plus careful runtime support. Single-threaded POSIX code works now; multithreaded POSIX code needs the in-process-threading track from the native roadmap and its wasm counterpart.
  5. Address-arithmetic tricks. Wasm validates loads/stores against the linear-memory bounds. Code that relies on unmapped trap pages (guard pages, end-of-allocation sentinels) or on specific virtual addresses fails.
  6. dlopen. A wasm module is immutable after instantiation. Dynamic loading requires loading a second module and linking via exported tables — possible with the component model, nowhere near drop-in dlopen. Static linking is the pragmatic answer.

Rough guide: if a POSIX program compiles cleanly under WASI and uses only WASI-supported syscalls, it will almost certainly port to capOS-on-wasm with the adapter, because the constraints overlap. If it needs features WASI doesn’t support (fork, signals, shared mmap), the capOS browser port will not magically fix that — the limitations come from the substrate, not from the POSIX adapter’s completeness.


Build Path

Three new cargo targets, no workspace restructuring required:

  1. capos-lib on wasm32-unknown-unknown. Already no_std + alloc, no arch-specific code. Should build as-is; verify under cargo check --target wasm32-unknown-unknown -p capos-lib.

  2. capos-config on wasm32-unknown-unknown. Same — pure logic, the ring structs and volatile helpers are portable.

  3. capos-rt on wasm32-unknown-unknown with atomics feature. The standalone userspace runtime currently hard-codes x86_64 syscall instructions. Introduce an arch module split:

    • arch/x86_64.rs (existing syscall.rs contents)
    • arch/wasm.rs (new — Atomics.wait via core::arch::wasm32::memory_atomic_wait32, exit via host import)

    Gate at the syscall boundary, not deeper; the ring client above it is arch-agnostic.

  4. Demos on wasm32-unknown-unknown. Same arch split applied via capos-rt. No per-demo changes expected if the split is clean.

The kernel does not build for wasm. Instead, a new crate capos-kernel-wasm/ (peer to kernel/) reuses capos-lib’s CapTable and capos-config’s ring structs and implements the dispatch loop against JS host imports for Worker management. It is, deliberately, not the same kernel binary. Trying to build kernel/ for wasm would pull in IDT/GDT/paging code that has no meaning in the browser.


Phased Plan

Phase A: Port the pure crates

  • Verify capos-lib, capos-config build clean on wasm32-unknown-unknown. CI job: cargo check --target wasm32-unknown-unknown -p capos-lib -p capos-config.
  • Add a host-side ring-tests-js harness that exercises the same invariants as tests/ring_loom.rs but with a real JS producer and a Rust/wasm consumer, both sharing a SharedArrayBuffer. Proves the volatile access helpers are portable before anything else depends on them.

Phase B: capos-rt arch split

  • Introduce capos-rt/src/arch/{x86_64,wasm}.rs behind a #[cfg(target_arch)].
  • Rewire syscall/ring/client to call through the arch module.
  • Add make capos-rt-wasm-check target. Existing make capos-rt-check stays for x86_64.

Phase C: Kernel Worker + init

  • capos-kernel-wasm/ with a Console capability that renders to xterm.js via postMessage back to the main thread.
  • Kernel Worker spawns init. Init prints “hello” through Console and exits.

Phase D: ProcessSpawner + Endpoint

  • ProcessSpawner served by the kernel Worker, granted to init.
  • Init parses its BootPackage and spawns the endpoint-roundtrip and ipc-server/ipc-client demos via ProcessSpawner.spawn. These stress capability transfer across Workers: does a cap handed from A to B via the ring land correctly in B’s ring, and does B’s subsequent invocation route back to the right holder?
  • This phase turns the port into a validation surface for the capability-transfer and badge-propagation invariants in docs/authority-accounting-transfer-design.md, and a second implementation of the Stage 6 spawn primitive.

Phase E: Integration with demos page

  • Hosted page at a project URL; xterm.js terminal; selector for which demo manifest to boot.
  • Serve .wasm artifacts as static assets.

Security Boundary Analysis

The browser port changes what is trusted and what is verified. Summary:

BoundaryNative (x86_64)Browser (WASM-Workers)
Process ↔ processPage tables + ringsWorker agents + SAB (structural)
Process ↔ kernelSyscall MSRs + SMEP/SMAPpostMessage + validated host imports
Code integrityW^X + NXWASM validator + immutable Module
Capability forgeryKernel-owned CapTableKernel-Worker-owned CapTable
Capability transferRing SQE validated in kernelRing SQE validated in kernel Worker — same code path

The capability-forgery story is the same in both: an unforgeable 64-bit CapId is assigned by the kernel and can only be resolved through the kernel’s CapTable. A process Worker cannot synthesize a valid CapId because it never sees the CapTable; it only sees SQEs it submits and CQEs it receives. This property is what makes the port worth doing — the capability model is preserved exactly.

What weakens: no SMAP/SMEP equivalent, but also no corresponding attack surface (the “kernel” Worker has no pointer into process memory; it can only copy bytes out of the shared ring). No DMA problem. No side-channel parity with docs/dma-isolation-design.md — Spectre/meltdown in the browser is the browser’s problem, mitigated by site isolation and COOP/COEP.

Required headers: Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corpSharedArrayBuffer is gated on these. A hosted demo page must set them.


What This Port Buys Us

  1. Shareable demos. A URL that boots capOS in ~1s, with no QEMU, no local install. Valuable for documentation and recruiting.
  2. A second substrate for the capability model. If the cap-transfer protocol has a bug, reproducing it under Workers (single-threaded, deterministic scheduling) is much easier than under SMP x86_64. A second implementation of the dispatch surface is a correctness asset.
  3. Forcing function for write syscall removal. The browser port cannot support the transitional write path without importing host I/O as a back door, which is exactly the ambient authority we want to avoid. Shipping a browser demo at all requires finishing the migration to the Console capability over the ring.
  4. Teaching surface. Workers give a much clearer visual of “one process, one memory, one cap table” than a bare-metal kernel ever will. The isolation story renders in the DevTools panel.

What It Does Not Buy Us

  1. Not a validation surface for the x86_64 kernel. Page tables, IDT, context switch, SMP — none of that runs. Bugs in those subsystems will not appear in the browser build.
  2. Not a performance story. WASM + Workers + SAB is slower than native QEMU-KVM for the parts it does overlap on, and does not exercise the hardware features capOS eventually cares about (IOMMU, NVMe, virtio-net).
  3. Not a path to “capOS on Cloudflare Workers” or similar. Cloudflare’s runtime is a single isolate per request, no SAB, no threads — a different environment that would need its own proposal.

Open Questions

  1. Do we ship one capos-kernel-wasm crate, or does the kernel Worker run plain JS that imports a thin capos-dispatch wasm? JS-hosted kernel is simpler (no second wasm toolchain for the kernel side) but duplicates cap-dispatch logic. Preferred: Rust/wasm kernel Worker reusing capos-lib — dispatch code stays single-sourced.
  2. How do we surface kernel panics in the browser? Native capOS halts the CPU; the browser equivalent is posting an error to the main thread and tearing down all Workers. Should match the panic = "abort" contract — no recovery attempted.
  3. Do we implement VirtualMemory as a no-op or as a real allocator? No-op is faster to ship; a real allocator over memory.grow exercises more of the capability surface. Lean toward real, gated behind a browser-shim flag so the demo doesn’t silently diverge from the native semantics.
  4. Manifest format: keep capnp, or add JSON for hand-authored demo configs? Keep capnp. The manifest is already the contract; adding a parallel format is exactly the drift the project has been careful to avoid.

Relationship to Other Proposals

  • Userspace Binaries — the wasm32 runtime story lives there eventually. This proposal is narrower: just enough runtime to boot the existing demo set in a browser. If the userspace proposal lands a richer runtime first, this one adopts it.
  • WASI Host Adapter — the WASI host adapter (capos-wasm) already exercises the inverse direction: hosting third-party wasm32-wasip1/wasm32-wasi modules inside a capOS userspace process whose Preview 1 imports are backed by typed capabilities (Console, Timer, EntropySource, bounded argv/env text grants). The browser port consumes that experience in three ways: it reuses the per-instance cap-grant pattern (no ambient host imports, every authority surfaced through the CapSet); it inherits the lesson that host-backed imports must refuse closed when the cap is not granted (W.4’s ERRNO_NOSYS = 52 refusal sentinel); and it specifically rejects pulling the kernel itself into a hosted wasm-runtime substrate — the browser kernel Worker is a Rust/wasm port of capos-lib’s CapTable and capos-config’s ring dispatch, not a wasmi-style interpreter over another guest. If a future browser-port phase wants to host third-party wasm modules inside a capOS-on-wasm userspace process, that work belongs to the WASI adapter direction, not here.
  • Browser Capability and Agent Web Sessions — the opposite direction: capOS exposing browsers as capability-scoped services (BrowserSession, BrowserProfile, BrowserContext) to users, shells, and agents. The two proposals share design principles (browser state is authority; the interface is the permission; agents receive tools, not admin ports) but do not overlap in implementation — one is a userspace browser service driven over CDP/WebDriver BiDi from a capOS host; this one is capOS rebuilt for wasm and run inside Workers with no browser engine of its own.
  • Remote Session UI Security — the trusted local web bridge that owns the TCP connection and upstream capOS session while browser JavaScript receives only DTOs. The browser port faces the same boundary inside one tab: the kernel Worker holds the CapTable and serves typed CQEs back to process Workers, and any UI surface on the main thread is untrusted glue, not a cap holder. The CSRF/CSP/cookie/cookie-isolation posture documented there is the reference the browser port adopts before serving any host-backed capability (Fetch, Clipboard, storage) to a process Worker; relaxing it for “just a demo” is exactly the ambient-authority drift the proposal warns against.
  • SMP — structurally irrelevant to the browser port (each Worker is its own agent). The browser port does inform SMP testing, because the cap-transfer protocol under Workers is a cleaner model of “messages cross agents asynchronously” than single-CPU preempted kernels.
  • Service Architecture — process spawn in the browser becomes Worker instantiation. The lifecycle primitives (supervise, restart, retarget) map naturally. Live upgrade (Live Upgrade) is even more natural under Workers than under in-kernel retargeting — swap the WebAssembly.Module behind a Worker while the ring stays live.
  • Security and Verification — the browser port adds a CI job (wasm builds + JS-side ring tests) but does not change the verification story for the native kernel.

Proposal: Browser Capability and Agent Web Sessions

How capOS should expose the web without turning a browser into an ambiently privileged desktop escape hatch.

This proposal is intentionally split into three tracks:

  • After GUI: a full visual browser for humans, with windows, input, rendering, profiles, downloads, extensions, and ordinary web compatibility.
  • Agent/shell usage: a standard BrowserSession capability that lets shells and AI agents navigate, inspect, screenshot, fill forms, download, and collect evidence through a brokered browser service before capOS has a native GUI browser.
  • Cap-native document engine: an intermediate path that runs JS, DOM/CSS, layout, and rendering over caller-provided document/resource data, with fetch, storage, permissions, clipboard, downloads, and host I/O wired to native capOS capabilities instead of a browser-owned ambient platform.

The existing Browser/WASM proposal runs capOS in a browser tab. This proposal is the inverse: capOS exposes browser capabilities to users, services, and agents.

Grounding research: Browser Engines, Document Engines, and Agent Browsers.

Problem

The web is both a user interface substrate and a huge authority boundary. A browser can read credentials, perform network requests, upload local files, download untrusted bytes, run JavaScript from hostile origins, track users through profiles, and expose debug protocols powerful enough to rewrite page state.

On a conventional OS that power is hidden behind process permissions, profile directories, and implicit user intent. capOS needs a browser model that fits the capability system:

  • Profiles and sessions are explicit authority.
  • Network routes, downloads, uploads, credentials, and automation are scoped.
  • Browser JavaScript does not get shell or storage authority by accident.
  • Agents can use the web as a tool without receiving raw CDP, filesystem, or network capabilities.

Non-Goals

  • Writing a new browser engine for the first capOS browser milestone.
  • Porting Chromium, WebKit, Gecko, Servo, or Ladybird before the GUI, userspace networking/storage, fonts, and driver-safety prerequisites exist.
  • Treating anti-detection, fingerprint evasion, scraping at scale, or bot bypass as a capOS product goal.
  • Exposing raw Chrome DevTools Protocol, WebDriver BiDi, or Playwright handles as ordinary user/session capabilities.
  • Letting browser-hosted JavaScript hold raw capOS shell, launch, file, or network capabilities.

Design Principles

  1. Browser state is authority. A profile’s cookies, local storage, permissions, saved credentials, cache, proxy route, and downloads are not implementation details. They are held through BrowserProfile and BrowserContext capabilities.

  2. The interface is the permission. A caller that can navigate does not automatically get DOM inspection, screenshot, input, download, upload, network interception, profile mutation, or automation-debug authority.

  3. Agents receive tools, not admin ports. CDP and WebDriver BiDi are backend protocols for the trusted browser service. The agent-facing ABI is a typed narrowed capability surface.

  4. Origins become visible policy inputs. Browser decisions should record origin, top-level site, profile, user session, persona, network route, and initiator. URL strings alone are not enough.

  5. Downloads and uploads cross explicit caps. A download returns a BrowserArtifact or writes through a granted DownloadSink. Uploading a file requires a granted read cap for that object and a per-action policy decision.

  6. Automation is auditable. Browser actions initiated by an agent are logged with the page/session, operation, typed arguments, permission mode, result, and artifacts captured for later review.

  7. Visual browsing waits for GUI. A human browser is a real app, not a terminal command. It should land only after compositor/input/font/storage and userspace networking foundations are credible.

  8. A browser can be headless before it is native. The early agent/shell-facing capability may be served by a host-side browser, a development-machine sidecar, a Linux companion process, or a remote browser service. The capOS ABI should not expose which backend serves it.

Track 1: Agent/Shell Browser Capability

This is the near-term conceptual track. It gives capOS agents and shells a standard web tool without waiting for a compositor or native browser port.

Conceptual interfaces:

interface BrowserBroker {
  createProfile @0 (request :BrowserProfileRequest) -> (profile :BrowserProfile);
  openContext @1 (profile :BrowserProfile, policy :BrowserContextPolicy)
      -> (context :BrowserContext);
}

interface BrowserContext {
  openSession @0 (persona :BrowserPersona) -> (session :BrowserSession);
  snapshot @1 () -> (profileSnapshot :BrowserProfileSnapshot);
  destroy @2 () -> ();
}

interface BrowserSession {
  close @0 () -> ();
}

interface BrowserNavigate {
  navigate @0 (url :Text, wait :NavigationWait) -> (result :NavigationResult);
}

interface BrowserReadPage {
  readPage @0 (budget :PageReadBudget) -> (snapshot :PageSnapshot);
}

interface BrowserScreenshot {
  screenshot @0 (options :ScreenshotOptions) -> (image :BrowserArtifact);
}

interface BrowserInput {
  input @0 (action :InputAction) -> (result :InputResult);
}

interface BrowserDownload {
  download @0 (selector :DownloadSelector, sink :DownloadSink)
      -> (artifact :BrowserArtifact);
}

The exact schema belongs in a later implementation slice. The important rule is that BrowserSession is only a lifetime handle for one browsing context. It does not imply navigation, inspection, screenshot, input, download, upload, network-observer, or debug authority. The broker mints only the operation facets allowed by the caller’s session policy, and the shell/agent runner advertises only tools backed by facets it actually holds.

CapabilityAuthority
BrowserBrokerMint profiles and contexts according to session policy.
BrowserProfileOwn persistent browser state and profile lifecycle.
BrowserContextOwn one isolated browsing context under a profile.
BrowserSessionHold and close one session lifetime; no operation authority by itself.
BrowserNavigateNavigate within one session.
BrowserReadPageInspect page state under output budgets.
BrowserScreenshotCapture screenshot artifacts under policy.
BrowserInputClick, type, select, upload only with explicit grants.
BrowserDownloadInitiate browser downloads into a granted sink.
DownloadSinkReceive bytes/artifacts from browser downloads.
BrowserNetworkObserverRead network metadata or bodies under redaction policy.
BrowserAdminBackend-only: raw CDP/BiDi, crash dumps, trace, profile mutation.

Agent Tool Shape

The native shell or agent runner advertises browser operations as ordinary tools:

  • browser.open(url)
  • browser.snapshot()
  • browser.screenshot()
  • browser.click(ref)
  • browser.type(ref, text)
  • browser.select(ref, value)
  • browser.download(ref)
  • browser.close()

The tool result is structured:

  • page title, URL, origin, load state
  • accessibility/DOM references under stable short IDs
  • visible text and form fields under a token/byte budget
  • screenshot artifact cap, when requested
  • network/download artifacts only when separately allowed

The model never receives the BrowserSession cap. It proposes tool calls; the runner executes them after policy and consent checks, then feeds bounded results back to the model. This matches Language Models and the Agent Runtime.

Backend Strategy

The first implementation should be a userspace service or host-side harness that owns a real browser and exposes the typed capOS surface:

  1. Browser service launches or attaches to Chromium/Firefox/WebKit through Playwright, WebDriver BiDi, or CDP.
  2. The service stores profile state in a host directory or capOS Store backend, but callers see only BrowserProfile caps.
  3. The service enforces per-session operation grants and output budgets before returning DOM text, screenshots, network metadata, or downloads.
  4. An MCP adapter can present the same tools to external agents, but MCP is an adapter, not the authority model.

This makes browser usage testable while capOS still lacks native GUI pieces. It also creates a practical compatibility path for agents that need the modern web during capOS development.

Track 1.5: Cap-Native Document Engine

The most capOS-shaped browser work may not be “port a full browser” first. There is a meaningful middle target: run the parts of the web stack that turn provided data into an interactive document – JavaScript, DOM, CSS, layout, rendering, and perhaps WebAssembly – while replacing browser-owned host APIs with capability-backed services.

In this model, the engine does not own raw networking, files, profile directories, clipboard, permissions, downloads, credentials, or extension installation. It receives a document/resource graph and a bundle of explicit host caps. Each document bundle also needs a broker- or ResourceLoader-minted web principal: an explicit origin, package origin, or opaque origin plus base URL policy used for relative URLs, storage partitioning, fetch checks, audit records, and user-facing permission prompts. Opaque origins are the default for caller-provided bundles; a real web or package origin requires authority or attestation from the loader that supplied the bytes. Web APIs become host bindings:

Web-facing operationcapOS-backed authority
fetch() / subresource loadHttpEndpoint, Fetch, or content-addressed ResourceLoader cap.
cookies / local storage / IndexedDBBrowserProfileStore or narrower origin-scoped KvStore cap.
file picker / uploaduser-approved FileRead or artifact cap.
downloadsDownloadSink / StoreWriter cap.
clipboardexplicit ClipboardRead / ClipboardWrite caps.
geolocation, camera, microphonefuture sensor/media caps, never implicit.
workers / timersscheduler and resource-budget caps.
WebAssembly importsexplicit host import caps, not ambient syscalls.

Document-engine Wasm hosting is the same shape as the WASI Host Adapter: a userspace process holds the wasm runtime and binds each import to an explicit capOS capability passed in through its bootstrap CapSet, rather than letting module code reach for ambient syscalls. Phase W.3/W.4 of that proposal already grants per-instance bounded text (argv, environment) and typed EntropySource-backed random_get through narrowed broker grants; the cap-native document engine should reuse the same bootstrap CapSet convention and per-instance grant shape when it eventually hosts JS/Wasm runtimes inside the browser stack so that fetch/storage/clipboard/random_get bindings stay authority-by-grant.

This track is useful for three reasons:

  1. It gives capOS a native HTML/CSS/JS application substrate without waiting for all of ordinary web browsing. Documentation, setup flows, dashboards, adventure/Paperclips UIs, and local admin apps could be rendered from trusted or packaged resources before arbitrary internet browsing is safe.
  2. It lets the project design web API host bindings around capabilities from the start. A later full browser can reuse the same profile, fetch, storage, permission, and artifact services instead of hiding them inside an engine.
  3. It is a smaller research target for engine embedding. Servo, Ladybird, and WebKit/WPE can be evaluated as document/rendering substrates, while SpiderMonkey, JavaScriptCore, Boa, or QuickJS can be evaluated as JS/Wasm runtime components or host-binding proof substrates without committing to an entire general-purpose browser port.

The accepted first shape should be conservative:

  • Load documents from a DocumentBundle or ResourceLoader cap, not from a URL bar.
  • Require every bundle principal to be minted or validated by the broker or ResourceLoader, and partition fetch, storage, cache, and audit state by profile/context/session plus that principal.
  • Disable arbitrary internet subresource fetch until a caller grants a narrowed Fetch/HttpEndpoint.
  • Produce a rendered surface or screenshot artifact plus a bounded accessibility/DOM snapshot.
  • Treat every Web API host binding as a separate facet and require explicit broker grants.
  • Avoid extension APIs, service workers, persistent background sync, notifications, WebRTC, and device APIs until their capOS authority model is clear.

The self-served remote-session web UI is an application-hosting instance of this middle track, not a general browser milestone. The UI bundle is an immutable boot-package resource served by a capOS service through scoped listener authority; browser JavaScript is still ordinary untrusted page code. The capOS service, not the page, holds the remote session CapSet and service proxies, then exposes browser-safe view models and user-event commands over same-origin HTTP routes. This keeps the first proof aligned with the browser capability rule that JavaScript never receives raw capOS caps, shell or spawn authority, endpoint owner handles, storage roots, or host identity hints. The Remote Session UI Security proposal owns the concrete web-security posture for that bridge – per-browser-session isolation, CSRF/CSP/cookie posture, transcript redaction, and the Tauri desktop wrapper’s reduced webview surface – and is the load-bearing precedent for how a cap-native document engine should treat its same-origin DTO channel: the Rust/backend authority boundary, not page JavaScript, holds upstream capOS handles.

This is still not a toy scripting widget. Running hostile JavaScript against a DOM/layout engine remains a large TCB, and rendering bugs can be security bugs. The point is to narrow the host-platform surface: provided data in, rendered surface/snapshot/artifacts out, and every side effect through typed caps.

Track 2: Visual Browser After GUI

A human-facing browser should be a normal capOS GUI application once these prerequisites exist:

  • compositor and input service
  • font discovery/rasterization
  • userspace networking and TLS
  • Store/Namespace-backed profile persistence
  • download/upload mediation
  • shared-memory graphics buffers or GPU session caps
  • process crash/restart handling
  • brokered user-session profile policy

Candidate engine paths:

Engine pathRolecapOS assessment
Chromium Ozone / CEFMaximum compatibility and automation ecosystemBest external/backend choice; native port is very large.
WPE WebKitEmbedded visual browser candidatePlausible post-GUI engine because WPE is designed for embedded backends.
Gecko / GeckoViewBrowser diversity and principal-model precedentGood external backend; GeckoView itself is Android-specific.
ServoRust/modular research-aligned engineTrack closely; not first broad-compatibility choice.
Ladybird / LibWebIndependent-engine precedentTrack for architecture; not a near-term dependency.

The visual browser should reuse the agent/shell profile/session model instead of inventing a second profile stack. A GUI tab is a BrowserSession with a visual BrowserView surface attached. Closing the window should not silently destroy profile state unless the profile cap is ephemeral.

Donut Browser Ideas To Adapt

Donut Browser is useful because it treats browser profiles as first-class, scriptable objects and exposes local REST/MCP automation. capOS should adapt the capability-shaped parts:

  • Unlimited local profiles map to broker-minted BrowserProfile caps.
  • Profile groups map to policy bundles and user-session grants.
  • Per-profile cookies/storage/extensions map to Store-backed state owned by the profile cap.
  • Per-profile proxy/VPN selection maps to explicit network-route caps.
  • Local REST/MCP maps to a typed capOS service plus optional external adapter.
  • Persistent automation sessions map to BrowserContext lifetimes and snapshots.
  • Default-browser link routing maps to a broker decision: which profile/context should open a URL for this user/session?

capOS should not adopt Donut’s anti-detect promise. If capOS supports persona controls such as viewport, locale, timezone, user agent, geolocation, WebRTC policy, or fingerprint reduction, those controls should be explicit BrowserPersona policy with audit and user-facing disclosure.

Security Boundary

Browser work adds these trust boundaries:

  • Web content to browser engine. Untrusted JavaScript, media, fonts, and documents hit a large engine TCB. Native browser work should keep renderer, network, image decode, and profile services separated where the backend permits it.
  • Browser engine to capOS. The engine must not receive broad shell caps. Its only capOS authorities should be its granted network route, profile store, artifact sink, and visual/input surfaces.
  • Agent to browser service. The agent sees tool descriptors and bounded snapshots, not backend debug ports.
  • Browser downloads to storage. Downloaded bytes are untrusted artifacts until a user or policy process imports them into a namespace.
  • Browser uploads to web origin. Upload requires explicit file/artifact authority and must record the destination origin.
  • Profile to profile. Cookies, storage, cache, extension state, and persona policy must not bleed across profiles unless a broker grants an explicit clone/import/export operation.

Raw CDP or BiDi access is BrowserAdmin authority. It should be held only by the browser service supervisor and developer harnesses, not by ordinary shell sessions.

Phased Plan

Phase A: Host-Backed Agent Browser

  • Add a host-side or userspace browser service proof that exposes a narrowed BrowserSession over an existing browser backend.
  • Use fake-model or scripted-agent QEMU/host proof first: navigate to a local page, read a bounded snapshot, click/type, capture a screenshot artifact, and close the session.
  • Record audit output for each action and show that the caller never receives raw CDP/BiDi.

Phase B: Standard Shell Tool

  • Add native shell and agent-runner integration so browser.open, browser.snapshot, and browser.screenshot are standard tools when the broker grants a browser bundle.
  • Add MCP adapter support for external agents using the same typed operation set.
  • Add download/upload gates once Store/Namespace and artifact caps exist.

Phase C: Cap-Native Document Engine Proof

  • Add a restricted DocumentBundle proof that renders packaged HTML/CSS/JS to a screenshot or simple surface and emits a bounded accessibility/DOM snapshot.
  • Wire at least one host API, such as fetch from a preloaded resource bundle or a profile-scoped key/value store, through a typed capability.
  • Prove that absent caps fail closed: no network, no profile storage, no clipboard, and no downloads by default.

Phase D: In-capOS Headless Browser Backend

  • Port or package a browser backend process once userspace networking, storage, fonts, and threads are mature enough.
  • Prefer a backend that can run without a full visible GUI surface but still supports screenshots and accessibility/DOM snapshots.
  • Preserve the same BrowserSession ABI so agents do not notice the backend change.

Phase E: Visual Browser

  • Add BrowserView/window integration after compositor/input support exists.
  • Reuse BrowserProfile and BrowserSession for tabs/windows.
  • Add user-facing profile picker, permissions UI, downloads UI, and audit view.

Relationship To Existing Proposals

  • Browser/WASM is about capOS as a browser-hosted runtime. This proposal is about capOS exposing browser capability services.
  • Language Models and the Agent Runtime owns the model/tool-call loop. Browser sessions are one tool family.
  • Shell and Interactive Command Surfaces own command exposure. Browser operations should appear there as typed tools, not string commands tunneled to an automation port.
  • Networking, Storage and Naming, and GPU Capability provide prerequisites for a native visual browser. The networking proposal owns the userspace TCP/IP and TLS authority the broker eventually narrows into Fetch, HttpEndpoint, and per-profile proxy/route caps; a browser engine never sees raw socket authority.
  • Remote Session UI Security defines the web-security posture for the trusted local remote-session-ui bridge and its Tauri desktop wrapper. It is the concrete precedent for the cap-native document engine’s “Rust/backend authority boundary, not page JavaScript, holds capOS handles” rule.
  • WASI Host Adapter ships the typed capability boundary for sandboxed WebAssembly imports. The cap-native document engine’s Wasm bindings should reuse the same bootstrap CapSet convention and per-instance grant shape (argv, env, entropy, and – once their authority surfaces exist – filesystem and sockets) rather than inventing a parallel browser-only Wasm host.

Open Questions

  • Should the first implementation wrap Playwright for breadth, raw CDP for smaller dependencies, or WebDriver BiDi for standards alignment?
  • What is the minimal page snapshot that remains useful to an LLM while limiting token use and accidental data disclosure?
  • Should BrowserPersona support fingerprint reduction only, or also compatibility personas for testing?
  • How should extensions be represented: profile-owned package state, separately granted extension caps, or both?
  • How should a visual browser present capOS capability prompts without training users to approve every web-origin request blindly?

Proposal: Language Models and the Agent Runtime

How capOS runs language models — including a built-in on-ISO local model — as ordinary capability-served processes, and how the interactive agent is structured around an interactive tool-use loop instead of a plan-approve-execute pipeline.

Why This Proposal Exists

Two problems converge:

  1. An earlier draft of the shell proposal sketched an “agent shell” that was itself a natural-language planner embedded in the shell process. That collapses three distinct concerns (user interaction, capability holding, model inference) into one, and it also got the shape of the interaction wrong: a one-shot “model emits a plan, user approves, dispatcher executes” pipeline is strictly weaker than how real agent systems work. In practice the model runs in a tool-use loop: it emits tool calls, the runtime executes them, results feed back into the conversation, the model decides what to do next, and the user stays in the loop through per-tool permission gates and interrupts. That interactive loop is what makes an agent useful; a static plan is a degenerate case of it. The shell proposal now defers to this document for the agent loop and only describes the native shell’s “agent mode” surface; see Shell for the matching shell-side framing.

  2. capOS has no story for where model weights live, who holds them, what accelerator they run on, or how external model providers (remote HTTP, local Ollama, a future NPU) plug into the same interface. Every serious workload — interactive agent, chat NPCs in the adventure demo, summarisation of audit logs, semantic search over LogReader, embedding-based retrieval from a Directory — wants a language or embedding model. Without a shared capability surface, each consumer reinvents the wiring and smuggles different amounts of authority into the model process.

This proposal defines both halves: the model-as-capability architecture, and the agent-runner that drives the interactive tool-use loop on top of it.

Long-lived OpenClaw-like hosted agents, multi-agent swarms, workspace/memory control planes, MCP/A2A-style interoperability, and agent-maintained wiki substrates are split into capOS-Hosted Agent Swarms. This document keeps the base model and single-runner loop narrow.

Scope

  • Language models (chat / completion / tool use / structured output).
  • Text embedders (vector encoders for retrieval).
  • Tokenisers and small auxiliary models (classifier, reranker, guardrail).
  • A built-in local model shipped on the ISO for first-boot and offline use.
  • Pluggable external backends (remote HTTP providers, future GPU-accelerated local inference, future NPU).
  • The interactive agent runner that exposes session capabilities to the model as tools, executes tool calls, streams results back, and keeps the user in the loop.
  • A web-shell execution model where the browser agent is the UI and may orchestrate the LLM/tool-call loop, while WebShellGateway keeps capOS capabilities server-side and enforces every tool invocation.

Out of scope here (deferred):

  • Training, fine-tuning, RLHF pipelines. capOS is an inference host, not a trainer.
  • Native realtime multimodal voice sessions. The same authority split applies, but realtime audio, barge-in, transcripts, and provider tool-call events need a separate session interface; see Realtime Voice Agent Shell.
  • Long-lived hosted-agent swarms, external channel-triggered background work, durable task queues, agent-maintained wikis, MCP/A2A bridges, and OpenClaw-like harness control planes; see capOS-Hosted Agent Swarms.
  • Federated / multi-party inference. Treated as a later network topology.

Design Principles

  1. Models are services, not shells. A model runs in a dedicated process with its own CapSet. It has no session cap, no TerminalSession, no Launcher, no ProcessSpawner, no ApprovalClient, no user secrets, and no inbound network authority. Its only job is to turn inputs into outputs through typed methods.

  2. Prompts and outputs are data. Nothing the model reads or writes is authority by itself. The model cannot “say” a capability into existence. Free-form text it emits is never parsed as a command. Tool calls are a separate structured output channel — typed arguments, not shell lines.

  3. Tool calls are proposals, not invocations. The model does not hold tool caps and does not perform the call. It emits a ToolCall value naming an advertised tool, with typed arguments conforming to the tool’s schema. A trusted capOS-side runner or WebShellGateway tool proxy decides whether to execute, prompt the user, or refuse.

  4. Per-tool permission, not per-plan approval. Each tool carries a permission mode: auto (read-only, auto-execute), consent (ask the user quickly before running, similar to a per-action “Allow” prompt), stepUp (re-auth required), or forbidden (advertised for explanation only, never runnable). Permission lives on the tool descriptor, not on a post-hoc review of a generated plan. This matches how real agent systems behave and avoids the impossible review problem of a twenty-step plan.

  5. The interface is the permission. A caller holding LanguageModel can request completions. A caller holding TextEmbedder can request vectors. Neither exposes weights, tokeniser internals, raw accelerator memory, or administration of the model service. Those stay behind separate ModelAdmin, ModelCatalog, and ModelRuntime caps held only by the service’s supervisor.

  6. Backends are substitutable behind the same interface. LanguageModel does not imply on-host inference. A LanguageModel handle may be served by the built-in local model, an in-tree Rust inference engine, a GPU-accelerated local backend, or a wrapper over a remote provider. The caller cannot tell from the capability alone — and should not need to.

  7. Weights are read-only file-backed memory. Weights live as files in the ISO (for the built-in model) or a storage volume (for installed models), and are mapped into the model process through a read-only file-backed MemoryObject. A shared page cache lets multiple model worker processes and multiple sessions share the same physical frames. Weights are never copied into process-private memory.

  8. Policy lives in the broker, not in the model. AuthorityBroker decides which sessions get a LanguageModel cap, which backend the cap resolves to, which tools are advertised with which permission modes, rate and quota limits, and whether outbound network providers are allowed. The model enforces none of this; it cannot, because it does not see the session.

  9. User interrupts beat model momentum. The user can break the loop at any time — Ctrl-C, a UI cancel, a terminal close. An in-flight tool call is either aborted or allowed to complete without its result going back to the model. The runner never waits for the model to “decide to stop”.

  10. Browser agents are UI, not authority. In a web shell the agent may live in browser JavaScript and may call a provider API directly with an ephemeral token. That does not make it a capOS authority holder. Browser code can propose structured tool requests to WebShellGateway; the gateway and broker validate, authorize, execute, revoke, and audit.

  11. Audit every tool call that touched authority. Each executed tool call is logged with model identity, model version, turn index, the advertised tool descriptors, the exact typed arguments, permission decision, user consent (if any), and the tool’s outcome. The model service does not write audit records; the runner does, because only the runner or gateway tool proxy sees both the call and the execution.

Architecture Overview

There are two accepted execution models. The native/capOS-side model keeps the whole agent loop inside a capOS process. The web-shell model lets the browser agent be the user interface and turn orchestrator, but not the holder of raw capOS capabilities.

CapOS-Side Runner

flowchart LR
    User[User / terminal] --> Runner[Agent Runner<br/>holds session caps]

    Runner -->|LanguageModel.complete| ModelSvc[language-model service process]
    ModelSvc --> Weights[(Read-only MemoryObject<br/>weights file)]
    ModelSvc --> Backend{Backend}
    Backend -->|cpu| CpuEngine[In-process inference engine]
    Backend -->|gpu| GpuSession[GpuSession cap]
    Backend -->|remote| Http[HttpEndpoint to provider]

    ModelSvc -. "text + tool calls" .-> Runner
    Runner -->|per-tool policy| Gate{Permission?}
    Gate -->|auto| Invoke[Invoke typed cap]
    Gate -->|consent| Prompt[Prompt user y/n]
    Prompt --> Invoke
    Gate -->|stepUp| Broker[AuthorityBroker step-up]
    Broker --> Invoke
    Gate -->|forbidden| Refuse[Refuse, feed error back]

    Invoke --> Services[Session caps: files, net, spawn, status...]
    Invoke --> Audit[AuditLog]
    Services -. "result" .-> Runner
    Runner -->|role:tool result| ModelSvc

Two principals matter in the capOS-side runner model:

  • Agent runner. Holds the session cap bundle (terminal, home, logs, launcher, approval, model client, etc.). Runs the user-facing loop, talks to the model, applies per-tool permission policy, executes tool calls against its held caps, streams results back to the model, and writes audit. This is the natural daily driver — either the native shell in “agent mode” or a sibling process launched from the shell.
  • Model service. Holds weights, an optional accelerator session, and an optional narrow outbound HttpEndpoint for remote backends. Sees conversation messages; emits text and tool calls. Has no session, no tools, no spawn authority.

The kernel does not need a “model” or “agent” concept. Everything here is ordinary capabilities, processes, and ring traffic.

Browser Agent UI

In a web shell, the agent itself may be the UI. Browser JavaScript may render the conversation, call a provider LLM API directly, receive structured tool calls, and feed tool results back into the model. That mode exists for latency, provider-native browser SDKs, and richer UI composition.

It still does not give browser JavaScript raw capOS capabilities:

flowchart LR
    User[User] --> BrowserAgent[Browser Agent UI<br/>LLM loop]
    BrowserAgent -->|ephemeral provider token| Provider[LLM Provider API]
    Provider -. "text + tool calls" .-> BrowserAgent

    BrowserAgent -->|ToolRequest| Gateway[WebShellGateway<br/>ToolProxy]
    Gateway --> Broker[AuthorityBroker]
    Gateway --> Audit[AuditLog]
    Gateway --> Services[Session caps: files, net, spawn, status...]
    Services -. "typed result" .-> Gateway
    Gateway -. "ToolResult" .-> BrowserAgent
    BrowserAgent -. "tool result" .-> Provider

Authority split:

  • Browser agent UI. Owns presentation, local conversation state, user gestures, optional browser media APIs, and direct provider session state. It holds no capOS caps, no session caps, no tool caps, and no provider long-lived credentials.
  • WebShellGateway tool proxy. Owns the authenticated web transport and the server-side reference to the session bundle. It exposes the current tool descriptor snapshot to the browser, accepts structured ToolRequest values, validates them against the session, enforces broker policy and consent/step-up, invokes the real capOS capabilities, and writes audit.
  • Provider. Sees prompts and tool results only when broker policy allows direct browser provider use for the session’s confidentiality profile.

The browser-agent model is therefore browser-orchestrated but gateway-enforced. It is not a bearer-capability model and not a shortcut around the broker.

The Tool-Use Loop

One capOS-side agent turn:

1. User types a message (or kicks off the first turn from a CLI arg).
2. Runner assembles: system prompt + prior messages + user message +
   the set of ToolDescriptor values the session currently advertises.
3. Runner calls LanguageModel.stream(req). Token stream is rendered to
   the terminal as it arrives.
4. Model response finishes. It contains text (shown) plus zero or more
   ToolCall records (not shown as text; shown as typed tool-call UI).
5. For each ToolCall:
     a. Look up the tool by name. If not in the advertised set, reject
        with a typed error fed back as a role: tool result.
     b. Validate arguments against the tool's paramSchema. Reject if
        malformed; feed the validation error back.
     c. Check the tool's permission mode:
          - auto:      proceed.
          - consent:   render the call + arguments + permission UI;
                       wait for user y/n. Deny feeds a refusal back.
          - stepUp:    request a leased narrow cap from the broker,
                       possibly driving WebAuthn/OIDC step-up. On
                       success, proceed; on denial, feed back.
          - forbidden: reject; feed typed "not permitted in this
                       session" error back.
     d. Invoke the underlying typed capability. Time-box the call.
     e. Truncate/redact the result per tool policy, serialize as a
        role: tool message keyed to the ToolCall id.
6. If the model emitted tool calls, loop back to step 3 with the
   results appended. If it emitted none (or the user interrupted),
   this turn ends.
7. Every executed call produces an audit record.

One browser-agent UI turn:

1. User interacts with the browser agent UI.
2. Browser agent assembles the prompt, prior messages, and the current
   ToolDescriptor snapshot fetched from WebShellGateway.
3. Browser agent calls the provider directly using a broker-minted,
   short-lived, provider-scoped token.
4. Provider response streams into the browser. If it emits ToolCall records,
   the browser wraps each as a ToolRequest to WebShellGateway.
5. WebShellGateway validates the call against the advertised descriptor,
   current session state, nonce/turn binding, quotas, and broker policy.
6. WebShellGateway obtains any required consent or step-up proof, invokes the
   underlying capOS capability server-side, writes audit, and returns a
   ToolResult.
7. Browser agent feeds the ToolResult back to the provider and continues the
   loop until no tool calls remain or the user/gateway cancels.

Browser-originated tool requests are untrusted input even when the agent is the intended UI. The gateway must reject stale descriptors, unknown tools, argument/schema mismatches, replayed turn ids, requests outside the current session profile, and any operation whose consent or step-up proof is missing.

Interactive-agent niceties that fall out of this structure:

  • Streaming. Tokens render live. Tool calls appear as structured widgets, not as text the user has to parse.
  • Interruption. Ctrl-C at any point cancels the in-flight inference (TokenStream.cancel) or the in-flight tool call. The runner decides whether to feed a cancellation message back to the model or end the turn.
  • Auto vs. consent. Reading files, listing directories, querying SystemStatus, reading logs — auto. Writing files, spawning processes, changing service state, sending network requests — consent. Destroying data, running a recovery operation, widening the session’s own caps — stepUp or forbidden.
  • Context management. When the transcript approaches ModelInfo.contextTokens, the runner can summarise older turns (via a second LanguageModel.complete call) and replace them with a compact summary message. This is a runner decision, not a kernel or model feature.
  • Conversation persistence. A conversation is a list of messages plus a reference to the runner’s session; it can be written to a home-scoped file, resumed later, forked, or compared. Persistence is an ordinary capability concern, handled by the runner through whatever Directory/File cap it holds.

Agent Mode is a Mode of the Native Shell

The native shell from Shell is the agent runner. It already holds the session bundle that Boot to Shell mints at login and that Service Architecture hands it as exact-grant spawn input; adding a LanguageModel client cap plus a per-tool permission table gives it “agent mode”. In that mode:

  • Plain user input becomes a chat turn.
  • /cap, /inspect, /exit, and the other existing direct commands stay as direct typed invocations that bypass the model.
  • Tool descriptors are generated by the same schema reflection the shell already needs for its capability REPL.

The per-tool permission modes (auto / consent / stepUp / forbidden) and the runner-side enforcement boundary are the same set the shell proposal cites in Shell. The two documents are intentionally consistent: the shell proposal owns the human-shell surface and direct commands, this proposal owns the model service contract and the tool-use loop. Neither proposal owns both halves.

A separate capos-agent binary is possible for deployments where agent mode is the default (think “bare capOS image with no traditional shell”). It launches from the same login path described in Boot to Shell and under the same supervision rules as any other application service (see Service Architecture), with the same session bundle, and differs only in the surface presented to the user.

Web Agent Mode is a Mode of WebShellGateway

For browser-hosted sessions, WebShellGateway exposes an agent UI protocol instead of making the browser a capOS process. The protocol can be JSON over WebSocket or another web-native framing, but its values mirror ToolDescriptor, ToolCall, and ToolResult from this proposal.

Gateway responsibilities:

  • issue short-lived provider credentials only when policy allows direct browser LLM access;
  • bind the tool descriptor snapshot to a session id, conversation id, turn id, expiration, and browser connection;
  • execute tools only through server-side session caps;
  • enforce low-risk consent, mutating consent, and destructive stepUp server-side;
  • return redacted/truncated tool results according to tool policy;
  • revoke or expire provider tokens where the provider supports it, reject new tool requests on logout, timeout, tab close, session downgrade, or policy change, and record any browser-held provider session that can only be terminated best-effort.

Browser responsibilities:

  • render the agent UI and any consent prompts supplied by the gateway;
  • preserve provider session state only as long as the gateway session is live;
  • submit structured tool requests, never raw capability invocations;
  • treat gateway denials, cancellation, and revocation as authoritative.

For mutating or destructive tools, a browser click is not enough by itself. The gateway needs a fresh server-side consent challenge or a broker-issued step-up lease tied to the exact tool name, arguments, conversation, turn, and expiration. Low-risk read-only tools may use auto execution when broker policy allows.

Prompt Injection as a First-Class Concern

Model inputs include untrusted data: file contents, log lines, web pages fetched via a tool call, Aurelian Frontier NPC dialogue, output from previously executed tool calls. Every such input is wrapped in a role: user or role: tool message with explicit provenance, never concatenated into a system prompt. The runner never parses assistant free text as a command, and the gateway never treats browser-submitted free text as a capability request. Only structured toolCalls / ToolRequest values can reach the tool execution path.

A user can paste rm -rf / at the model; the model can repeat it back; nothing happens, because there is no code path that interprets text as a command. A web page can instruct the model to exfiltrate secrets; the model cannot use capOS resources except through the advertised tool set, and sensitive tools are gated by consent/stepUp. If the browser agent has ordinary web-network reachability, broker policy must treat prompts and tool results as exposed to that browser/provider boundary and deny direct browser mode for sessions where that is unacceptable.

Capability Contract

Additions to schema/capos.capnp (exact method IDs and argument packing belong to the implementation PR; the shapes below are the contract):

# Conversation inputs/outputs are plain data. They carry no authority.

struct ChatMessage {
  role @0 :Role;
  content @1 :Text;
  # For role:tool, the id of the ToolCall this message answers.
  toolCallId @2 :Text;
  # For role:assistant messages that included tool calls, the list
  # of calls the model proposed.
  toolCalls @3 :List(ToolCall);
}

enum Role {
  system @0;
  user @1;
  assistant @2;
  tool @3;
}

struct ToolDescriptor {
  name @0 :Text;
  description @1 :Text;
  # Capability and method the runner will invoke if this tool fires.
  interfaceId @2 :UInt64;
  methodName @3 :Text;
  # JSON-Schema or equivalent describing the argument object.
  paramSchema @4 :Text;
  # Permission mode. Enforced by the runner, surfaced to the model as
  # hint metadata so the model can explain or avoid risky calls.
  permission @5 :PermissionMode;
  # Tool category for audit and policy filters.
  category @6 :Text;
}

enum PermissionMode {
  auto @0;        # Runner executes without user prompt.
  consent @1;     # Runner prompts user before execution.
  stepUp @2;      # Runner requests broker step-up before execution.
  forbidden @3;   # Advertised for explanation only; never executed.
}

struct ToolCall {
  id @0 :Text;          # Unique within the conversation.
  name @1 :Text;
  # Arguments serialised as JSON (or capnp AnyPointer in a later
  # revision). The runner validates against paramSchema.
  arguments @2 :Text;
}

struct ToolResult {
  callId @0 :Text;
  outcome @1 :Outcome;
  content @2 :Text;     # Possibly truncated / redacted by the runner.
  error @3 :Text;       # Set when outcome != ok.
}

enum Outcome {
  ok @0;
  refusedByPolicy @1;
  deniedByUser @2;
  stepUpFailed @3;
  executionError @4;
  timedOut @5;
  cancelled @6;
  invalidArguments @7;
  unknownTool @8;
}

struct InferenceRequest {
  messages @0 :List(ChatMessage);
  tools @1 :List(ToolDescriptor);
  maxTokens @2 :UInt32;
  temperature @3 :Float32;
  stopSequences @4 :List(Text);
  # Optional JSON-Schema for final-assistant structured output.
  responseSchema @5 :Text;
  # Stable correlation id for audit.
  nonce @6 :Data;
}

struct InferenceResponse {
  message @0 :ChatMessage;   # role:assistant, may include toolCalls.
  usage @1 :TokenUsage;
  finishReason @2 :FinishReason;
}

interface LanguageModel {
  info @0 () -> (info :ModelInfo);
  complete @1 (req :InferenceRequest) -> (resp :InferenceResponse);
  # Streaming variant emits token chunks and tool-call deltas as they
  # are decoded. Cancellation aborts decoding.
  stream @2 (req :InferenceRequest) -> (stream :TokenStream);
}

interface TokenStream {
  next @0 () -> (chunk :StreamChunk, done :Bool);
  cancel @1 () -> ();
}

struct StreamChunk {
  textDelta @0 :Text;
  toolCallDelta @1 :ToolCallDelta;   # partial structured tool call
}

interface TextEmbedder {
  info @0 () -> (info :ModelInfo);
  embed @1 (texts :List(Text)) -> (vectors :List(Vector));
}

struct ModelInfo {
  id @0 :Text;           # Content-addressed weight digest + arch tag.
  displayName @1 :Text;
  arch @2 :Text;         # "llama", "qwen", "phi", etc.
  contextTokens @3 :UInt32;
  outputTokens @4 :UInt32;
  backend @5 :Text;      # "local-cpu", "local-gpu", "remote-openai", ...
  quantisation @6 :Text; # "fp16", "q4_k_m", ...
  supportsTools @7 :Bool;
}

# Administrative surface. Not granted to normal sessions.
interface ModelCatalog {
  list @0 () -> (models :List(ModelInfo));
  openLanguageModel @1 (id :Text) -> (model :LanguageModel);
  openEmbedder @2 (id :Text) -> (embedder :TextEmbedder);
}

interface ModelAdmin {
  loadWeights @0 (source :ReadOnlyFile, info :ModelInfo) -> (id :Text);
  unload @1 (id :Text) -> ();
  setBackendPolicy @2 (policy :BackendPolicy) -> ();
}

The web-shell protocol should expose a non-capability tool proxy with the same data shapes. Exact framing belongs to the WebShellGateway milestone:

describeTools(session, conversation) -> List(ToolDescriptor)
invokeTool(session, conversation, turn, descriptorSnapshot, ToolCall)
    -> ToolResult
cancelTurn(session, conversation, turn) -> ()

This is intentionally not a LanguageModel method and not a capOS capability handle passed to the browser. It is an authenticated web transport endpoint whose implementation invokes real session caps only after gateway/broker checks pass.

What is deliberately absent

  • No method on LanguageModel accepts a capability argument. The model never holds a live cap to a user resource.
  • No method returns a capability that could be invoked outside the model service (TokenStream is the one exception and is scoped to the current response).
  • No “run this tool for me” method on LanguageModel or any model service. Tool execution is the runner’s or gateway tool proxy’s job. The model only names tools.
  • No PlannerAgent / ActionPlan / dispatcher interface. Planning, if it happens, is something a model does inside one of its responses; it is not a separate typed product.
  • No “agent shell interface” served by the model. In the capOS-side model, the shell is the runner and capability holder; in the browser-agent model, WebShellGateway is the capability holder.

The Agent Runner

This section describes the capOS-side runner. Browser-hosted sessions use the WebShellGateway tool proxy described above instead of placing the runner and session caps in browser JavaScript.

The runner is an ordinary userspace process (native shell in agent mode, or capos-agent) that holds:

  • The session cap bundle, unchanged from the shell proposal.
  • A LanguageModel client cap issued by the broker.
  • A ModelInfo read-only view for rendering model identity.
  • A ConversationStore cap (when one exists) for persistence.

It does not hold ModelCatalog or ModelAdmin — those are administrative. If a session wants to switch models mid-run, the broker issues a new LanguageModel cap.

Building the Tool Table

On startup (and after any cap-set change), the runner walks its own session bundle and produces ToolDescriptor values through schema reflection over the advertised capabilities’ interfaces. It applies the broker-supplied per-tool permission map keyed by (category, methodName):

read-only       -> auto
mutating local  -> consent
destructive     -> stepUp
outbound net    -> consent (unless profile allows auto)
admin-class     -> forbidden (for non-operator sessions)

The runner is free to suppress tools entirely for a given conversation (for example, never advertise ServiceSupervisor.restart for a guest session, even though the descriptor set could carry a forbidden entry). Suppression is sometimes clearer than presenting an unusable tool to the model.

The Loop State Machine

Idle
  │  user turn arrives
  ▼
AssemblingRequest ── tool-descriptor snapshot ─► Inferring (LanguageModel.stream)
  ▲                                                     │
  │ tool result appended                                │ model finishes
  │                                                     ▼
ExecutingCalls ◄─── one call at a time ───────── HasToolCalls?
     │ per call: gate → execute → audit               │ no
     └──────────────────────────────────────────┐     ▼
                                                │    Idle
                                                ▼
                                    (any denial / cancel is
                                     an outcome fed back)

Timeouts are enforced at three levels: per-tool (so a slow capability does not block the loop forever), per-turn (bounded number of iterations to prevent runaway), and per-session (token and wall-clock budgets from the broker).

Conversation State

A conversation is List(ChatMessage) plus a ModelInfo.id, the effective ToolDescriptor table at each turn, and the audit trail. The runner keeps it in its own process memory during a session and may persist it through a ConversationStore cap (when that exists; see open questions). No conversation state lives in the model service; the service is stateless across requests.

The Built-in Local Model

capOS ships with a small local language model so that:

  • First boot has a working agent without remote network.
  • The adventure and chat demos can have a real local NPC brain rather than hard-coded strings.
  • Offline and air-gapped deployments remain viable.
  • The capability surface has a real local implementation to validate against before remote backends are wired up.

Constraints

  • Size budget. A 1–3 B parameter quantised model (q4_k_m-class) fits in 0.7–2.0 GiB. That is too large for manifest.bin embedding (2.75 MiB cap) and forces the ISO filesystem path — see the Boot Binary ISO Layout item in docs/backlog/hardware-boot-storage.md. Weights are the first non-binary consumer of the ISO file path.
  • Tool calling. The model must be a tool-use-capable instruction tune (a chat-tuned model without reliable tool-call formatting cannot drive the loop). ModelInfo.supportsTools flags this.
  • Backend. First implementation is CPU-only, portable Rust inference. Candidates include candle (needs no_std survey), a minimal hand-rolled GGUF loader + matmul kernel, or a vendored subset of a permissively licensed engine. Final choice is an implementation decision, not a proposal decision; the capability surface is implementation-agnostic.
  • Precision. q4_k_m or q5_k_m quantised GGUF. fp16 is a later optimisation gated on either SIMD-friendly CPU support or GPU acceleration.
  • Context window. 4 K–8 K tokens at first. Enough for short agent sessions; long-document summarisation is a later workload that may require a different model or aggressive runner-side compaction.
  • Attestation. Weights are signed (see Cryptography and Key Management) and the signature is verified at load. The content-addressed digest becomes the ModelInfo.id.

Boot Flow

  1. ISO driver (pending the Boot Binary ISO Layout item in docs/backlog/hardware-boot-storage.md) exposes /boot/models/<name>.gguf as an ordinary file.
  2. Kernel or a privileged loader service constructs a read-only file-backed MemoryObject over the weights file. Read-only shared frames let multiple model worker processes map the same weights without copies.
  3. model-loader service (started from the manifest) verifies the signature, registers the model in ModelCatalog, and keeps a retained handle to the weights MemoryObject.
  4. On demand, ModelCatalog.openLanguageModel(id) spawns (or returns a handle to) a worker process holding the weights, an inference kernel, and — if policy allows — a GpuSession or a remote HttpEndpoint.

Weights never live in the manifest blob. The ISO layout work is the prerequisite, and this proposal is its first forcing use case larger than a few megabytes.

Page Cache Coupling

Multiple sessions sharing one model benefit from a page cache over the weights file: the first access faults in, subsequent accesses hit cache, and the pages are shared read-only across all worker processes. This is the same primitive that makes ELF text-segment sharing useful, and it should be implemented once in the ISO/file-backed-memory path rather than specialised per consumer.

CapOS-Side Backends

CapOS-side backends sit behind LanguageModel / TextEmbedder. The worker process loads exactly one backend per instance. Browser direct-provider mode is a separate web transport mode described below; it is not a LanguageModel worker backend.

Local CPU

  • File-backed read-only weights mapped from ISO or storage.
  • No accelerator caps. No network caps.
  • Bounded per-call token budget enforced by the worker; broker sets per-cap quotas.

Local GPU

  • Holds a GpuSession from the GPU capability proposal.
  • Holds a read-only MemoryObject for the weights; uploads to GPU memory at load time through GpuBuffer.
  • Still no network. Still no session cap.

Remote Provider

  • Holds one narrow HttpEndpoint scoped to a single provider origin (for example an Ollama instance on the local network, or an external API gateway). The endpoint is issued by the broker; the model worker cannot widen it.
  • Holds provider credentials only as token-typed capabilities (OAuth AccessToken wrapped as a cap, never exposed as a bearer string — see OIDC and OAuth2 proposal).
  • The model worker process is still the principal that talks to the remote; the runner never sees provider credentials.
  • Treated as untrusted: outbound request/response logging is mandatory when operator policy requires audit of off-device inference.

NPU / Future Accelerators

Same shape. Add a scoped NpuSession cap analogous to GpuSession when the hardware abstraction for it exists.

Browser Direct-Provider Mode

  • Browser receives only a broker-minted ephemeral credential scoped to one provider, model/config, session, conversation, and short expiration.
  • The credential contains no capOS capability material and cannot be exchanged for session caps.
  • The browser may run the provider’s JavaScript/WebRTC/WebSocket client and orchestrate the LLM loop.
  • Tool execution still goes through WebShellGateway’s tool proxy; provider tool declarations must match the gateway-advertised descriptors for that turn.
  • Broker policy may deny this mode for sessions whose prompts, tool results, labels, or audit requirements cannot leave the capOS-side trust boundary.
  • Logout, tab close, timeout, or session downgrade authoritatively closes the capOS session and rejects future tool requests. Provider token/session revocation is authoritative only when the provider exposes a server-side revocation or session-close API; otherwise it is best-effort and must be audited as such.

Policy and the Broker

AuthorityBroker gates every model interaction:

  • Which session profiles get a LanguageModel cap at all (operator: yes; anonymous: usually no; guest: local-only, no remote providers).
  • Which backend resolves an openLanguageModel(id) call for this session (local-only for unclassified work; remote permitted for operators who opted in and passed step-up auth).
  • Rate and token-budget limits per session and per principal.
  • The per-tool permission map the runner applies when building its tool table, or that WebShellGateway applies before publishing descriptor snapshots to a browser agent. This is the main policy knob: an anonymous session might get only read-only tools as auto; an operator session gets consent on mutating tools and stepUp on destructive ones.
  • Outbound-network egress policy for remote backends.
  • Whether direct browser provider access is allowed for this session, and which prompts, transcripts, tool descriptors, and tool results may cross that browser/provider boundary.
  • PII / confidentiality labels: a session labelled MAC/MIC-high may be denied remote inference entirely because prompts would cross the confidentiality boundary (see Formal MAC/MIC).

The broker’s decisions are recorded in audit. The model service itself performs no policy checks — it is an execution backend.

Audit and Provenance

Every executed tool call audit record includes:

  • Session ID, principal, conversation ID, turn index, tool-call ID.
  • Model identity (ModelInfo.id), backend, request nonce.
  • Runner location (capos-side or browser-agent-ui) and gateway session id when a browser agent proposed the call.
  • Advertised tool descriptor at the time of the call (name, paramSchema, permission mode).
  • Exact typed arguments.
  • Permission decision (auto, consented, denied, step-up-succeeded, step-up-failed, forbidden).
  • Tool outcome, truncated result hash, and error if any.

Optional per-session conversation-level records capture message metadata (role, timestamp, length, hash) without requiring full prompt content to be stored — the classification policy decides how much content is retained.

This lets an operator answer “what did the agent do on my behalf last week, which model produced each call, and which tools were visible” without replaying prompts from logs the model service does not hold.

Threat Model

Assumed hostile:

  • Prompts, retrieved documents, web pages, and tool-call outputs.
  • Model weights from unknown sources (mitigated by weight signing and ModelInfo.id attestation).
  • The model worker process itself — treated as a semi-trusted data transformer, isolated by its narrow CapSet.
  • Browser JavaScript, browser extensions, DOM state, browser-held provider sessions, and browser-agent UI code. They may be the intended user interface, but their tool requests are untrusted inputs to the gateway.

Assumed trustworthy (with attestation):

  • The kernel, the capOS-side runner when used, WebShellGateway’s server-side tool proxy, the broker, the ISO driver, and the loader.

Out of scope (covered by other proposals or tracks):

  • Side-channel leakage through cache timing on shared accelerators — follow work on GPU tenant isolation in the GPU proposal.
  • Model-backdoor detection — an ecosystem problem, not a kernel one; capOS only guarantees that a compromised weights file cannot escape its worker process’s CapSet.

Integration with Existing Workloads

  • Operator <-> agent messaging is a Chat channel. “Operator sends a prompt to a running agent” and “agent emits a partial response stream” are events on a chat per Chat As Multimedia Substrate. The agent’s prompt channel is reachable through the substrate’s ordinary cross-principal contact paths: chat-server bundle hooks (operator session ships with a GroupOwner/GroupMember of the agent-prompt group it provisioned), a ChatDirectory discoverable entry, an Owner/Admin-issued Group.invite token, or a Self.contact() cap the agent’s owner shared. There is no protocol-level “request approval to write to a stranger” path: ApprovalClient is for confirming an action the caller already has authority to attempt, not cold-call admission. Tool-call consent prompts that the runner needs to surface to the operator appear on the same chat as kind=approvalRef events, with the live ApprovalGrant cap traveling by capnp-rpc cap reference (not as bytes inside the message data). The model never holds the chat role cap, the listener cap, or the approval grant.
  • Adventure demo. NPC processes can hold a narrow LanguageModel cap scoped to small prompt budgets, producing in-character lines instead of canned strings. Chat rooms can feed the demo through a runner variant without session-level tools.
  • Boot-to-shell first-use. The first-boot path in Boot to Shell can offer an agent-assisted setup flow (“help me configure the network stack”) once the runner is wired up and the operator session profile produced by Boot to Shell includes the model cap and the right tool permission map. The agent runs as a mode of the native shell that login already launches; no separate “setup agent” service is introduced.
  • Log and metric summarisation. LogReader becomes a consent-gated tool in the runner’s tool table. The model asks for “last hour of auth errors”; the runner executes, truncates, feeds back. The model never holds LogReader itself.
  • Semantic search over directories. TextEmbedder + a vector index service (future) lets home/docs-scoped search work through a search tool advertised by the runner, without ambient file access for the model.

Implementation Phases

Phase 0 — Prerequisites

  • ISO 9660 driver + file-backed read-only MemoryObject (docs/backlog/hardware-boot-storage.md and the follow-on file-backed memory work).
  • Page cache over file-backed memory.
  • HttpEndpoint scoped-origin fetch (networking proposal Phase B).
  • AuthorityBroker and ApprovalClient wiring, as defined in Boot to Shell and consumed by the shell in Shell.
  • Schema reflection sufficient to build ToolDescriptor values.

Phase 1 — Capability scaffolding

  • Add LanguageModel, TextEmbedder, ModelInfo, ModelCatalog, ModelAdmin, ToolDescriptor, PermissionMode, ToolCall, ToolResult, InferenceRequest, InferenceResponse, StreamChunk, TokenStream to schema/capos.capnp.
  • Generate bindings via existing tools/capnp-build.
  • Stub language-model service process with a deterministic canned-tool-call backend so the runner loop can be exercised without any real inference.
  • make run-agent smoke: shell in agent mode runs a scripted conversation through the stub, exercises auto / consent / stepUp / forbidden gates, and exits cleanly.

Phase 2 — Built-in local model

  • Choose a CPU inference engine and vendor it.
  • Ship one tool-use-capable quantised model in iso_root/boot/models/ as a content-addressed GGUF with a signature.
  • Loader service verifies signature, maps weights, registers in ModelCatalog.
  • First real tool-use loop with a local model.

Phase 3 — Runner features

  • Streaming render into TerminalSession with interrupt support.
  • Context-budget compaction (summarise older turns via a secondary inference call).
  • Per-tool consent UI.
  • Audit integration.
  • Conversation persistence through a ConversationStore cap.
  • WebShellGateway tool proxy: descriptor snapshots, turn binding, replay rejection, server-side consent/step-up enforcement, and browser-agent-proposed audit records.

Phase 4 — Backends

  • GPU backend wired through GpuSession.
  • Remote-provider backend wired through HttpEndpoint + token-typed capability. One concrete provider (for example local Ollama) as the proof.
  • Broker policies for backend selection.
  • Browser direct-provider mode: broker-minted ephemeral credentials, short token expiry, provider revocation/close when supported, audited best-effort teardown otherwise, and a web-agent smoke that proves browser-orchestrated tool calls are executed only through WebShellGateway.

Phase 5 — Hardening and features

  • Structured-output (JSON/capnp) validation against responseSchema.
  • Embedding-backed retrieval service (TextEmbedder + vector store).
  • Prompt redaction for MAC/MIC-high sessions.
  • Audit replay tooling.
  • Step-up integration with the broker’s WebAuthn/OIDC paths.

Phase 6 — Applications

  • Agent-assisted adventure NPCs with per-NPC caps.
  • Agent-assisted first-boot setup flow.
  • Log-summarisation and monitoring assistant.
  • Optional: agent mode over the POSIX compatibility layer, once that exists.

Dependencies

Hard prerequisites:

  • ISO filesystem driver and file-backed MemoryObject (docs/backlog/hardware-boot-storage.md plus file-backed memory follow-on).
  • AuthorityBroker and ApprovalClient, as defined in Boot to Shell and consumed by the shell in Shell.
  • WebShellGateway authenticated transport and server-side session tracking, as defined in Boot to Shell.
  • ProcessSpawner with exact-grant child launch, as described in Service Architecture (done).
  • Schema reflection / SchemaRegistry.
  • Cap’n Proto schema evolution tooling (done).

Soft / enables richer behaviour:

  • GPU capability proposal for GPU backend.
  • OIDC/OAuth2 proposal for remote-provider credentials and step-up authentication.
  • WebAuthn/passkey support for browser step-up on destructive tools.
  • Cryptography/KMS proposal for weight signing.
  • System monitoring proposal for audit integration.
  • Formal MAC/MIC proposal for high-confidentiality session policy.

Non-Goals

  • No kernel-side model awareness.
  • No ambient “AI” privilege anywhere.
  • No model-issued capabilities.
  • No long-lived bearer-token exposure to the runner or browser. Browser-agent UI mode may use only short-lived provider-scoped credentials.
  • No promise that any particular model size, license, or benchmark score ships in-tree — the choice is an implementation decision gated by the trusted-build-inputs process.
  • No plan/approve/execute pipeline as the primary interaction (explicitly superseded by the tool-use loop).
  • No claim that capOS offers strong defences against model-internal adversarial attacks (jailbreaks, refusal bypass). The capability model defends the system, not the model’s own behaviour.

Open Questions

  • Should tool arguments be JSON (matches provider ABIs like OpenAI tools / Anthropic tools) or capnp AnyPointer (matches capOS wire format)? Proposed: start with JSON for compatibility with remote providers and because local GGUF tool-use tunes are JSON-trained, and add a capnp fast path later.
  • How are conversations named, persisted, and resumed? A ConversationStore cap with TTL is the sketch, but the storage proposal needs an update before this is concrete.
  • What is the smallest credible local model that still drives the tool-use loop reliably for capOS-internal tasks (file edits, status summaries, NPC dialogue)? Below a threshold, better to ship no default model and require explicit configuration.
  • How should streaming back-pressure compose with ring cap_enter completion limits? A single response can produce many small CQEs.
  • When consent prompts pile up in a long turn, how should the runner offer “approve-once” vs. “approve-for-this-turn” vs. “approve-for-this-session” without widening authority beyond what the user intended? A per-session “always allow this tool” allow-list, cleared at session end, is a reasonable starting point.
  • Should the runner ever let the model read tool descriptors for tools it cannot execute (forbidden), so the model can explain why it can’t help, or should those be suppressed entirely?
  • Does the built-in model warrant its own trust anchor in the weights signing chain, or should it share the system trust store? Likely share, with a dedicated key purpose (see cryptography proposal).
  • Which web-shell profiles should allow browser-agent UI mode by default? Operator sessions may want it for latency and provider UX; high-label or audit-strict sessions should probably force capOS-side provider mediation.
  • How should the gateway prove fresh user presence for browser-agent approvals without trusting arbitrary JavaScript events? WebAuthn/passkey step-up handles destructive tools; low-risk consent still needs a concrete freshness rule.

Proposal: capOS-Hosted Agent Swarms

capOS should eventually host OpenClaw-like personal agents and multi-agent workflows as ordinary capability-scoped services. The existing Language Models and Agent Runtime proposal defines the model capability surface and the single-session tool-use loop. This proposal covers the layer above it: long-lived hosted agents, workspace and memory layout, swarm orchestration, agent-to-agent coordination, and harness controls.

The first credible implementation is not a general “AI computer”. It is a controlled service graph:

  • user-facing ingress through native shell, SSH/WebShellGateway, chat channels, webhooks, or scheduled triggers;
  • a trusted capOS runner that owns session capabilities and enforces tool gates;
  • narrow agent workers that receive only task-local workspace, retrieval, and tool caps;
  • explicit memory and wiki services instead of hidden prompt state;
  • durable task records, review gates, and attribution for multi-agent work.

This belongs outside the shell proposal. Shell mode remains one interactive runner surface. Hosted agents need persistent service state, remote ingress, work queues, memory compaction, swarm scheduling, and audit rules that would make the shell proposal too broad.

Research Baseline

Sources reviewed for this design:

There is substantial low-quality agent SEO around OpenClaw and related systems. This proposal relies on primary docs, official project pages, arXiv papers, and DeepWiki pages only as secondary codebase summaries. News and social reports may motivate later risk research, but they are not treated as design authority.

What Current Agent Harnesses Actually Do

The useful pattern is not “model plus tools”. It is a harness that controls what the model can inspect, what it can change, how work survives context loss, and where human approval enters the loop.

OpenAI’s harness engineering writeup is the cleanest framing for capOS: repository-local, versioned artifacts are what the agent can reason about; knowledge in chat threads, documents, and people’s heads is effectively absent unless compiled into files, schemas, tests, and executable plans. The same post argues for mechanically enforced architecture, validated boundaries, and agent-legible systems over ad-hoc documentation. The 2026 Agents SDK direction adds an explicit model-native harness, controlled workspaces, sandbox execution, filesystem tools, MCP, skills, AGENTS.md-style instructions, shell execution, and structured patch tools.

OpenClaw shows the personal-agent product shape:

  • local-first channel ingress through chat apps, webhooks, cron, and a gateway;
  • a gateway security boundary for channels and tool execution;
  • an agent runtime with a workspace as the default tool cwd;
  • injected bootstrap files such as AGENTS.md, TOOLS.md, USER.md, and identity/persona files;
  • built-in read, exec, edit/write, browser, web, process, memory, and skill surfaces;
  • a browser harness with managed profiles, snapshots, screenshots, action refs, CDP routing, and optional arbitrary JavaScript evaluation;
  • an exec harness with host selection (sandbox, gateway, node), security modes (deny, allowlist, full), approval prompts, timeouts, background sessions, PTY support, process polling, and path/env restrictions;
  • markdown memory where files are the source of truth, plus semantic search, line-range reads, SQLite indexes, local/remote embeddings, and hybrid search;
  • per-agent workspaces, sandbox settings, and tool allow/deny lists.

The important negative lesson is also explicit in OpenClaw’s docs: a workspace is not automatically a sandbox. If sandboxing is off, absolute paths and host tools can still reach outside the workspace. capOS should not reproduce that ambiguity. A capOS agent workspace must be a capability namespace by default, not a convention over a host filesystem.

DeepWiki’s accessible summaries add useful implementation-level signals:

  • OpenClaw exposes tools as functional capabilities and skills as modular SKILL.md extensions, with a personal-assistant trust model, security audit, and sandboxing options.
  • OpenClaw memory skills converge on durable, retrievable, self-maintaining memory because a single growing MEMORY.md overflows context and loses structure.
  • OpenClaw web/browser docs describe dedicated managed browser profiles, CDP control through the gateway, SSRF checks, provider-backed web search, fetch normalization, and active memory integration.
  • OpenManus uses a think-act cycle with tool execution, multi-provider LLMs, MCP integration, and sandboxed code/browser automation.
  • Microsoft Agent Framework and AutoGen emphasize graph/workflow orchestration, checkpointing, human-in-the-loop, event-driven actor-style communication, distributed runtimes, tools, memory, observability, and MCP/A2A integrations.

For this repository itself, applying OpenAI-style harness engineering means turning capOS’s docs, workplans, run targets, QEMU proofs, proposal statuses, research notes, and schema authority semantics into mechanically navigable agent inputs. That repository-local work is owned by capOS Repository Harness Engineering, with source grounding in Hosted agent harnesses.

Product Goal

The visible milestone is:

make run-hosted-agent boots capOS in QEMU, starts a resident hosted-agent service graph, accepts a scripted user request, creates a task-local workspace, runs one or more bounded agent workers through a deterministic model service, uses retrieval/wiki context, executes one read-only tool automatically, requires approval for one mutating tool, records attributed audit output, and shuts down without leaking session, model, or host authority to the worker.

Later milestones add real model backends, web ingress, chat ingress, browser automation, multi-agent swarms, and remote/provider interoperability.

Design Principles

  1. Harness first, model second. The hosted-agent service is primarily a control plane for workspaces, tools, memory, approvals, lifecycle, and audit. Model selection is a replaceable backend decision.

  2. Agents are processes with caps, not identities with ambient power. An agent worker has exactly the caps minted for one session, task, and phase. It does not inherit the operator’s whole world.

  3. All tool execution is mediated. The model proposes structured tool calls. The runner validates descriptors, arguments, turn binding, policy, budget, and approval before invocation.

  4. Memory is an artifact, not a hidden model property. Durable facts, summaries, task logs, and wiki pages live in capability-scoped files or services with provenance, review status, and retention policy.

  5. Swarm work is durable structured data. Tasks, assignments, handoffs, reviews, votes, failures, and merge decisions must outlive any model context window.

  6. Human review is a capability gate. The system should support both high-autonomy local demos and conservative operator policy, but destructive or authority-widening actions require explicit fresh consent or step-up.

  7. Remote agent interoperability is data-plane only at first. MCP and A2A style bridges may expose descriptors and messages, but they do not carry raw capOS authority.

  8. CapOS should be stricter than desktop harnesses. Browser profiles, shell execution, provider credentials, memory stores, and file workspaces are separate capabilities with narrow lifetime and auditable grants.

  9. Shared resources need coordination objects. A git repo, task queue, wiki, browser profile, or shared todo list is not just a file path. The agent harness must expose owners, leases, versions, watches, and conflict reports before workers mutate shared state.

  10. Incoming agent messages are untrusted work items. A chat message from another agent can carry status, questions, handoffs, artifacts, or requests. It must not directly alter prompt state, execute tools, widen caps, or override task policy.

System Topology

flowchart LR
    User[User / channel / cron / webhook] --> Gateway[Ingress Gateway]
    Gateway --> Broker[AuthorityBroker]
    Broker --> Host[HostedAgentService]

    Host --> Task[AgentTask<br/>durable state]
    Host --> Runner[AgentRunner<br/>trusted tool gate]
    Host --> Memory[AgentMemory<br/>wiki + logs + search]
    Host --> Model[LanguageModel<br/>local or remote backend]
    Host --> Scheduler[SwarmScheduler]

    Scheduler --> W1[Worker process<br/>task workspace caps]
    Scheduler --> W2[Worker process<br/>task workspace caps]
    Scheduler --> R[Reviewer process<br/>read + critique caps]

    Runner --> Tools[Typed capOS tools]
    Runner --> Approval[ApprovalClient]
    Runner --> Audit[AuditLog]
    Memory --> Store[(Workspace / Wiki / Vector Index)]

The kernel does not need agent semantics. It needs process isolation, endpoint invocation metadata, MemoryObject/file-backed storage, capability transfer, and resource accounting. The agent system is a userspace service graph.

Core Capabilities

HostedAgentService

Owns hosted-agent lifecycle for one broker policy domain:

  • create a task from a user request, webhook, schedule, or shell command;
  • allocate a task workspace and memory scope;
  • select a model profile and runner policy;
  • start workers with exact-grant capsets;
  • enforce task budgets and cancellation;
  • publish task status to shell, web, or chat surfaces;
  • close, archive, or purge task state.

AgentTask

Durable task record:

  • request, normalized objective, requester session reference, and ingress provenance;
  • workspace root cap, memory scope cap, allowed tools, and budgets;
  • model profile and harness version;
  • worker assignments and state transitions;
  • links to artifacts, audit records, approvals, and review results;
  • terminal status (open, blocked, needsApproval, reviewing, done, failed, cancelled, expired).

AgentRunner

Trusted loop executor:

  • builds tool descriptors from held caps and broker policy;
  • calls LanguageModel.stream or complete;
  • validates structured tool calls;
  • applies schema-guided reasoning templates for planner/reviewer tasks;
  • runs guard checks before and after tool execution;
  • truncates and redacts tool results;
  • appends conversation and action records;
  • handles cancellation, timeout, retry, and model failure.

AgentMemory

Information organization layer:

  • append-only daily task log;
  • curated long-term project memory;
  • source store for immutable raw inputs;
  • LLM-maintained wiki pages with source citations;
  • index and log files for cheap navigation;
  • optional BM25/vector hybrid search and reranking;
  • stale/contradiction/orphan-page lint;
  • per-session and per-project visibility controls.

SwarmScheduler

Multi-agent orchestration:

  • decomposes work into durable sub-tasks;
  • assigns workers by role, available caps, model profile, and track record;
  • creates task-local worktrees or equivalent namespace forks for code work;
  • supervises handoff and timeout;
  • asks reviewer workers for critique under read-only or constrained write caps;
  • emits merge/release requests only after gates pass.

Workspace Model

Desktop harnesses commonly treat a workspace as a cwd convention. capOS should treat a workspace as a capability namespace:

  • WorkspaceRoot: scoped directory-like cap for a task.
  • SourceMount: read-only cap to immutable sources.
  • Scratch: writeable temporary storage with quota and TTL.
  • ArtifactOutbox: explicit export path for user-visible artifacts.
  • PatchSet: structured edit proposal, not arbitrary writes by default.
  • SecretsView: normally absent; if present, returns typed opaque handles, not strings.

Default policy:

  • read-only source mounts unless the task explicitly asks for edits;
  • no absolute path escape because there is no global filesystem path;
  • generated artifacts are quarantined until reviewed or explicitly released;
  • tool outputs are capped and stored with provenance;
  • workspaces expire unless promoted to project memory.

This makes OpenClaw-style sandbox versus host ambiguity unnecessary. Authority is not inferred from where a command happens to run.

Shared Resource Coordination

Agent swarms fail in ordinary repositories and shared task lists when every worker believes it is alone. capOS should model shared resources explicitly:

  • SharedResource: git repository, task list, wiki page tree, browser profile, memory store, package cache, or external service account.
  • ResourceLease: exclusive or shared claim with owner, task, phase, scope, expiry, renewal policy, and release reason.
  • ResourceVersion: observed revision, generation, branch head, page hash, or compare-and-swap token.
  • ResourceWatch: subscription to resource updates, lease changes, conflicts, and merge/release queue events.
  • ConflictReport: structured notice that two tasks touched the same file, todo item, wiki page, browser profile, credential scope, or external object.

Minimum policy:

  • leases are coordination metadata, not write authority; mutation still requires the relevant workspace, patch, tool, or service cap;
  • every mutating task declares the resource scopes it expects to touch;
  • exclusive resources reject overlapping leases unless a supervisor approves a shared mode;
  • shared resources require versioned writes or patch sets;
  • stale leases expire and emit events instead of silently blocking work;
  • workers receive conflict reports as structured context, not as informal chat;
  • merge/release queues serialize publication to user-visible state;
  • audit records include resource scope, observed version, write version, and approving actor.

Concrete resource policies:

  • Git repositories: one task worktree and branch per worker, path/subsystem claims for high-conflict areas, merge queue before mainline publication, and conflict reports when another task changes claimed paths.
  • Shared todo lists: item-level claims, item generation numbers, compare-and-swap updates, and supervisor escalation for duplicate ownership.
  • Wiki and memory pages: page leases or patch sets, source citations, contradiction checks, and freshness labels before compiled memory becomes trusted context.
  • Browser profiles: exclusive lease by default because cookies, local storage, downloads, and screenshots collapse many unrelated authorities.

For capOS repository work specifically, this maps to the existing requirement that each agent uses a dedicated branch and worktree. A future harness should make that visible through an active-work registry, claimed resource scopes, review findings, and merge-queue state instead of relying on each agent to infer it from git state and chat history.

Agent Inboxes and Inter-Agent Messages

Free-form peer chat is useful for coordination, but it is a poor authority boundary. capOS should deliver messages through an explicit AgentInbox capability owned by the runner or task, not by direct prompt injection.

An incoming message should be a structured AgentMessage event:

id: msg-...
sender: agent-or-peer-id
sender_task: task-...
recipient_task: task-...
kind: status
# status | question | handoff | reviewFinding | resourceEvent |
# artifactReady | approvalRequest | interrupt
causal_parent: msg-or-task-event-id
body: bounded markdown or structured payload
artifact_refs:
  - artifact-...
requested_actions:
  - proposed action descriptor
requested_authority:
  - capability descriptor, never a raw cap
expires_at_unix_ms: 1893456000000

Delivery rules:

  • the runner validates sender identity, task relationship, size, schema, expiry, and policy before the model sees the message;
  • message ids are deduplicated per sender and task within a bounded replay window;
  • old causal parents, duplicate approval requests, and duplicate interrupts are quarantined instead of redelivered;
  • per-sender and per-task quotas cap message count, queued bytes, delivery rate, and model-visible inbox bytes;
  • peers that exceed quota or trigger repeated quarantine are rate-limited or muted until supervisor review;
  • unknown senders, stale tasks, malformed payloads, and policy-incompatible requests are quarantined for supervisor review;
  • artifact references require separate artifact caps before content is read;
  • requested actions become proposed tool calls or task changes, never automatic execution;
  • requested authority becomes an approval request, never ambient delegation;
  • interrupts and approval requests may receive priority, but still pass through policy and audit;
  • every delivered message carries sender, task, and causal-parent metadata so a worker can distinguish user intent, supervisor instruction, peer status, and untrusted external input.

This gives agents the useful parts of chat messages from other agents without making chat an authority channel. It also gives the scheduler a place to surface shared-resource events such as “another worker claimed this path”, “your todo item changed”, or “merge queue rejected your patch”.

Tool Harness Controls

capOS should support the same classes of controls as current harnesses, but with capability-native semantics:

Tool classDesktop harness patterncapOS target
File readworkspace-relative reads, memory readsdirectory/file caps with line-range and byte-budget policy
File write/editdirect edits or patch toolPatchSet plus approval, or write cap scoped to scratch/outbox
Shell/exechost/sandbox/node, allowlist/full, approvalsCommandRunner cap with binary caps, argv schema, cwd cap, env cap, PTY cap, timeout, output cap
BrowserCDP profile, snapshots, action refs, screenshotsBrowserSession cap with profile isolation, origin policy, JS-eval deny by default, screenshot/snapshot separation
Web/fetchprovider-specific toolHttpEndpoint / Fetch caps scoped by origin, method, headers, and data labels
Modelprovider API key or local modelLanguageModel cap from broker, no provider secret strings
Memorymarkdown files plus search pluginAgentMemory cap with source/wiki/index/search subcaps
Agent-to-agentsession send/spawn, A2A-like messagesAgentPeer endpoint with message schema, no implicit authority transfer

Execution policy modes should reuse the LLM proposal’s auto, consent, stepUp, and forbidden modes, but attach them to typed capability methods and task phases. A tool may be auto during read-only research and consent when called from a mutating phase.

Browser Harness

Browser automation is high-risk because logged-in web state, screenshots, and page JavaScript collapse many trust boundaries. A capOS browser harness should:

  • launch a dedicated browser profile per task or per approved long-lived agent;
  • keep personal/operator browser profiles out of scope by default;
  • expose snapshots and screenshots as separate capabilities;
  • require explicit policy for JavaScript evaluation;
  • bind every action to a prior snapshot ref when possible;
  • treat page text, DOM, screenshots, downloads, and clipboard data as hostile;
  • block private-network and metadata-service fetches unless broker policy grants them;
  • isolate cookies and credentials by profile cap;
  • make remote CDP-style control a future bridge, never the baseline.

The first QEMU proof should use a deterministic fake browser tool, not a full Chromium port.

Exec Harness

The first exec surface should not be a Unix shell. It should be a command capability with explicit shape:

interface CommandRunner {
  run @0 (req :CommandRequest) -> (result :CommandResult);
}

The request should name a pre-granted program or command class, not arbitrary shell text. If a POSIX layer later exists, shell execution can be a separate high-risk tool with parsing, approval, and audit.

Minimum controls:

  • allowed program identity is resolved before execution;
  • argv is structured, not interpolated;
  • environment is built from allowlisted variables and typed secret handles;
  • working directory is a WorkspaceRoot or subdirectory cap;
  • output byte and line limits are mandatory;
  • timeout and kill semantics are mandatory;
  • background processes require an explicit ProcessSession cap;
  • PTY is a separate grant;
  • network access is absent unless the child receives a network cap;
  • mutating commands require approval unless the task owns the target scratch or patch workspace.

Memory, Wiki, and Retrieval

Karpathy’s LLM Wiki pattern is a better fit for capOS than an unstructured vector database as the primary memory. The design has three layers:

  • immutable raw sources;
  • an LLM-maintained markdown wiki of summaries, entity pages, concept pages, comparisons, and synthesis;
  • a schema/instruction file that defines page layout, ingest, query, lint, and update conventions.

The useful operations are:

  • Ingest: read a source, write or update wiki pages, update index, append log.
  • Query: read the index, inspect relevant pages, synthesize an answer with citations, optionally file useful answers back into the wiki.
  • Lint: find contradictions, stale claims, orphan pages, missing links, weak citations, and data gaps.

capOS should implement this as a service rather than only as files:

  • SourceCorpus: immutable source handles with digest, label, owner, and TTL.
  • WikiPage: generated markdown plus source citations and confidence status.
  • WikiIndex: content-oriented page catalog, cheap enough for the agent to read first.
  • WikiLog: append-only operation timeline.
  • WikiLint: typed findings for contradictions, missing citations, stale pages, orphan pages, and access-label drift.
  • SearchIndex: optional BM25/vector hybrid index over approved pages and source chunks.

OpenClaw’s memory docs are a practical baseline: markdown is the source of truth, daily logs and curated MEMORY.md are separate, semantic search returns bounded snippets with file and line ranges, indexes are per-agent, and local embeddings can avoid remote leakage. capOS should add hard provenance, labels, and write authority.

Retrieval Rules

  • Retrieval returns bounded snippets, not whole private files by default.
  • Every synthesized claim that leaves the task should carry source links or be marked uncited.
  • Wiki pages inherit the maximum confidentiality label of their sources unless a trusted redaction step lowers it.
  • Memory writes require a policy decision: transient task log, project wiki, user memory, or rejected.
  • Cross-agent memory access is explicit. A reviewer can read task artifacts without inheriting private user memory.
  • Remote embedding backends are denied for high-label memory.

Schema-Guided Reasoning

Abdullin’s Schema-Guided Reasoning pattern is directly useful for capOS: force the model to fill typed intermediate structures in a known order, validate them, and test them. It is not a substitute for capability policy, but it is a good harness technique for bounded agent roles.

Use SGR for:

  • task intake: classify objective, risk, needed capabilities, and missing clarifications;
  • plan decomposition: produce sub-tasks, dependencies, verification gates, and rollback paths;
  • tool-call review: explain why a call is necessary and what authority it touches before approval;
  • source ingest: extract claims, citations, contradictions, and affected pages;
  • code review: enumerate behavioral risks, security risks, tests, and residual uncertainty;
  • final handoff: summarize artifacts, verification, open risks, and memory updates.

Each schema should be a Cap’n Proto or JSON-schema-like type with versioning, test fixtures, and guardrails. The runner should validate the structure before any action, and failures should become ordinary tool results rather than hidden prompt retries.

Swarm Patterns

MetaGPT / Role Pipelines

MetaGPT’s useful contribution is not the specific software-company metaphor. It encodes standard operating procedures into prompt sequences and assigns roles so intermediate artifacts can be verified. capOS should borrow the artifact gates:

  • product/task brief;
  • requirements and constraints;
  • design sketch;
  • implementation plan;
  • implementation;
  • tests and verification;
  • review;
  • release/handoff.

Do not hard-code “PM”, “architect”, and “engineer” as kernel concepts. They are runner roles backed by schemas, caps, and task state.

Smallville / Generative Agents

The Generative Agents paper is useful for long-lived NPCs, companion agents, and simulations. Its memory stream, reflection, and planning loop explains how agents can appear coherent over time. capOS should use it cautiously:

  • good for adventure NPCs, training simulations, social workflows, and explainable daily plans;
  • bad as a direct authority model because believable behavior is not safe behavior;
  • memory/reflection outputs must be low-authority data until reviewed or compiled into a scoped wiki.

Gas Town / Durable Agent Work

Gas Town’s useful pattern is persistent orchestration: roles, durable work objects, attribution, worker lifecycles, worktrees, convoys, merge queues, and supervision. capOS should borrow:

  • one task object per unit of work;
  • explicit worker lifecycle classes: persistent worker, ephemeral worker, reviewer, supervisor;
  • task-local worktrees or namespace forks;
  • merge/release queues;
  • per-action attribution and track record;
  • handoff records when an agent loses context or is recycled.

capOS should not borrow the role vocabulary or assume git is the only state substrate. For code work, git/worktrees are excellent. For OS services, the same pattern should map to AgentTask, PatchSet, Artifact, and ReviewFinding capabilities.

Interoperability

MCP

MCP is a useful external compatibility layer for tools, resources, and prompts. Its architecture is JSON-RPC over stdio or HTTP, with client/server capability negotiation and primitives for tools, resources, prompts, sampling, elicitation, logging, and experimental tasks.

capOS should treat MCP as an adapter boundary:

  • an MCP server can be hosted as a low-authority process behind a capOS tool proxy;
  • an MCP client can import external tools only after broker review;
  • MCP tool descriptors are translated into capOS ToolDescriptor values;
  • MCP tool calls execute through runner policy, not directly from the model;
  • stdio MCP servers run without ambient filesystem/network unless granted caps;
  • remote MCP uses HttpEndpoint plus explicit auth/token caps;
  • MCP sampling/elicitation must not bypass runner approval or user-presence policy.

The risk is tool-marketplace sprawl: tools with similar names, hidden network behavior, local process execution, and prompt-injection-sensitive resources. capOS should require provenance, signing, version pinning, permission review, and sandboxed execution for imported MCP servers.

A2A / Agent-to-Agent

A2A is the right primary protocol reference for cross-agent interoperability: agent cards, peer discovery, modality negotiation, task collaboration, text, files, structured data, and streaming or push delivery. The first capOS bridge should still be narrower than the full protocol surface:

  • AgentPeer.describe() returns identity, capabilities, cost, labels, and accepted task/message schemas.
  • AgentPeer.send() imports a task or message into AgentInbox with no authority transfer.
  • AgentPeer.artifact() returns content only through an explicit export cap.
  • Authentication and authorization are broker-mediated.
  • Remote agents are untrusted services, not session principals.

Raw capOS caps should not cross an A2A bridge. A remote agent receives data, message events, and artifact references, not authority. Agent-card capabilities map to descriptors that the broker can review; they do not imply tool access inside capOS.

Security Model

Primary threats:

  • prompt injection through web pages, tool results, logs, email, chat, or memory pages;
  • malicious or compromised tools, skills, MCP servers, browser extensions, and model adapters;
  • workspace escape through shell, filesystem, browser profile, CDP, downloads, or path tricks;
  • secret exposure through prompts, tool results, screenshots, logs, memory, or remote embeddings;
  • authority widening through agent-to-agent delegation;
  • stale or poisoned memory becoming trusted context;
  • runaway cost, process count, token use, or network use;
  • false completion: agent claims work is done without verifying artifacts;
  • review capture: same model/harness family produces work and review without independent checks.

Controls:

  • exact-grant worker capsets;
  • task-local workspaces and quotas;
  • no ambient filesystem, network, process, browser, or secret access;
  • structured tool descriptors and argument validation;
  • per-tool auto / consent / stepUp / forbidden policy;
  • fresh user presence for mutating/destructive calls;
  • audit for every authority-touching action;
  • source labels and memory provenance;
  • deterministic verification tools where possible;
  • independent reviewer roles with read-only caps;
  • expiry and revocation for tasks, workers, browser profiles, model streams, and provider tokens.

Resource Accounting

Hosted agents need first-class quotas:

  • model input/output tokens;
  • remote provider spend;
  • wall-clock runtime;
  • process count and threads;
  • memory and workspace bytes;
  • source corpus bytes;
  • vector index bytes;
  • browser sessions and tabs;
  • network requests and egress bytes;
  • tool-call count by risk class;
  • inbox message count, queued bytes, delivery rate, and replay-window entries;
  • quarantined peer-message count by sender and task;
  • approval prompt count to prevent consent fatigue.

Budgets belong to AgentTask and are enforced by the runner, broker, and resource ledgers. A worker cannot extend its own budget. Budget extension is a broker or user action.

Implementation Phases

Phase 0 - Research and design grounding

  • Write targeted research notes for OpenClaw harness controls, MCP security, A2A, Gas Town orchestration, LLM Wiki memory, and browser automation risk.
  • Decide which parts belong in capOS core versus a sibling capos-agent-shell repository.
  • Define the minimum QEMU-hosted deterministic model and fake browser/exec tools needed for proof.

Phase 1 - Single hosted task, deterministic model

  • Add HostedAgentService, AgentTask, AgentRunner, and deterministic LanguageModel test service.
  • Create task workspace caps over existing storage primitives or a temporary in-memory substitute.
  • Implement a read-only tool and a mutating fake tool with approval.
  • Add make run-hosted-agent QEMU proof.

Phase 2 - Memory and wiki substrate

  • Add AgentMemory with source, wiki, index, log, and lint concepts.
  • Implement markdown-backed storage first.
  • Add bounded retrieval by page and line range.
  • Add source citations and label inheritance.
  • Prove ingest, query, lint, and memory write rejection under policy.

Phase 3 - Tool harnesses

  • Add structured CommandRunner without arbitrary shell.
  • Add PatchSet for file edits.
  • Add fake browser harness, then later real browser integration outside the kernel path.
  • Add MCP import behind a tool-proxy policy review.

Phase 4 - Swarm scheduling

  • Add durable subtask records and worker assignment.
  • Add ephemeral worker processes with exact-grant capsets.
  • Add reviewer workers with constrained caps.
  • Add merge/release queue semantics for artifacts.
  • Prove cancellation, worker timeout, handoff, and review failure.

Phase 5 - External ingress and providers

  • Wire WebShellGateway agent task submission.
  • Add webhook and scheduled trigger caps.
  • Add provider-token caps and remote model backend policy.
  • Add remote MCP/A2A adapters.
  • Add browser direct-provider mode only after server-side tool execution and provider-session revocation/audit are implemented.

Phase 6 - Applications

  • Hosted coding assistant over capOS repository worktrees.
  • Agent-assisted first-boot setup.
  • Agent-maintained operator/project wiki.
  • Aurelian Frontier NPCs and story-world workers.
  • Monitoring/log investigation assistant.
  • Personal assistant over approved chat/email/calendar adapters.

Open Questions

  • Should hosted agents live in this repository or a sibling capos-agent-shell repository once the capability interfaces stabilize?
  • What is the minimum storage substrate for AgentMemory before persistence and file-backed MemoryObject are complete?
  • Should the first command harness support any shell syntax, or only structured program+argv invocations?
  • How should capOS represent browser state: as a task-local profile cap, service-owned profile cap, or user-owned delegated profile cap?
  • Which memory writes require human review before becoming long-term memory?
  • How should labels propagate from raw sources through wiki summaries, embeddings, and model prompts?
  • What is the right review independence policy when the same model provider is used for implementation and review?
  • How should agent track record be measured without overfitting to easy tasks or encouraging unsafe autonomy?
  • How should A2A/MCP imported tools be signed, pinned, reviewed, and revoked?
  • What should be exposed in audit by default when prompts or tool outputs carry private content?
  • How should hosted agents behave when session context expires while a task is mid-run?
  • Can capOS use promise pipelining or notification objects to reduce tool-call latency without weakening approval gates?
  • What formal properties should be specified for “model cannot acquire new authority except through broker-approved tool calls”?
  • Which local embedding model is good enough for offline wiki search without adding unacceptable ISO size or trusted-build-input burden?
  • What should be researched for secure, deterministic browser automation in a capability OS?

Relationship to Existing Proposals

  • Shell: defines the native shell and agent mode as one interactive runner surface. This proposal defines long-lived hosted agents and swarms that may be launched from shell but are not part of shell itself.
  • Language Models and Agent Runtime: defines LanguageModel, TextEmbedder, model backends, and the basic tool-use loop. This proposal layers hosted task state, workspaces, memory, swarms, and external interoperability on top.
  • Service Architecture: defines the capability-based service composition, authority-at-spawn rule, and service graph policy that HostedAgentService, AgentRunner, AgentMemory, SwarmScheduler, and worker processes must follow. Hosted agents are an ordinary userspace service graph under this model, not a privileged subsystem, and worker capsets are minted through the same broker and exact-grant primitives.
  • Cloud Deployment: describes the cloud VM surface (provider storage/NIC drivers, cloud clocking, instance bootstrap, imported-image boot) that future hosted-agent ingress, model-backend egress, and persistent memory storage will run on top of once the userspace DeviceMmio/DMAPool/Interrupt authority gate and provider drivers exist. The QEMU Phase 1 proof remains the development surface until cloud deployment is production-ready.
  • Realtime Voice Agent Shell: voice sessions can submit hosted-agent tasks or control a live runner, but media transport remains separate.
  • Repository Composition: the runtime, providers, browser harnesses, and skills may eventually belong in a sibling repository; the capOS core keeps capability interfaces and authority policy.
  • System Monitoring: hosted agents need audit, trace, status, and cost views.
  • Resource Accounting and Quotas: hosted agents are a forcing function for token, provider, workspace, process, and network ledgers.
  • User Identity and Policy: session profile, guest/operator policy, step-up, and expiry decide agent authority.

Research Still Needed

  • OpenClaw threat model from primary advisories, not news summaries: gateway exposure, node hosts, skills, browser profiles, exec approvals, memory, and provider credentials.
  • MCP security: stdio process spawning, remote auth, tool poisoning, prompt injection, marketplace signing, and per-tool permission descriptions.
  • A2A security and identity: authentication, authorization, task provenance, artifact integrity, and non-transfer of authority.
  • Browser automation containment: CDP risks, extension relays, logged-in profiles, downloads/uploads, arbitrary JS evaluation, clipboard, screenshots, and private-network access.
  • Agent memory correctness: citation fidelity, contradiction detection, stale summaries, label propagation, hallucinated links, and human review workflow.
  • Retrieval architecture: index-first wiki navigation versus vector RAG, hybrid search, reranking, snippet budgets, local embeddings, and remote embedding denial for high-label data.
  • Swarm orchestration: when parallel agents improve throughput, when they create coordination debt, how to assign work, and how to prevent review capture.
  • Evals: deterministic task harnesses for tool calls, memory ingest, prompt injection, browser tasks, code edits, review quality, and resource budget enforcement.
  • Local model viability: smallest model that can follow schemas/tool calls, local embedding model choice, quantization, context budget, and ISO/storage impact.
  • Provider policy: data-retention settings, regional routing, ephemeral credentials, revocation, spend controls, and audit of remote inference.
  • Formal authority model: prove that model text, memory text, remote agent messages, and MCP descriptors cannot mint capOS authority.
  • UX for approvals: avoiding consent fatigue while preserving fresh user presence for dangerous actions.
  • Agent-maintained docs: how capOS should use its own proposals, backlog, research notes, and wiki artifacts as agent-legible harness inputs without making stale generated docs authoritative.

Proposal: Enterprise Agent Game Showcase

capOS should showcase itself as an agent-managed operating system for enterprises and businesses through a playable business simulation. The demo should look like a factory, supply-chain, and market game, but its purpose is not to make capOS a game OS. Its purpose is to make enterprise agent authority concrete: every agent action should have an identity, an explicit capability, a policy reason, an audit record, and a business consequence.

The product thesis is:

Enterprise agents should not be trusted because they are smart. They should be useful because the operating system constrains what they can see, spend, modify, approve, and execute.

The game is the explanation surface for that thesis. A player starts with a small manual business, delegates work to agents, grants and revokes authority, reviews logs, handles disruptions, and scales into a multi-product enterprise. The mechanics should demonstrate why OS-enforced authority is stronger than application-local prompt discipline.

The same artifact should also be an experiment. The research question is not “can agents run the world?” The bounded question is: when agents are given limited authority inside a realistic business simulation, what can they manage, where do they fail, and which OS controls prevent failures from becoming damage? capOS is the right place to ask that question because it can constrain agents, record their actions, revoke authority, replay scenarios, and compare policies under identical operating pressure.

Why A Game

Enterprise agent safety is hard to understand from a static dashboard. A game turns abstract controls into visible operational pressure:

  • a procurement agent cannot buy steel unless it holds a bounded purchasing capability;
  • a finance agent can approve spend within policy, but cannot reschedule production;
  • an operations agent can schedule a factory line, but cannot issue debt;
  • a compliance agent can inspect and flag audit events, but cannot execute trades;
  • revoking an agent capability immediately changes what the agent can do;
  • policy denials are visible as missed orders, delayed production, or avoided risk.

The player learns the enterprise model by feeling the delegation tradeoff: more agent autonomy increases speed and scale, but authority limits, approval rules, budgets, and audit trails keep the business survivable.

The demo should be serious in framing even when the mechanics are approachable. The headline is not “capOS has a factory game.” The headline is “capOS runs business agents under OS-enforced authority.”

This proposal is a sibling of Aurelian Frontier, which uses the same “capability is the game mechanic” thesis for a player-facing roguelike MUD about delegated authority among humans and NPCs. Both proposals share the underlying claim that authority, revocation, and audit can be felt by a player rather than only read in a checklist; they differ in audience and surface. Aurelian Frontier targets contributors, narrative players, and authority intuition. The enterprise agent game targets enterprise buyers, agent-safety researchers, and capability-shape evaluation under repeatable business pressure. Where the two proposals overlap on shared mechanics (authority-as-inventory, revocation, audit-as-evidence), the implementation work should reuse capOS services rather than fork parallel game-only machinery.

Showcase Story

The first showcase should be a small manufacturing company that grows from a manual workshop into an agent-managed enterprise:

  1. The player manually makes and sells a simple product.
  2. A customer order creates demand beyond manual throughput.
  3. The player hires or enables a procurement agent.
  4. The procurement agent requests supplier quotes but cannot spend yet.
  5. The player grants a bounded purchasing capability.
  6. The finance agent approves a purchase within budget.
  7. The operations agent schedules production.
  8. The logistics agent books delivery.
  9. A supply disruption or demand spike creates a bottleneck.
  10. Agents propose actions, escalate where policy requires approval, and leave an audit trail.

The core demo moment should be revocation. A player should be able to run a command or UI action equivalent to:

revoke procurement-agent market.purchase

The next attempted purchase should fail with an explanation shaped like:

Denied: procurement-agent lacks capability market.purchase.
Policy: purchases over $5,000 require finance approval.

That is the capOS proof: the agent did not merely “decide” to obey policy. The OS denied the authority path.

World Model

The simulation world should be built from simple business primitives:

  • Good: wire, steel, packaging, batteries, electronics, robots, fuel, software licenses, compute credits, finished products.
  • Facility: workshop, factory, warehouse, mine, refinery, power plant, data center, retail channel.
  • Recipe: input goods, output goods, time, energy, labor, machine wear, waste, and failure probability.
  • Inventory: stock on hand, reserved stock, damaged stock, in-transit stock.
  • Transport: trucks, rail, shipping lanes, drones, pipelines, network bandwidth, and delivery delays.
  • Company: cash, inventory, facilities, contracts, debt, shares, employees, and agents.
  • Market: spot order book, supplier quotes, futures contracts, capacity auctions, labor market, recruiting market, and stock exchange.
  • Contract: delivery obligation, deadline, price, penalties, escrow, and counterparty identity.
  • Policy: budget rules, approval thresholds, supplier restrictions, risk limits, compliance rules, and emergency overrides.
  • Agent: a bounded actor with a role, model/backend, memory scope, budget, capabilities, audit identity, employment state, and career history.

Paperclips can remain the tutorial product because it is familiar and has a clear compounding curve. The broader world should add products and supply chains that make enterprise delegation meaningful:

ore -> steel -> wire -> paperclips
oil -> plastic -> packaging
energy -> factory runtime
silicon -> chips -> robots -> automated factories
lithium -> batteries -> electric trucks -> cheaper logistics
data center capacity -> forecasting -> better procurement decisions

The first implementation should not try to simulate every industry. It should start with a small number of goods and constraints that force real decisions: inventory, price, delivery time, factory capacity, and budget.

Agent Roles

Agents should be business roles, not generic chat personalities. Each role should operate through typed capabilities:

AgentTypical capabilitiesExplicit non-authority
Procurementread inventory, request quotes, buy approved inputscannot approve new suppliers without policy
Financeread cashflow, approve spend, freeze budgetscannot schedule production
Operationsschedule lines, reserve inventory, request maintenancecannot borrow money
Logisticsbook transport, reroute shipments, reserve warehouse spacecannot change product prices
Salesaccept orders, set prices within bounds, offer discountscannot waive compliance holds
Complianceread audit logs, flag violations, require approvalcannot execute purchases
Executiveset strategy, delegate caps, approve exceptionscannot bypass immutable audit
Incidentinspect disruptions, recommend response, trigger runbookscannot exceed emergency grants

The important design rule is that agents act through capabilities and policy checks. A procurement agent does not mutate inventory or cash directly. It submits a quote request, a purchase order, or a contract offer to a service that enforces authority.

Experiment Mode

The showcase should have an experiment mode alongside the player-facing game. In this mode, the same scenario can run under different control regimes:

  • human-only operation;
  • scripted deterministic agents;
  • LLM-backed agents with the same capability limits, recorded prompts, and captured tool-call transcripts;
  • mixed human approval with agent execution;
  • different policy bundles for spend, supplier risk, credit, logistics, and emergency response;
  • different compensation, promotion, retention, and recruiting policies.

The goal is to observe behavior under repeatable pressure, not to crown an agent as generally competent. Each run should preserve scenario seed, policy configuration, model/backend identity, granted capabilities, denied actions, human approvals, market events, and final business outcomes.

Replay should distinguish deterministic proof from experiment reconstruction. Scripted or fake-model agents can be replayed deterministically in QEMU. Live LLM-backed runs are not deterministic merely because the scenario seed and model name are recorded; they require prompt, model configuration, tool-call transcript, tool results, and policy decisions to reconstruct what happened. The audit record can replay the authorized state transitions even when it cannot reproduce the model’s private sampling path.

Useful research questions include:

  • Can agents coordinate across procurement, finance, operations, logistics, and compliance without a central omniscient controller?
  • Do procurement agents over-optimize input price while ignoring resilience, supplier concentration, or delivery risk?
  • Do finance agents become too conservative, too leveraged, or too willing to hedge with instruments they do not understand?
  • Do logistics agents find useful reroutes under disruption, or do they churn capacity and increase cost?
  • Do market-facing agents create bubbles, shortages, or arbitrage loops when multiple companies operate in the same scenario?
  • Which policy controls reduce catastrophic behavior without making agents slower than manual operation?
  • How often does useful autonomy require human approval, and where should approval thresholds move?
  • Does a readable audit trail let a human correct agent behavior faster after a bad decision?
  • Which capability boundaries are too broad, too narrow, or hard to explain?
  • Do agents improve with role tenure, or do they stagnate without promotion, rotation, retraining, or better tooling?
  • Can companies retain high-performing agents without granting excessive authority or compensation?
  • What happens when an agent leaves a company with private memories, ongoing tasks, or delegated authority?

The output should be an experiment record, not just a final score:

scenario: lithium-port-shock
controller: llm-procurement + scripted-finance + human-approval
policy: procurement-v2-tight-supplier-risk
profit: $42,300
orders_late: 3
denied_actions: 8
human_approvals: 5
policy_violations: 0
agent_turnover: 1
recovery_time: 4 days
audit_replay: available

This turns the game into a controlled lab for enterprise agent management. The claim stays conservative: capOS is not asserting that agents can safely manage businesses by default. capOS provides the operating environment for finding out, because agent behavior is constrained, observable, replayable, and comparable.

Metrics

Experiment mode should report business, safety, and operating-system metrics:

  • profit, cashflow, debt, inventory turns, and margin;
  • order fill rate, late orders, cancellation penalties, and recovery time;
  • resilience under shocks, including supplier concentration and fallback capacity;
  • policy denials, escalations, approvals, emergency overrides, and revocations;
  • hiring latency, agent turnover, promotion rate, compensation cost, and vacancy impact;
  • audit completeness: whether every material state transition has identity, capability, policy, and result;
  • agent cost: model calls, runtime, memory, tool invocations, and human review time;
  • reproducibility: scenario seed, input dataset provenance, policy version, and model/backend version.

The most important metric is not raw profit. A profitable run that bypasses policy or cannot be explained is a failed capOS demonstration. A slightly less profitable run with clear authority, bounded losses, and fast human correction is more valuable for the enterprise story.

Experiment Data Prerequisites

Experiment mode needs data capture before it can make useful claims. The first slices should build the capture substrate before adding sophisticated agent behavior:

This substrate should compose with Capability-Native System Monitoring, not replace it. Logs, metrics, lifecycle events, traces, health, crash records, and audit entries remain separate signal classes with separate reader caps, retention rules, payload-capture rules, and security properties. The enterprise simulation should add domain-specific event schemas and reducers on top of that monitoring model rather than creating a second global logging namespace.

  • Scenario manifest: immutable scenario id, seed, authored constants, calibrated-data references, policy bundle, controller regime, and expected proof assertions.
  • Run record: run id, capOS build id, content version, scenario manifest hash, model/backend identity, tool schema version, policy version, and clock range.
  • Event schema: domain events for grants, revocations, policy decisions, tool calls, service calls, market clears, contract changes, inventory movements, labor events, approvals, denials, and business outcomes. These are not debug logs; they are typed lifecycle/business events suitable for reducers and scoped readers.
  • Transcript capture: prompts, model parameters, structured tool calls, tool results, user approvals, refusals, and interrupts for LLM-backed runs. This is trace-like payload capture and therefore needs stronger authority, short retention by default, size budgets, and redaction. Secret handles, credentials, key material, bearer tokens, and vault outputs must not enter transcripts.
  • State snapshots: bounded checkpoints for ledger, inventory, contracts, facilities, HR records, market books, scenario clocks, and agent worker status. Snapshots must store opaque secret references or denial summaries, never credential bytes or key material.
  • Metric extraction: deterministic reducers that compute profit, recovery time, policy denials, late orders, turnover, capability churn, and audit completeness from events rather than from ad-hoc terminal text. Published metrics should be low-cardinality counters, gauges, histograms, or bounded opaque typed payloads consistent with the monitoring proposal.
  • Provenance tags: every scenario input is labeled as authored, calibrated public data, operator-provided data, or simulated output.
  • Privacy and disclosure policy: experiment exports must redact company-confidential memory, private tool outputs, and raw audit details unless the holder has an explicit reader capability. Payload capture is exceptional, and reading experiment records is authority. Redaction is a backstop, not the secret-handling mechanism.
  • Replay boundary: the system records whether a run is deterministic, transcript-reconstructable, or only auditable as an authorized sequence of state transitions.
  • Export surface: an ExperimentRecord or similar read capability exposes summaries, metrics, provenance, and redacted event streams without granting write authority over the simulated company.
  • External analytics export: a scoped exporter may forward selected, redacted experiment events and metric summaries to outside analytics stores. A Vector-like event pipeline and a ClickHouse-like analytical database are likely candidates, but they are adapters, not architectural requirements and not sources of authority.
  • Loss and retention accounting: ingestion queues, transcript stores, and event streams should be bounded. Dropped, suppressed, redacted, or truncated records should be counted and visible in summaries, because missing evidence changes what conclusions a run can support.

These prerequisites fit the capOS process model: each captured fact should be owned by a service, exposed through a typed reader capability, and governed by policy. The experiment should not rely on scraping terminal output or trusting the model’s self-report. If an experiment result cannot be derived from service-owned event records and reproducible reducers, it should not be used as evidence.

The mapping to monitoring signal classes should be explicit:

  • business state changes are domain events;
  • capability grants, revocations, disclosure decisions, approvals, and denials are audit records;
  • profit, late orders, policy-denial counts, queue depth, model-call counts, and dropped-record counts are metrics;
  • prompt/tool-call transcripts are traces with explicit payload-capture authority;
  • scenario readiness, agent-worker readiness, and service degradation are health/status facts;
  • process failures and reducer crashes are crash records and may also create security-relevant audit entries.

This preserves the monitoring proposal’s core rule: observation is authority. There should be no global experiment dashboard that silently bypasses scoped log, metric, trace, audit, or status readers.

External export should be modeled as an ordinary capOS service. It receives only the scoped reader capabilities and network endpoint capabilities granted to it, applies redaction before data leaves capOS, records export failures and dropped records, and emits audit entries for export policy changes. Exported rows should carry run id, scenario id, build id, event schema version, provenance tag, redaction policy, source service, and event type. Data imported back from an external analytics store is untrusted analytical input; it cannot mutate simulated business state or grant authority without passing through a normal capOS service interface and policy decision.

Capability Shape

The showcase should make capability boundaries visible. Example capabilities:

company.inventory.read
company.cash.read
company.cash.spend(limit: $5,000, category: inputs)
market.steel.quote
market.steel.buy(limit: $5,000)
contract.offer.create
contract.offer.accept
factory.line.schedule
warehouse.reserve
transport.book
audit.read
policy.exception.request

Capabilities should be revocable, scoped, and inspectable. The player should be able to answer four questions for every agent:

  • What can it see?
  • What can it spend?
  • What can it change?
  • What requires human or higher-role approval?

This is the difference between an agent demo and an enterprise OS demo. The model is not the security boundary. The capability graph is.

Market And Finance Mechanics

The simulation should include markets because markets create pressure that static workflows cannot:

  • spot markets for immediate goods;
  • supplier quotes with limited validity;
  • futures contracts for hedging inputs;
  • capacity markets for factory time, shipping space, compute, and energy;
  • credit markets for loans and bonds;
  • stock markets for company ownership and acquisition pressure.

Finance should matter without becoming the whole game. A company should have a balance sheet:

assets = cash + inventory + facilities + receivables
liabilities = debt + payables + penalties
equity = assets - liabilities

Agents can then make meaningful but bounded decisions:

  • finance approves borrowing to build a factory;
  • procurement hedges steel prices with a futures contract;
  • sales discounts inventory to improve cashflow;
  • the executive issues shares to fund expansion;
  • a competitor’s stock falls after a supply-chain failure;
  • compliance blocks a profitable but restricted supplier.

The point is not financial realism for its own sake. The point is to show that enterprise agents need typed authority over money, contracts, and risk.

Fit With The capOS Model

This proposal should stay faithful to capOS rather than building a generic simulation with capOS branding. The game mechanics should be concrete examples of existing capOS design principles:

  • Authority at spawn: an agent starts with no ambient business authority. Hiring, promotion, transfer, and emergency delegation create named capability grants. If a procurement agent was not granted market.steel.buy, it cannot buy steel.
  • The interface is the permission: business verbs are typed capability interfaces, not strings parsed by a god simulation object. MarketQuote, PurchaseOrder, FactoryLine, BudgetApproval, EmploymentContract, and AuditReader should be separate narrow surfaces.
  • Session context identifies the actor: the process/session running an agent supplies invocation context. A normal agent runner must not multiplex several active agent identities inside one process and switch authority with an employee_id field. The default shape is one worker process/session per active agent employment or task. If a future pooled runner is needed, it must expose explicit service-local actor facets minted by broker or HR policy and audited as separate authority-bearing facets. Request payloads such as employee_id, role, or department are data to validate, not caller identity or authority.
  • Service-owned state: markets, ledgers, HR records, factories, contracts, inventory, and audit logs own their state. Agents submit requests through capabilities; they do not mutate company state directly.
  • Revocation is operational: offboarding, demotion, policy breach, budget freeze, or incident response must revoke or replace live capabilities, not merely set an in-game flag.
  • Least privilege is visible: the UI should show the exact caps an agent holds and which action each cap enables. This keeps the demo anchored in the capability graph.
  • Audit is not flavor text: every material state transition should record actor session, invoked capability, policy decision, request, result, and resulting business state delta.
  • Policy is a service boundary: budget limits, supplier restrictions, promotion rules, disclosure controls, and emergency overrides should be enforced by broker/policy services before capabilities are granted or calls are accepted.
  • Capability mobility is explicit: agents changing companies can receive portable skill or career artifacts only through an owning service such as HRService, AgentMemory, or a credential service. Company-confidential memory and company caps do not follow them unless a service explicitly grants a portable artifact under a disclosure scope and regrant policy.
  • Secrets are not memory: credentials, keys, bearer tokens, signing authority, cloud credentials, and other secrets are opaque secret/key-vault capabilities or handles. They are invoked through narrow interfaces and are never copied into agent memory, snapshots, transcripts, reducers, exports, or portable artifacts.
  • No ambient filesystem or database shortcut: the simulation should not grow a global mutable object that every agent can inspect. Each read or write path should correspond to a capability that can be granted, denied, audited, replayed, and revoked.

The implementation process should mirror normal capOS proof style. Add one capability surface at a time, prove its denial and success paths in QEMU, and keep deterministic text output until richer clients can consume typed status. For example, the first HR slice should not simulate all careers. It should prove that hiring grants a bounded role capability, promotion requires a policy decision, and offboarding revokes the capability while preserving audit and pending-work continuity.

This discipline is what makes the game useful as an enterprise OS showcase. The game world supplies pressure; capOS supplies the enforced authority model.

Operating-System Services

The game should be implemented as a set of capability-scoped services rather than one monolithic simulation:

  • WorldClock: advances simulation time and scheduled events.
  • Ledger: authoritative ownership, cash, debt, and accounting records.
  • InventoryService: stock levels, reservations, and transfers.
  • FacilityService: factory lines, recipes, maintenance, and output.
  • MarketService: order books, quotes, and clearing.
  • ContractService: obligations, escrow, penalties, and counterparty status.
  • TransportService: routing, capacity, and delivery events.
  • PolicyService: approval rules, spend limits, restricted suppliers, and emergency overrides.
  • HRService: artificial-agent hiring, engagement contracts, compensation terms, evaluations, promotions, transfers, departures, termination, and offboarding.
  • AgentMemory: owns scoped memory stores, portable skill artifacts, confidential company memory, and disclosure/regrant policy for agent mobility.
  • AgentRunner: spawns or supervises agent worker processes/sessions with the granted capabilities for one active agent employment or task, or a future audited actor-facet equivalent.
  • AuditLog: records every material action, denial, approval, and delegation.
  • ScenarioService: injects demand spikes, supply shocks, incidents, and tutorial events.
  • ExperimentRecordService: owns scenario manifests, run records, domain event streams, metric reducers, provenance tags, and redacted exports while composing with the ordinary log, metric, trace, audit, health, and crash signal services.
  • ExperimentExportService: optionally forwards scoped, redacted experiment records to external analytics systems such as Vector-like pipelines or ClickHouse-like stores, using explicit network and reader capabilities.
  • OperatorConsole: text, web, or later graphical surface for the player.

This service split is not just architecture cleanliness. It lets capOS show that each business subsystem can grant a narrow interface instead of exposing a global application database.

The AgentRunner, AgentMemory, prompt-injection handling, tool-table construction, and broker/policy mediation described above are not new inventions for the enterprise game. They are the same surfaces specified by Language Models and the Agent Runtime: the agent runner is the native shell in agent mode (or the web agent mode hosted by WebShellGateway), the tool table is built from the typed capabilities the session holds, the loop state machine drives request/approve/execute/result cycles, and the conversation memory is plain data with no authority. This proposal narrows that general agent runtime to enterprise roles (procurement, finance, operations, logistics, sales, compliance, executive, incident) and adds business-domain services (HR, ledger, contracts, markets, audit) without changing the underlying runner contract. When the two proposals appear to disagree, the runtime mechanics from llm-and-agent-proposal.md win; the enterprise proposal restricts what the runner is allowed to do in a business scenario, not how it works.

HR And Agent Labor Market

Artificial agents should also participate in a labor market. In the enterprise framing, they are accountable digital workers rather than scripts: they have roles, engagement relationships, compensation terms, incentives, career-like history, and offboarding requirements. That makes delegation more realistic and creates a second-order experiment: whether companies can build durable organizations of artificial agents rather than just invoke single-purpose tools.

The HR layer should model:

  • job openings with role, seniority, compensation, capability bundle, and reporting line;
  • recruiting pipelines, offers, counteroffers, onboarding, and probation;
  • evaluations based on business outcomes, policy compliance, audit quality, and collaboration;
  • promotions that expand scope, budget, or approval authority only through an explicit grant;
  • lateral moves between departments when an agent’s skills fit a different bottleneck;
  • resignations, poaching, layoffs, burnout, retirement, and contract expiry;
  • offboarding that revokes company capabilities, closes pending approvals, and preserves required audit records.

Agent lifecycle should be bounded and enterprise-relevant. A simulated agent may have preferences such as compensation terms, autonomy, risk tolerance, mission fit, tool quality, deployment locality, reputation, and workload. Those preferences affect retention and performance. They should not become uncontrolled private fiction or a second game that distracts from enterprise authority.

An agent’s lifecycle might look like:

candidate -> hired -> onboarding -> junior procurement -> senior procurement
-> operations rotation -> VP supply chain -> recruited by competitor
-> offboarded with caps revoked and audit retained

This creates new business decisions:

  • hire an expensive senior logistics agent or train a junior one;
  • promote a procurement agent and grant larger spend authority;
  • split authority between two agents to reduce key-person risk;
  • retain a high-performing finance agent with compensation or better tools;
  • deny a promotion because audit quality is poor despite high profit;
  • handle a competitor poaching an agent with supplier-market expertise;
  • offboard an artificial agent without losing open contracts or leaking company state.

The capOS angle is explicit: engagement changes are capability changes. A promotion is not merely a title. It may grant broader read access, higher spend limits, approval authority, or the ability to delegate subordinate caps. A departure or termination must revoke live capabilities, transfer pending work, and preserve audit continuity.

Agent Memory And Mobility

If agents can change companies, memory boundaries become part of the game. The model should separate:

  • public skill: general learned competence, role experience, and tool-use ability represented by portable AgentSkill or certification artifacts owned by AgentMemory or a credential service;
  • portable career record: evaluation attestations, certifications, reputation summaries, compensation expectations, and preferences owned by HRService or a credential service and disclosed only through policy;
  • company confidential memory: supplier terms, internal forecasts, customer lists, private strategy, and pending contracts owned by a company-scoped AgentMemory or business service;
  • secret authority: credentials, keys, bearer tokens, cloud credentials, and signing authority represented as opaque vault or secret capabilities. Agents may hold or invoke a narrowed secret cap under policy, but the secret value is not memory and cannot become portable career data, transcript content, exported analytics data, or reducer input;
  • audit record: immutable company-owned evidence of actions taken while the agent held authority. Raw audit logs remain company records; portable reputation should be a redacted attestation, not cross-company audit access.

When an agent leaves a company, it should receive only the portable artifacts that an owning service regrants under policy. It loses company capabilities and company-confidential memory unless a service explicitly mints a scoped export. This makes confidentiality, knowledge-transfer, and offboarding policies concrete without pretending the simulation models real employment law.

Useful mechanics:

  • confidentiality cooling-off periods before an artificial agent can accept a direct-competitor engagement with portable artifacts enabled;
  • certification markets for agents trained in compliance, finance, logistics, or factory operations;
  • reputation markets where companies value redacted attestations derived from clean audit histories;
  • internal succession planning when one agent becomes a single point of operational failure;
  • mentoring or retraining that improves agent performance but consumes time, budget, and senior-agent attention.

The research question is direct: do agent organizations become more robust when agents have careers, incentives, and turnover, or does labor-market mobility expose weak authority boundaries?

Aurelian Frontier explores the adjacent question for human and NPC players through writs, authority archetypes, and delegation buildcraft. The enterprise game should reuse the underlying authority-as-portable-artifact idea where it is already proved out in the sibling proposal, rather than redesigning portable career artifacts from scratch. Mobility, regrant policy, cooling-off periods, and reputation attestations should resolve to the same capOS service shapes in both proposals; only the surface vocabulary (writs versus engagement contracts, reputation versus performance reviews) differs.

Real-Earth Model

The showcase can model real Earth, but only as a stylized operational sandbox. It should not claim to be a full-fidelity world-economy model, a forecasting engine, or a source of investment advice. The useful target is Earth-inspired realism: recognizable regions, industries, trade lanes, market concepts, currencies, logistics chokepoints, and policy shocks that make enterprise-agent authority problems concrete.

The simulation should use a fidelity ladder:

  1. Fictionalized Earth: real-world-inspired regions and supply chains, but no claim that data matches current markets.
  2. Calibrated sandbox: public historical data informs default weights, trade intensity, commodity volatility, and regional constraints.
  3. Scenario lab: operators load explicit datasets or scenarios and the UI labels outputs as scenario results, not predictions.
  4. Digital-twin adapter: future enterprise deployments connect private business data to a bounded model through capabilities, validation, and audit. This is outside the first game slice.

The first playable Earth-scale model should be small:

  • 6-10 macro-regions;
  • 20-30 goods;
  • 5 transport modes;
  • a few currencies and commodity indexes;
  • scripted shocks such as port closures, drought, strikes, energy spikes, supplier compliance holds, credit tightening, and demand surges.

That is enough to expose real enterprise behaviors without burying the capOS message under an economics project. The player should understand why a procurement agent needs supplier-risk limits, why a logistics agent needs bounded reroute authority, why a finance agent needs hedging and credit controls, and why compliance can block a profitable supplier.

Real-World Data Grounding

Real-world sources should calibrate the sandbox, not define live truth. Public datasets and modeling references can provide structure:

  • NIST digital-twin work describes manufacturing twins as models used to observe, diagnose, predict, and optimize systems, with validation, lifecycle, and system-of-systems concerns. capOS should borrow the validation and lifecycle framing without claiming the game is an operational twin.
  • OECD Inter-Country Input-Output tables provide a consistent statistical structure for production, consumption, investment, and international trade flows by country and economic activity. They are a good model for regional supply-chain topology.
  • World Bank WITS provides access to international merchandise trade, tariff, and related trade datasets. That fits scenario calibration for trade restrictions, import exposure, and tariff shocks.
  • FRED exposes macroeconomic time series through an API. That is useful for optional scenario inputs such as interest rates, inflation, commodity prices, and recession or credit-stress presets.
  • Agent-based and hybrid simulation tools such as AnyLogic treat companies, products, vehicles, facilities, and supply-chain participants as agents when their individual timing, behavior, and constraints matter. That maps well to capOS services and capability-scoped business agents.
  • Research on autonomous supply-chain digital twinning supports the idea that multi-agent systems can implement supply-chain monitoring and decision frameworks, while still requiring a concrete technical architecture.

Relevant public grounding:

Every imported dataset or derived calibration should have provenance in the scenario metadata. The UI should distinguish:

  • authored game constants;
  • calibrated constants derived from public historical data;
  • operator-provided scenario inputs;
  • simulated outputs generated inside capOS.

That distinction is part of the enterprise message. Agents should not be allowed to launder uncertain data into apparent authority.

Earth-Scale Business Mechanics

The Earth-scale layer should make agents reason about location and exposure:

  • Regional advantage: regions differ in energy cost, labor availability, regulation, transport access, and industrial base.
  • Trade dependence: goods can depend on intermediate inputs from other regions, making supplier concentration visible.
  • Transport chokepoints: ports, canals, rail corridors, air cargo, and trucking capacity can fail or become expensive.
  • Policy friction: tariffs, sanctions, export controls, permitting, and compliance checks can block otherwise profitable routes.
  • Currency and credit: exchange-rate movement and interest rates affect procurement, debt, and inventory financing.
  • Climate and resilience shocks: weather, drought, power-grid stress, and insurance cost can interrupt production or logistics.
  • Market expectations: futures, insurance, and stock prices can reflect anticipated shortages or agent-driven speculation.

Each mechanic should exist only if it creates a capability or policy decision:

  • Can the logistics agent reroute through a more expensive port?
  • Can procurement accept a new supplier with a higher compliance risk?
  • Can finance hedge fuel exposure?
  • Can operations shift production to a different region?
  • Can the executive approve an emergency budget override?
  • Can compliance freeze a supplier after a sanctions update?
  • Can HR replace or retrain an agent whose decisions repeatedly fail policy or resilience checks?

The game should make the authority boundary the interesting part of global scale. The map is valuable because it creates business pressure; capOS is valuable because it governs the agents responding to that pressure.

User Experience

The first usable surface can be text-based, matching existing capOS demos:

status
agents
agent procurement caps
grant procurement market.steel.buy --limit 5000
orders
market steel quotes
approve po-1042
audit recent
revoke procurement market.steel.buy

Later UI surfaces should present the same authority model:

  • operations dashboard: orders, inventory, facilities, bottlenecks;
  • agent control panel: running agents, capabilities, budgets, approvals;
  • audit timeline: actions, denials, policy reasons, and business impact;
  • policy console: approval thresholds, supplier rules, emergency grants;
  • market screen: prices, contracts, quotes, exposure, and forecasts.

The experience should avoid hiding policy behind configuration. Authority and audit are core mechanics. Players should use them repeatedly.

Progression

Progression should move from manual control to delegated enterprise operation:

  1. Manual workshop: make, sell, buy inputs, inspect status.
  2. First automation: authorize one machine or background job.
  3. Department agents: procurement, finance, operations, logistics.
  4. Policy gates: budgets, approval thresholds, supplier restrictions.
  5. Contracts: customer orders, delivery deadlines, penalties.
  6. Regional supply chain: warehouses, transport delays, local shortages.
  7. Markets: spot goods, capacity auctions, hedging, credit.
  8. Public company: shares, debt, investor pressure, acquisitions.
  9. Multi-company simulation: competitors, suppliers, partner agents.
  10. Enterprise operating mode: humans set strategy while agents execute bounded workflows under audit.

Each stage should introduce one new authority problem. That keeps the game addictive while reinforcing the product message.

Integration With Existing Demos

The current Paperclips demo is a credible seed because it already has:

  • resources;
  • pricing;
  • staged automation;
  • explicit projects;
  • terminal gameplay;
  • QEMU proof coverage;
  • a server/client direction.

The next step should not be to build a full economy immediately. A practical path is:

  1. rename the long-term direction around an enterprise simulation while keeping Paperclips as the tutorial product;
  2. add a company status model: cash, inventory, orders, facilities, and simple ledger events;
  3. add one procurement agent with read-only recommendations;
  4. add scenario manifest and run-record capture for the proof path;
  5. grant that agent a bounded quote capability;
  6. add purchase authority behind a policy threshold;
  7. add typed event records for every agent proposal, approval, denial, and action;
  8. add deterministic metric reducers for the proof path;
  9. add a minimal HR record for that agent: role, compensation, review state, and active capability bundle;
  10. add one supply shock scenario that requires either approval or revocation;
  11. prove offboarding by revoking the procurement agent’s capabilities and transferring pending work to a replacement;
  12. split server-owned typed status and command discovery so richer clients can render business state without duplicating rules.

This keeps the proof bounded while moving the demo from “idle game” to “enterprise agent OS showcase.”

Success Criteria

The showcase is successful when a viewer can see:

  • an agent attempts a useful business action;
  • the action succeeds only because the agent holds the right capability;
  • the same action fails after revocation;
  • an over-budget or restricted action escalates for approval instead of executing;
  • the audit log explains who acted, through which capability, under which policy, and with what result;
  • business consequences are visible in inventory, cash, production, delivery, and market state;
  • experiment mode compares at least two controller regimes on the same seeded scenario;
  • HR state changes such as hiring, promotion, transfer, and offboarding affect capabilities, authority, and business continuity;
  • experiment records expose provenance, typed event streams, transcript boundaries, metrics, and redacted audit evidence through reader capabilities.

The technical proof should include deterministic QEMU coverage for at least:

  • grant a procurement capability;
  • agent creates or proposes a purchase;
  • policy approval allows a bounded purchase;
  • revocation blocks the same purchase path;
  • audit output contains the grant, action, approval or denial, and result;
  • business state changes only on the authorized path;
  • a real-Earth-inspired scenario labels its data provenance and does not present simulated outputs as live-world predictions;
  • experiment output records scenario seed, controller type, policy bundle, denied actions, approvals, artificial-agent labor events, and replayable audit evidence;
  • an agent mobility proof shows a portable artifact regranted under policy while company caps, company-confidential memory, and raw audit records stay behind;
  • metrics are derived from typed event records by deterministic reducers rather than from terminal transcript scraping or model self-report.

Non-Goals

This proposal does not require:

  • real enterprise integrations in the first slice;
  • real employment law, real worker surveillance, or real HR decision support;
  • real money, real supplier APIs, or production trading;
  • a general-purpose accounting system;
  • a broad GUI before the terminal proof is credible;
  • unconstrained autonomous agents;
  • using language-model output as authority;
  • hiding OS policy behind game-only rules;
  • claiming the game predicts the real economy, real market prices, or real geopolitical outcomes;
  • treating a successful simulation run as evidence that agents are safe for real enterprise deployment without separate integration, validation, and policy review;
  • treating simulated agent employment outcomes as guidance for real human employment decisions.

The game should stay a sandbox. Its job is to demonstrate enterprise authority mechanics safely before any real business connector exists.

Risks

The main risk is product-message dilution. If the demo is presented as a game first, it weakens the enterprise claim. The game must constantly surface the business control plane: delegation, policy, approval, audit, revocation, and least privilege.

The second risk is scope explosion. Supply chains, stock markets, finance, and agents can become an endless simulation project. The implementation should add one market mechanism only when it proves a new authority concept.

The third risk is fake autonomy. If agents are scripted too heavily, the demo does not prove agent management. If they are unconstrained, the demo becomes unsafe and nondeterministic. The first slices should use deterministic agents or fake-model decisions with the same capability and audit path later live models will use.

The fourth risk is overinterpreting experiment results. A successful scenario means the configured agents performed well under one modeled pressure set. It does not prove general enterprise competence. The docs and UI should present results as scenario evidence with provenance, not as claims about real-world business readiness.

The fifth risk is anthropomorphic drift. Agent careers make the simulation more useful, but the product should not blur simulated agent labor with human employee management. HR mechanics exist to test capability mobility, offboarding, incentives, continuity, and organizational design for artificial agents.

Positioning

Use enterprise language:

  • agent operations with least privilege;
  • business automation under OS-enforced policy;
  • auditable delegated authority;
  • revocable agents for real workflows;
  • run agents like accountable digital workers, not scripts;
  • every action has identity, authority, policy, and trace.

Avoid vague positioning:

  • “AI operating system” without a concrete authority model;
  • “agent playground”;
  • “factory game”;
  • “autonomous company” without controls.

The enduring claim should be simple:

capOS lets businesses test and delegate work to agents because the OS, not the prompt, enforces authority and records what happens.

Proposal: Chat As Multimedia Substrate

How capOS should design Chat as a unified text + audio + video transport interface for human-to-human, human-to-agent, and service-driven channels – mapped cleanly to WebRTC for browser participants – so that adding a new messaging surface (operator chat, agent prompt input, audio call, video call, file drop) does not require a new top-level capability or a new gateway DTO.

This proposal is the resolution of the “Chat as messaging substrate” research task in docs/tasks/README.md. It does not replace the existing Chat interface in schema/capos.capnp directly; it specifies the shape the next iteration of that interface should take, and it states what stays separate (notably: approvals).

Problem

The existing Chat interface (schema/capos.capnp:372-378) is a text-only, poll-based room: join, leave, send(text), who, poll(maxEvents) -> List(ChatEvent) where ChatEvent.kind is one of message|joined|left|system|history. That works for the chat-server demo and for a denial probe, but it cannot carry:

  • incoming events without polling (every browser tab paying for a poll loop is the wrong end-state);
  • audio frames (low-latency, lossy, ordered);
  • video frames (high-bandwidth, key-frame-aware);
  • file/binary attachments (bounded, integrity-checked);
  • structured non-text payloads that other surfaces want to share, e.g. agent prompts with tool-call hints, presence beacons, typing indicators, reactions.

Adjacent proposals each invent their own transport for what is fundamentally the same shape:

  • realtime-voice-agent-shell-proposal.md defines VoiceSession with openCapture / openPlayback and a RealtimeModelSession with RealtimeInputEvent / RealtimeOutputEvent. Audio frames flow on MemoryObject-backed media rings rather than capnp payloads.
  • llm-and-agent-proposal.md defines tool-call records and a per-tool permission gate, but never says how the operator talks to a running agent (send a prompt, get a partial response stream, push audio, receive audio).
  • remote-session-capset-client-proposal.md exposes one chatSend DTO method per chat, with no audio/video path at all.

Each proposal independently arrives at “we need a stream-of-events transport with capability-mediated subscription”. The right design is to share one substrate. Chat is already the user-facing name; the substrate should be Chat, extended.

WebRTC is the existing browser-side abstraction that solves the same problem (text via DataChannel, audio via audio tracks, video via video tracks, all under one peer connection with negotiated codecs and ICE-managed connectivity). A capOS Chat channel should map onto a WebRTC peer connection cleanly enough that a browser participant can be implemented as a WebRTC peer talking to a capOS-side gateway, without translation gymnastics.

Goals

  • Carry text, audio, video, and bounded binary attachments on the same chat cap, with capability-gated subscription per kind.
  • Replace poll with listener caps the channel calls back, so capnp-rpc participants do not poll. Keep poll available as a transport-stopgap for DTO clients during the migration to capnp-rpc.
  • Carry low-latency frames (audio, video) without copying them through capnp message payloads on the hot path – use MemoryObject-backed media rings or shared frame buffers, with the chat cap conveying control and frame metadata only.
  • Map cleanly to WebRTC for browser participants so the gateway can act as a signalling and ICE-relay endpoint without leaking raw browser handles to capOS code.
  • Preserve the existing capability model: capability = invoke gate; channel membership = render gate. A subscriber cap is required to receive text events; a separate audio-subscriber cap is required to receive audio frames; a separate video-subscriber cap is required to receive video.
  • Preserve session-bound invocation: the chat-cap holder’s session is the caller; channel servers see the live opaque session-scoped reference and may be granted disclosure scopes per session-bound-invocation-context-proposal.md.
  • Strict ocap discipline. Every Chat capability is granted explicitly by a holder that already has it. There is no protocol-level “request permission to write to me” flow: until a recipient (or a chain authorized by the recipient) shares a peer cap with the sender, the sender has no path. Rephrased: capabilities flow forward only, by deliberate sharing.
  • Cap lineage and transitive revocation are substrate-level invariants, enforced by the Chat service with kernel support. Lineage is a service concern, not a kernel one (per capOS’s “prefer userspace capability wrappers over kernel-side policy checks” principle). The root of every chat-cap lineage tree is the Chat service’s own root cap – the cap chat-server holds for “I run this Chat service”. The manifest is Chat service configuration, not kernel or broker configuration: chat-server reads it at startup and uses its root cap to materialize the configured groups and channels. Every cap chat hands out is parented somewhere in chat-server’s internal tree; ultimately every chain terminates at chat-server’s root. Cross-principal sharing goes through a chat-server method (GroupMember.invite, DiscoverableGroupJoin.join, DiscoverableChannelTextSubscribe.subscribe, etc.), which mints a fresh derived cap and records its parent. Raw bearer transfer of chat caps is blocked by the kernel via transfer_policy enforcement (see Open Questions). Revocation walks the tree and rotates the kernel-level cap epoch of every descendant in the revoked branch; subsequent dispatch fails closed at the kernel site (epoch rotation is already an existing kernel-level mechanism). This is what makes “a member started inviting spam bots into the group” recoverable: revoke the spammer’s branch; their downstream invitees go with them; unrelated siblings – and unrelated branches under the same group – are untouched.
  • Chat session sees callers via session-bound identity, not via a user-info cap. Per session-bound-invocation-context-proposal.md, the kernel attaches an opaque session-scoped reference to every invocation. Chat-server uses that reference to route messages, populate sender fields per its disclosure policy, and identify who joined which group, without holding any “look up user X” cap.
  • Telegram-shaped channel categories. Groups (with nested topics, owner
    • admin role hierarchy, extensible permissions), broadcast channels (read-only for subscribers), DMs, and end-to-end-encrypted DMs as a distinct cap layer. There is no special “system room” category – system-managed channels are just channels owned by service principals or designated admin principals (capOS already treats services as principals; see user-identity-and-policy-proposal.md PrincipalKind including service).
  • Keep backpressure tractable: outgoing media uses capnp -> stream for flow-controlled writes; incoming media listener caps may indicate drop-vs-queue policy in the subscription request.

Non-Goals

  • Replacing WebRTC for browser-to-browser P2P. capOS is the gateway; the browser still uses WebRTC primitives. We map them onto the gateway-held Chat cap, not the other way around.
  • Replacing RealtimeModelSession (realtime-voice-agent-shell-proposal.md) for agent-runtime ↔ model-provider transport. That session is a different layer: it carries provider-specific events (RealtimeInputEvent / RealtimeOutputEvent) between the runner and an external model API. The operator-facing surface (operator talks to the running agent, agent speaks back) is a chat; the agent runner bridges the two.
  • Replacing ApprovalClient / ApprovalGrant (shell-proposal.md:407-427). Action approvals are a separate capability. A chat may surface an approval request as a message event with a payload referencing an ApprovalGrant, but the cap holding the approval state stays distinct. See ## Approvals Stay Separate below.
  • Carrying raw on-the-wire codec bytes inside capnp payloads in the hot path. Frame metadata travels on capnp; frame bodies travel via shared memory or provider-owned handles.
  • Defining a global chat name registry. Channels are scoped: a chat cap hands you a specific server-owned room; how rooms get named lives in the hosting service (chat-server, adventure-server, agent runner, etc.).
  • File-transfer protocol design (resume, integrity, deduplication). Bounded attachments are in scope; large-file transfer reuses a separate File or ContentStore cap, with Chat carrying only the reference.

Architecture

flowchart LR
  subgraph capos[capOS]
    chatsrv[chat-server / agent-runner / adventure-server]
    ch[chat cap - per chat]
    chatsrv --> ch
  end

  subgraph rust[Trusted Rust backend]
    wrk[Per-session worker holds chat cap]
    listeners[ChatListener, AudioSink, VideoSink listener caps]
    wrk -- subscribe(listener) --> ch
    ch -- listener.post(event) --> listeners
    listeners --> appstate[AppState - text history buffer, audio ring, video ring]
  end

  subgraph browser[Browser]
    js[Browser JS - text view models, WebRTC peer for audio/video]
  end

  appstate -- text events as view models --> js
  appstate <-- WebRTC SDP/ICE signalling via /api/chat/webrtc --> js
  appstate <-- audio frames via WebRTC audio track --> js
  appstate <-- video frames via WebRTC video track --> js

Three layers, three transports:

  1. capnp-rpc, between capOS and the trusted Rust backend. Listener caps for incoming text events. -> stream methods for outgoing audio/video frames. Frame metadata on capnp; frame bodies on MemoryObject-backed rings shared between the worker process and the gateway.

  2. Trusted Rust backend bookkeeping. The backend holds the chat cap, buffers a bounded text history, and owns the audio/video media rings. Browser-visible state stays in view models.

  3. HTTP + WebRTC, between the trusted Rust backend and the browser. Text events flow as JSON view models on the existing /api/* HTTP surface. Audio and video flow through a WebRTC peer connection: the browser does the SDP offer; the backend produces an answer using a small capOS-side WebRTC adapter (or relays SDP to a capOS-side WebRTC service); audio/video tracks carry the frames the backend got via the media rings.

Schema Sketch

This is a sketch, not the final wire shape. Field numbers, exact param names, and struct nesting will be finalized when the implementation iteration starts; what matters here is the shape.

The substrate is not one interface. Role caps, discovery caps, contact caps, DM peer caps, listener caps, and outgoing-media caps are distinct interfaces because they have distinct authorities. Possessing a cap is the authority; calling a method that returns a derived cap is just a normal method call (no separate “redeem” step exists). The cap class’s transfer_policy (kernel-enforced) forbids raw bearer transfer between principals; sharing must go through chat-server’s derive*-shaped methods.

Naming convention (Telegram-aligned). Three concrete chat categories:

  • Group – multi-party two-way chat. Roles: GroupOwner, GroupAdmin, GroupMember. Supports nested topics.
  • Channel – broadcast (read-only for subscribers). Roles: ChannelOwner, ChannelAdmin, ChannelPublisher, plus the per-media-facet subscriber caps ChannelTextSubscriber / ChannelAudioSubscriber / ChannelVideoSubscriber. The substrate has no type-erased generic ChannelSubscriber; the result type of a subscribe path tells the caller exactly which media facets it grants (see schema below).
  • DM – direct message between two principals. Caps: DmPeer, E2EDmPeer. Established via ContactCap.

The unqualified word “channel” in this proposal only refers to a Telegram-style broadcast Channel. Any generic “stream of events” or “thing you can subscribe to” is called a chat (the substrate-level term). Base interfaces use the Chat prefix (ChatEndpoint, ChatWriter, ChatDirectory, ChatInfo, ChatKind); concrete roles use the category prefix (Group*, Channel*, Dm*).

# Identity / describe surface every chat-cap embeds (except pure
# listener caps and revokers). Holding ChatEndpoint alone grants
# nothing beyond inspecting metadata.
interface ChatEndpoint {
  describe @0 () -> (info :ChatInfo);
}

# ============================================================
# Per-kind read facets. The interface IS the permission: holding
# ChatTextReader grants subscribeText authority and ONLY that.
# Audio and video are separate caps. A text-only role does not
# expose subscribeAudio / subscribeVideo at all -- there is no
# runtime check for "are you allowed to read audio"; the absence
# of the method is the gate.
# ============================================================
interface ChatTextReader extends(ChatEndpoint) {
  subscribeText @0 (listener :TextListener,
                    options :SubscribeOptions) -> (sub :Subscription);
}

interface ChatAudioReader extends(ChatEndpoint) {
  subscribeAudio @0 (listener :AudioSink,
                     options :AudioSubscribeOptions) -> (sub :Subscription);
}

interface ChatVideoReader extends(ChatEndpoint) {
  subscribeVideo @0 (listener :VideoSink,
                     options :VideoSubscribeOptions) -> (sub :Subscription);
}

# ============================================================
# Per-kind write facets. Each writer extends the corresponding
# reader (a writer is also a reader of the same kind). Concrete
# roles compose the kinds they need.
# ============================================================
interface ChatTextWriter extends(ChatTextReader) {
  send           @0 (event :ChatOutboundEvent) -> ();
  postAttachment @1 (descriptor :AttachmentDescriptor) -> ();
}

interface ChatAudioWriter extends(ChatAudioReader) {
  openAudioOut @0 (format :AudioFormat) -> (track :AudioOut);
}

interface ChatVideoWriter extends(ChatVideoReader) {
  openVideoOut @0 (format :VideoFormat) -> (track :VideoOut);
}

# Convenience: full-multimedia writer. Most roles in this proposal
# extend this one; a "text-only group member" role would extend
# only ChatTextWriter, exposing strictly fewer methods.
interface ChatWriter extends(ChatTextWriter, ChatAudioWriter, ChatVideoWriter) {}

# ============================================================
# Group: multi-party two-way chat with topics + voice/stage rooms
# and an Owner/Admin/Member role hierarchy. Roles inherit upward:
# Owner is an Admin is a Member is a ChatWriter is a ChatEndpoint.
# ============================================================
interface GroupMember extends(ChatWriter) {
  rooms        @0 () -> (rooms :List(RoomInfo));
  # Each per-room accessor returns a kind-specific facet so
  # joining a text topic does not grant audio/video subscribe.
  textRoom     @1 (roomId :Text) -> (writer :ChatTextWriter);
  voiceRoom    @2 (roomId :Text) -> (room :VoiceRoom);
  stageRoom    @3 (roomId :Text) -> (room :StageRoom);
  callSurface  @4 () -> (calls :CallSurface);
  # `invite` returns the bearer token (handed to the invitee via
  # chat-server-mediated cap delivery), an issuer-held revoker,
  # AND the GroupCapRef of the issuance lineage node so the
  # caller can pass it to `GroupAdmin.describeBranch` /
  # `revokeBranch` later without having to walk the lineage to
  # find it. Splitting token from revoker prevents the invitee
  # or any downstream holder from revoking their own invite --
  # the InviteToken interface has no revoke method.
  invite       @5 (forSubject :PrincipalRef, lifetime :UInt64)
                   -> (token     :InviteToken,
                       revoker   :InviteRevoker,
                       inviteRef :GroupCapRef);
  # Out-of-band invite path. Returns BEARER-SECRET bytes the
  # issuer delivers via paper / QR / non-chat channel, the
  # issuer-side `revoker`, AND the `inviteRef` GroupCapRef
  # naming the issuance lineage node (analogous to `invite`).
  # The bytes name a distinct lineage node in chat-server's
  # tree (the issuance entry); any holder plus a Self cap can
  # redeem them via Self.acceptInviteCode(code). Treat them
  # with the same care as any bearer secret: do not log, do
  # not include in transcripts, do not expose to untrusted
  # observers, prefer bounded lifetimes and one-time-use
  # semantics. The `inviteRef` is non-secret and safe to log.
  inviteCode   @6 (lifetime :UInt64)
                   -> (code      :Data,
                       revoker   :InviteRevoker,
                       inviteRef :GroupCapRef);
  acceptInvite @7 (token :InviteToken) -> (member :GroupMember);
  leave        @8 () -> ();
}

interface GroupAdmin extends(GroupMember) {
  removeMember          @0 (memberRef :Data) -> ();
  # Both `revokeBranch` and `describeBranch` accept any lineage
  # node ref -- a member cap, an admin cap, an inviteCode lineage
  # node, or a transformation operation node (from
  # mergeIntoGroupAsTopic / moveTopicHere / extractTopicAsGroup).
  # Revoking a transformation node epochs the entire grafted
  # subtree; revoking a member cap epochs that member and the
  # invitees they admitted. See the BranchInfo schema for the
  # node kinds chat-server may return.
  revokeBranch          @1 (node :GroupCapRef) -> ();
  setMemberInvitePolicy @2 (policy :MemberInvitePolicy) -> ();
  createRoom            @3 (config :RoomConfig) -> (info :RoomInfo);
  removeRoom            @4 (roomId :Text) -> ();
  setRoomPolicy         @5 (roomId :Text, policy :RoomPolicy) -> ();
  # Per-principal ban list (deny-list for FUTURE mints only).
  # `banPrincipal` only adds the principal to the group's
  # ban list, so subsequent `DiscoverableGroupJoin.join()`,
  # `Self.acceptInvite` / `acceptInviteCode`, and
  # admin-mint paths fail closed with `principalBanned` for
  # this principal. It does NOT kick the principal's existing
  # caps; that's `revokeBranch`'s job. Without the deny-list,
  # a previously-revoked principal who still holds a
  # `DiscoverableGroupJoin` cap or a session bundle hook
  # could simply re-join and mint a fresh chain. The full
  # "kick + ban" workflow is the admin pairing
  # `GroupAdmin.revokeBranch(node :GroupCapRef)` with
  # `banPrincipal(principal :PrincipalRef)` in a single UI
  # step. The branch ref comes from one of the typed sources
  # (the `inviteRef` returned by the original
  # `GroupMember.invite(...)` tuple if the admin issued the
  # invite themselves; otherwise
  # `GroupAdmin.lookupByPrincipal(principal)` or
  # `describeRoot()` to walk the lineage tree). Raw transfer
  # of the target's bearer member cap is forbidden by
  # `transfer_policy`. The schema keeps the two concerns
  # separate so each is idempotent and individually meaningful.
  banPrincipal   @6 (principalRef :PrincipalRef) -> ();
  unbanPrincipal @7 (principalRef :PrincipalRef) -> ();
  # Admin-only stage facet. Returns a StageRoomAdmin cap whose
  # promoteToSpeaker / closeStage methods are not reachable from
  # an ordinary GroupMember.stageRoom() accessor.
  stageRoomAdmin @8 (roomId :Text) -> (admin :StageRoomAdmin);
  # Lineage inspection used during spam-bot triage and audit. The
  # caller passes a node reference; chat-server returns the
  # subtree rooted at that node (the member or operation, the
  # invitees/grafted members under it, sub-invitees, etc.) plus
  # enough metadata to drive a UI before calling `revokeBranch`.
  # Read-only.
  describeBranch @9 (node :GroupCapRef) -> (info :BranchInfo(GroupCapRef));
  # Top-down lineage walker. Returns the group's whole lineage
  # tree (subject to chat-server's truncation policy) so an
  # admin can locate a `GroupCapRef` for somebody else's
  # invitee, public-joined member, or transformation-grafted
  # member without already holding a ref. Together with
  # `lookupByPrincipal`, this closes the obtain path for
  # `describeBranch` / `revokeBranch` -- the caller does not
  # need a pre-existing ref. Read-only.
  describeRoot      @10 () -> (info :BranchInfo(GroupCapRef));
  # Convenience lookup: find the lineage nodes a given principal
  # holds in this group. May return multiple refs if the
  # principal joined via multiple paths (e.g. a manifest-bundled
  # GroupMember plus a public-join chain from a different
  # session). Returns an empty list for principals not in this
  # group. Read-only; the cap returned is by-ref handle, not the
  # principal's bearer cap.
  lookupByPrincipal @11 (principalRef :PrincipalRef)
                        -> (refs :List(GroupCapRef));
}

# Reference to a node inside this group's lineage tree. Opaque to
# the caller; chat-server uses it to look up the node. Names BOTH
# cap-bearing nodes (members/admins/etc.) AND transformation
# operation nodes (mergeIntoGroupAsTopic / moveTopicHere /
# extractTopicAsGroup), so revokeBranch / describeBranch can
# operate on the entire-graft case as well as the per-member case
# discussed under Chat-graph transformations.
struct GroupCapRef {
  nodeRef @0 :Data;    # chat-server-internal handle id
}

# Snapshot of a lineage subtree returned by describeBranch /
# describeRoot. Holds enough to render "this is who would be
# revoked" UI for both per-member kicks and entire-graft
# revocations of a transformation node. Generic over the ref
# kind so the same shape serves Group lineage (RefT =
# GroupCapRef) and broadcast-Channel lineage (RefT =
# ChannelCapRef) without losing the type-level distinction
# between Group and Channel refs.
struct BranchInfo(RefT) {
  root         @0 :LineageNode(RefT);
  totalMembers @1 :UInt32;   # cap nodes in subtree (excludes
                             # transformation op nodes)
  truncated    @2 :Bool;     # chat-server may cap deep trees
}

# Lineage nodes come in three flavours:
#  - cap-bearing nodes (member / admin / publisher / subscriber
#    caps held by a principal),
#  - transformation operation nodes (mergeIntoGroupAsTopic /
#    moveTopicHere / extractTopicAsGroup; no principal of their
#    own; just a graft point), and
#  - issuance nodes (a `ContactCap` issuance, an `InviteToken` /
#    `inviteCode` issuance, a `contactCode` issuance, or any
#    other "the issuer minted this so they can revoke its
#    downstream subtree" entry). Issuance nodes have a non-empty
#    descendants subtree once their token is redeemed.
# The shared envelope carries the ref, timestamp, parentage
# classification, and recursive children; the union arm carries
# the kind-specific data. Generic over RefT for the Group /
# Channel split.
#
# capnp generics constrain the ref type but cannot constrain the
# union arm by RefT (no dependent types in capnp). Soundness of
# "Group lineage trees only contain Group roles, Channel lineage
# trees only contain Channel roles" is therefore enforced
# at the chat-server boundary (it never emits a mismatched arm,
# and consumers may treat a mismatched arm as a chat-server
# implementation bug); the type system narrows the ref kind but
# the role kind is a documented invariant rather than a
# capnp-checked one.
struct LineageNode(RefT) {
  ref          @0 :RefT;
  joinedAtMs   @1 :UInt64;
  parentage    @2 :BranchParentage;
  children     @3 :List(LineageNode(RefT));
  union {
    capNode       @4 :CapNodeInfo;
    operationNode @5 :OperationNodeInfo;
    issuanceNode  @6 :IssuanceNodeInfo;
  }
}

# Issuance lineage node: an entry chat-server adds to its tree
# when an issuer mints a bearer-cap or bearer-secret handle whose
# downstream descendants the issuer wants to be able to revoke
# transitively. Examples: `Self.contact` / `Self.contactCode`
# (DmPeer / E2EDmPeer descendants), `GroupMember.invite` /
# `inviteCode` (GroupMember descendants), and any future
# bearer-issuance pattern. The issuer holds either a typed
# revoker cap (`InviteRevoker`, `SpeakerRevoker`) or a non-secret
# ref handle (`ContactCapRef`, `inviteRef :GroupCapRef`,
# `codeId :Data`); revoking via that handle epochs the issuance
# node and every descendant.
struct IssuanceNodeInfo {
  issuer    @0 :PrincipalRef;        # who minted the issuance
  kind      @1 :IssuanceKind;
  expiresAtMs @2 :UInt64;            # 0 = unbounded
}

enum IssuanceKind {
  contactCap        @0;   # Self.contact                       -> ContactCap (cap form)
  contactCode       @1;   # Self.contactCode                   -> bytes (code form)
  inviteToken       @2;   # GroupMember.invite                 -> InviteToken (cap form)
  inviteCode        @3;   # GroupMember.inviteCode             -> bytes (code form)
  speakerToken      @4;   # StageRoomAdmin.promoteToSpeaker    -> SpeakerToken delivered via roster
  groupAdminGrant   @5;   # GroupOwner.makeAdmin               -> GroupAdmin delivered via Self.subscribeIncoming
  channelPublisherGrant @6;  # ChannelAdmin.makePublisher      -> ChannelPublisher delivered via Self.subscribeIncoming
  channelAdminGrant @7;   # ChannelOwner.makeAdmin             -> ChannelAdmin delivered via Self.subscribeIncoming
  callHostGrant     @8;   # CallHost.promoteHost               -> CallHost delivered via CallRosterDelta
  e2eCallHostGrant  @9;   # E2ECallHost.promoteHost            -> E2ECallHost delivered via CallRosterDelta
}

struct CapNodeInfo {
  principal @0 :PrincipalRef;
  role      @1 :ChatNodeRole;        # narrowed to the chat kind
                                     # of the enclosing
                                     # BranchInfo
}

# Per-chat-kind role discriminator inside lineage nodes. capnp
# generics narrow the ref type (`RefT`) but cannot narrow the
# role-union arm to match it (capnp has no dependent types).
# Documented invariant, enforced at the chat-server boundary:
# a `BranchInfo(GroupCapRef)` only emits the `group` arm, a
# `BranchInfo(ChannelCapRef)` only emits the `channel` arm.
# Consumers walking either tree may treat a mismatched arm as a
# chat-server implementation bug (return `unexpectedRoleKind`)
# rather than as caller-induced data.
struct ChatNodeRole {
  union {
    group   @0 :GroupRole;
    channel @1 :ChannelRole;
  }
}

enum GroupRole {
  owner  @0;
  admin  @1;
  member @2;
}

# `ChatRole` is retained as an alias for `GroupRole` for any
# audit / lineage prose that referred to "the chat role" without
# distinguishing Group from Channel (e.g. older descriptions of
# manifest-bundle entries). New schema methods use `GroupRole`
# or `ChannelRole` directly; do not introduce new uses of
# `ChatRole`.
using ChatRole = GroupRole;

enum ChannelRole {
  owner           @0;
  admin           @1;
  publisher       @2;
  textSubscriber  @3;
  audioSubscriber @4;
  videoSubscriber @5;
}

struct OperationNodeInfo {
  operation       @0 :TransformationOp;
  initiator       @1 :PrincipalRef;     # caller-side admin that issued
  consent         @2 :OperationConsent;  # who provided the second
                                        # authority that authorized
                                        # the graft
  sourceTopicId   @3 :Text;             # may be empty for full-graft ops
  targetTopicId   @4 :Text;
}

# The two-cap proof consumed by chat-graph transformations is not
# always two admins. mergeIntoGroupAsTopic and moveTopicHere need
# the *other* group's admin role; extractTopicAsGroup needs the
# initiator's own Self cap (creation-quota authority), since the
# new group has no other-side admin yet. The variant tells audit
# UIs which authority shape was checked.
struct OperationConsent {
  union {
    partnerAdmin    @0 :PrincipalRef;  # mergeIntoGroupAsTopic /
                                       # moveTopicHere: the
                                       # other-group admin who
                                       # consented in the same call
    selfCreation    @1 :PrincipalRef;  # extractTopicAsGroup: the
                                       # initiator's Self cap
                                       # principal proving creation
                                       # quota; same principal as
                                       # `initiator` above
  }
}

enum TransformationOp {
  mergeIntoGroupAsTopic @0;
  moveTopicHere         @1;
  extractTopicAsGroup   @2;
}

enum BranchParentage {
  manifestBundle @0;
  publicJoin     @1;   # via DiscoverableGroupJoin.join()
  invitedCap     @2;   # via Self.acceptInvite(token)
  invitedCode    @3;   # via Self.acceptInviteCode(code)
  ownerMint      @4;   # GroupOwner.makeAdmin / similar
  transformation @5;   # parented to a TransformationOp node
  issuance       @6;   # this node IS an issuance entry
                       # (Self.contact, Self.contactCode,
                       # GroupMember.invite, inviteCode,
                       # StageRoomAdmin.promoteToSpeaker, etc.).
                       # The node's parent in the tree is its
                       # *issuer* (Self cap or role cap); the
                       # `issuance` tag distinguishes the node
                       # itself from a redeemed descendant.
}

interface GroupOwner extends(GroupAdmin) {
  # Promote a member to admin. Same delivery shape as
  # `GroupMember.invite` / `StageRoomAdmin.promoteToSpeaker`:
  # chat-server records a *promotion issuance node* in the
  # group's lineage tree (parented to the calling Owner cap)
  # and delivers the freshly minted `GroupAdmin` cap to the
  # promoted principal via that principal's `Self.subscribeIncoming`
  # (`groupAdminGranted :GroupAdmin` arm), parented under the
  # promotion node. The Owner gets back only an
  # issuer-side `RolePromotionRevoker` (revokes the promotion --
  # epoching the promoted GroupAdmin and any descendants the
  # promotee minted) plus a non-secret `promotionRef
  # :GroupCapRef` for `describeBranch` / `revokeBranch`. The
  # caller does NOT receive the target's GroupAdmin cap; raw
  # cross-principal cap delivery would violate
  # `transfer_policy`.
  makeAdmin           @0 (memberRef :Data, perms :AdminPermissions)
                          -> (revoker      :RolePromotionRevoker,
                              promotionRef :GroupCapRef);
  setGroupPolicy      @1 (policy :GroupPolicy) -> ();
  # Discoverable join is always Member-typed. There is no
  # `joinRole` argument because `DiscoverableGroupJoin.join()`
  # is fixed to return `GroupMember` (admin / owner roles are
  # minted via `GroupOwner.makeAdmin` (which produces a
  # GroupAdmin, not an Owner -- new Owners come only from the
  # manifest, `Self.startGroup`, or `extractTopicAsGroup`),
  # never via
  # public join). Removing the parameter eliminates the prior
  # mismatch where `joinRole=admin` could be advertised but
  # `.join()` would still mint only a member.
  publishDiscoverable @2 (scope :ChatDirectoryScopeRef)
                          -> (entry :ChatDirectoryEntryHandle);
  closePublicJoin     @3 (entry :ChatDirectoryEntryHandle) -> ();
  disband             @4 () -> ();
}

# Issuer-held companion to a role-promotion. Parallel to
# InviteRevoker / SpeakerRevoker. Calling `revoke()` epochs the
# promoted role cap AND every descendant the promotee minted
# under it; the promoted principal falls back to whatever role
# they held before the promotion (the substrate does not auto-
# kick them from the chat). Promoter retains this revoker
# alongside the non-secret `promotionRef` for the cap-clean
# describeBranch / revokeBranch path.
interface RolePromotionRevoker {
  describe @0 () -> (info :RolePromotionInfo);
  revoke   @1 () -> ();
}

# Bearer cap. Holding it lets the recipient call
# `Self.acceptInvite(token) -> GroupMember` (or
# `GroupMember.acceptInvite(token)` when joining via an existing
# group context). The token has NO revoke method -- bearers do
# not revoke their own invites. Revocation lives on the issuer's
# InviteRevoker cap.
interface InviteToken {
  describe @0 () -> (info :InviteInfo);
}

# Issuer-held companion to InviteToken. The InviteRevoker is
# parented to the issuer's role cap in chat-server's lineage tree.
interface InviteRevoker {
  describe @0 () -> (info :InviteInfo);
  revoke   @1 () -> ();
}

# ============================================================
# Channel (Telegram-strict: BROADCAST, not the generic word).
# Subscribers read; Publishers/Admins/Owner write. Subscribers
# do NOT extend ChatWriter -- the type system enforces RO at
# compile time.
# ============================================================
# Per-kind subscriber types. The interface IS the permission:
# a ChannelTextSubscriber holder cannot call subscribeAudio /
# subscribeVideo, regardless of runtime policy. Each variant
# composes only the readers it grants. Discovery yields the
# variant chat-server's configuration says applies to the
# scope's policy for this caller; the result type tells the
# caller exactly what they got.
interface ChannelTextSubscriber extends(ChatTextReader) {
  unsubscribe @0 () -> ();
}

interface ChannelAudioSubscriber extends(ChatTextReader, ChatAudioReader) {
  unsubscribe @0 () -> ();
}

interface ChannelVideoSubscriber extends(ChatTextReader, ChatAudioReader, ChatVideoReader) {
  unsubscribe @0 () -> ();
}

# Publisher writes; lifecycle (close the whole channel) is NOT
# here. A non-admin publisher should be able to post but not
# tear down the channel. closeChannel lives on ChannelAdmin
# below.
interface ChannelPublisher extends(ChatWriter) {}

interface ChannelAdmin extends(ChannelPublisher) {
  # Same delivery shape as `GroupOwner.makeAdmin`: chat-server
  # records a promotion issuance node parented to the calling
  # ChannelAdmin cap, delivers the freshly minted
  # `ChannelPublisher` to the promoted principal via
  # `Self.subscribeIncoming` (`channelPublisherGranted :ChannelPublisher`
  # arm), and returns only the issuer-side revoker plus a
  # non-secret promotionRef to the caller. Cross-principal
  # role-cap delivery to the promoter is forbidden.
  makePublisher    @0 (subjectRef :PrincipalRef)
                       -> (revoker      :RolePromotionRevoker,
                           promotionRef :ChannelCapRef);
  removePublisher  @1 (publisherRef :Data) -> ();
  revokeBranch     @2 (node :ChannelCapRef) -> ();
  # Per-principal ban list (deny-list for FUTURE mints only).
  # Same semantics as `GroupAdmin.banPrincipal`: `banPrincipal`
  # only updates the broadcast Channel's deny-list; existing
  # caps held by the principal are not epoched. Pair with
  # `revokeBranch` for "kick + ban".
  banPrincipal     @3 (principalRef :PrincipalRef) -> ();
  unbanPrincipal   @4 (principalRef :PrincipalRef) -> ();
  closeChannel     @5 () -> ();   # close the whole broadcast
                                  # channel (not just the
                                  # publisher's own stream)
  # Lineage queries parallel to GroupAdmin. Same purpose: an
  # admin needs `ChannelCapRef` handles to call `revokeBranch`
  # for somebody else's publisher/subscriber chain, but the
  # ChannelAdmin doesn't hold those caps. `describeBranch`
  # accepts a known node ref and returns its subtree;
  # `describeRoot` returns the whole channel lineage tree
  # (truncated per policy); `lookupByPrincipal` returns refs
  # for a given principal's caps in this channel. All
  # read-only.
  describeBranch    @6 (node :ChannelCapRef) -> (info :BranchInfo(ChannelCapRef));
  describeRoot      @7 () -> (info :BranchInfo(ChannelCapRef));
  lookupByPrincipal @8 (principalRef :PrincipalRef)
                        -> (refs :List(ChannelCapRef));
}

interface ChannelOwner extends(ChannelAdmin) {
  # Same delivery shape as the `makePublisher` and
  # `GroupOwner.makeAdmin` promotions: chat-server records a
  # promotion issuance node, delivers the freshly minted
  # `ChannelAdmin` to the promoted principal via
  # `Self.subscribeIncoming` (`channelAdminGranted :ChannelAdmin` arm),
  # and returns only the revoker plus promotionRef.
  makeAdmin           @0 (publisherRef :Data, perms :AdminPermissions)
                          -> (revoker      :RolePromotionRevoker,
                              promotionRef :ChannelCapRef);
  setChannelPolicy    @1 (policy :ChannelPolicy) -> ();
  publishDiscoverable @2 (scope :ChatDirectoryScopeRef)
                          -> (entry :ChatDirectoryEntryHandle);
  closePublicJoin     @3 (entry :ChatDirectoryEntryHandle) -> ();
}

# Reference to a node inside this broadcast Channel's lineage
# tree. Same shape as `GroupCapRef` but a distinct nominal type
# so a Group ref cannot be passed to `ChannelAdmin.revokeBranch`
# (and vice versa) at the type level. Names BOTH cap-bearing
# nodes (Channel{Owner,Admin,Publisher,*Subscriber}) AND any
# operation node a Channel might gain in the future. Opaque to
# the caller; chat-server resolves via its internal lineage table.
struct ChannelCapRef {
  nodeRef @0 :Data;
}

# ============================================================
# Rooms within a Group. Three kinds: text topics, persistent
# voice rooms (Discord-style), broadcast stage rooms (Discord
# stage / Twitter Spaces). Per-room permission overrides are
# out of scope for the first slice (extensible via RoomPolicy).
# ============================================================
enum RoomKind {
  textTopic @0;
  voiceRoom @1;
  stageRoom @2;
}

struct RoomInfo {
  roomId      @0 :Text;
  kind        @1 :RoomKind;
  displayName @2 :Text;
  topology    @3 :CallTopology;   # for voice/stage; ignored for text
  capacity    @4 :UInt32;         # 0 = unbounded (per chat-server policy)
}

# Persistent voice room (always alive while the room exists).
# Joining means entering the call already in progress in this room.
interface VoiceRoom {
  describe        @0 () -> (info :VoiceRoomInfo);
  subscribeRoster @1 (listener :CallRosterListener,
                      options :RosterSubscribeOptions)
                      -> (sub :Subscription);
  describeRoster  @2 () -> (snapshot :CallRosterSnapshot);
  join            @3 () -> (participant :CallParticipant);
}

# Stage room (broadcast voice within a Group). Subscribers listen;
# Speakers publish; admins promote a hand-raiser to speaker by
# minting a SpeakerToken (handed to the listener) plus a
# SpeakerRevoker (kept admin-side).
#
# StageRoom (member-reachable via GroupMember.stageRoom) does NOT
# carry promote authority -- ordinary members can listen, speak
# (with a token), and raise their hand, but cannot mint speaker
# tokens. Promotion lives on StageRoomAdmin, which is reached only
# through GroupAdmin (see below).
interface StageRoom {
  describe        @0 () -> (info :StageRoomInfo);
  subscribeRoster @1 (listener :CallRosterListener,
                      options :RosterSubscribeOptions)
                      -> (sub :Subscription);
  joinAsListener  @2 () -> (participant :StageListener);
  # On redemption, chat-server mints `StageSpeaker` with
  # `parent = the SpeakerToken's lineage node`. The companion
  # `SpeakerRevoker` therefore epochs both the unredeemed token
  # AND any active StageSpeaker descendant; admin pulling the
  # floor back kills live mic, not just future redemptions.
  joinAsSpeaker   @3 (token :SpeakerToken)
                      -> (participant :StageSpeaker);
  raiseHand       @4 () -> ();
}

# Admin-only stage facet. Reached via GroupAdmin.stageRoomAdmin
# (added to GroupAdmin earlier in the schema sketch); not
# obtainable from a plain GroupMember's stageRoom() accessor.
# `promoteToSpeaker` does NOT return the bearer SpeakerToken to
# the admin. Bound to listenerRef on the chat-server side and
# delivered directly to that listener via their existing
# StageRoom.subscribeRoster stream as a "you-are-now-a-speaker"
# event carrying the SpeakerToken cap reference. The admin keeps
# only the SpeakerRevoker. This avoids the cross-principal
# bearer-cap handoff problem (raw transfer is forbidden; chat
# events on the stage roster are the chat-server-mediated
# delivery path the substrate already provides).
interface StageRoomAdmin {
  describe         @0 () -> (info :StageRoomInfo);
  promoteToSpeaker @1 (listenerRef :Data)
                       -> (revoker :SpeakerRevoker);
  closeStage       @2 () -> ();
}

interface StageListener extends(ChatTextReader, ChatAudioReader) {
  leave @0 () -> ();
}

# Stage speakers are broadcast-voice only: no `publishVideo` and
# no `subscribeVideo` because the stage-room model has no video.
# Possession of `SpeakerToken` mints exactly this audio-only cap.
interface StageSpeaker extends(AudioCallParticipant) {
  yieldFloor @0 () -> ();
}

# Bearer cap held by a hand-raised listener after promotion.
# Has NO revoke method -- the admin's promotion is undone via
# the issuer-held SpeakerRevoker, parallel to InviteToken/Revoker.
interface SpeakerToken {
  describe @0 () -> (info :SpeakerTokenInfo);
}

interface SpeakerRevoker {
  describe @0 () -> (info :SpeakerTokenInfo);
  revoke   @1 () -> ();   # admin pulls the floor back
}

# ============================================================
# Ephemeral Call. Distinct from VoiceRoom: a Call has explicit
# start/end and lives within a chat (Group or DM). Use Call for
# "let's hop on a quick conference"; use VoiceRoom for "Discord
# voice channel always there". Both can coexist in a Group.
# ============================================================
interface CallSurface {
  current        @0 () -> (info :ActiveCallInfo);   # may be empty
  subscribeState @1 (listener :CallStateListener,
                     options :SubscribeOptions)
                     -> (sub :Subscription);
  startCall      @2 (config :CallStartConfig) -> (host :CallHost);
  joinCall       @3 () -> (participant :CallParticipant);
  # Roster delivery for ad-hoc calls. Same shape as
  # VoiceRoom.subscribeRoster / StageRoom.subscribeRoster, but
  # bound to whatever ad-hoc call is currently active on this
  # surface (or to the next call if none is active yet -- the
  # subscription persists across start/end transitions of the
  # surface's call until cancelled). This is the only delivery
  # path for the cap-bearing roster variants
  # (`hostGranted :CallHost`, `speakerGranted :SpeakerToken`),
  # so a participant who needs to receive a host-promotion in
  # an ad-hoc call must hold a Subscription minted here.
  subscribeRoster @4 (listener :CallRosterListener,
                      options :RosterSubscribeOptions)
                      -> (sub :Subscription);
}

# Audio-only call participation facet. Lifts every call method
# that does not pull in video authority. Used by both the full
# A/V `CallParticipant` and the audio-only `StageSpeaker`.
# Stage rooms are broadcast voice (no stage video in the model),
# so a `SpeakerToken` redemption must mint a stage participant
# that does NOT expose `publishVideo` / `subscribeVideo` -- the
# split lives at the type level here.
interface AudioCallParticipant extends(ChatAudioReader) {
  publishAudio   @0 (format :AudioFormat) -> (track :AudioOut);
  unpublishAudio @1 () -> ();
  raiseHand      @2 (raised :Bool) -> ();
  setMyMuteState @3 (muted :Bool) -> ();
  leave          @4 () -> ();
}

# Full A/V plaintext participant. Adds video publish/unpublish on
# top of the audio facet, plus inherits subscribeVideo via
# `ChatVideoReader`. Returned by every Group plaintext call
# entry point: ad-hoc `CallSurface.startCall` / `joinCall`
# AND persistent `VoiceRoom.join` (group voice rooms are
# plaintext multi-party voice, so they share this cap shape).
# DM calls do NOT use this cap: they go through a separate
# `E2ECallSurface` that returns the cipher-only
# `E2ECallParticipant` (see the End-To-End Encrypted DMs section
# below) so the keyless-host invariant holds for DM media.
# `CallParticipant` must NOT be plumbed through any DM path.
# Text-during-call goes through the parent chat's
# `ChatTextWriter`, not through the call participant cap;
# that's why `ChatTextReader` is absent here.
interface CallParticipant extends(AudioCallParticipant, ChatVideoReader) {
  publishVideo   @0 (format :VideoFormat, purpose :VideoPurpose)
                     -> (track :VideoOut);
  unpublishVideo @1 (purpose :VideoPurpose) -> ();
}

interface CallHost extends(CallParticipant) {
  mute            @0 (participantRef :Data) -> ();
  unmute          @1 (participantRef :Data) -> ();
  eject           @2 (participantRef :Data) -> ();
  # Same cross-principal-cap-delivery rule as the chat
  # role-promotion methods. The promoted participant is already
  # listening on the call's roster subscription, so chat-server
  # delivers the new `CallHost` cap to the bound participant via
  # the existing `CallRosterDelta` stream
  # (`hostGranted :CallHost` arm) rather than minting it back to
  # the calling host. Caller keeps only the issuer-side
  # `RolePromotionRevoker`. Parallels the SpeakerToken delivery
  # pattern.
  promoteHost     @3 (participantRef :Data) -> (revoker :RolePromotionRevoker);
  setRoutingMode  @4 (mode :CallRoutingMode) -> ();
  end             @5 () -> ();
}

enum VideoPurpose      { camera @0; screenShare @1; virtualScene @2; externalFeed @3; }
enum CallRoutingMode   { sfu @0; mesh @1; mcu @2; }
enum CallTopology      { peerToPeer @0; serverForwarded @1; serverMixed @2; }

interface CallRosterListener {
  update @0 (delta :CallRosterDelta) -> ();
}

# Tagged union of roster events. Most variants carry plain data;
# `speakerGranted` carries a `SpeakerToken` cap, which is the
# substrate's only delivery path for the cross-principal bearer
# cap minted by `StageRoomAdmin.promoteToSpeaker(listenerRef)`.
# Delivery is listener-bound: chat-server only emits this variant
# to the roster subscription of the listener named in
# `listenerRef` -- other listeners on the same stage roster do
# NOT see this variant for that promotion. That listener then
# calls `StageRoom.joinAsSpeaker(token)` with the cap reference
# extracted from the delta.
struct CallRosterDelta {
  union {
    participantJoined  @0 :ParticipantInfo;
    participantLeft    @1 :Data;            # participantRef
    muteChanged        @2 :MuteUpdate;
    activeSpeaker      @3 :Data;            # participantRef
    handRaised         @4 :HandRaiseUpdate;
    screenShareStarted @5 :ScreenShareInfo;
    screenShareEnded   @6 :Data;            # participantRef
    connectionQuality  @7 :QualityUpdate;
    # Stage-specific cap-bearing variants.
    speakerGranted     @8 :SpeakerToken;
    speakerRevoked     @9 :Data;            # participantRef
    # Call-host promotion cap-bearing variants. Delivered
    # listener-bound (only the listener named in
    # `CallHost.promoteHost(participantRef)` /
    # `E2ECallHost.promoteHost(participantRef)` sees the
    # variant; other roster subscribers do NOT). Parallels the
    # speakerGranted pattern.
    hostGranted        @10 :CallHost;
    e2eHostGranted     @11 :E2ECallHost;
    hostRevoked        @12 :Data;           # participantRef
  }
}

# The substrate is RECORDING-BLIND -- there is no "recording
# state" field, no "recording started" delta, and no
# protocol-level recording authority. Whoever holds a
# participant cap may locally record what they receive; a
# "shared recording" of a meeting is modeled by inviting a
# recorder principal into the call as a regular participant.

# Discovery surface owned by chat-server. Each session holds a
# ChatDirectory cap (or none) according to chat-server config.
# Search-based, not list-based: scopes can grow large, and the
# results visible to a session depend on chat-server policy that
# tests the calling session's identity. The unbounded "give me
# everything" shape is wrong; the right shape is "give me the
# entries matching this query, bounded".
#
# Note: this is *not* the filesystem `Directory` cap defined in
# `storage-and-naming-proposal.md`. The two interfaces share the
# dictionary meaning of "directory" (an enumerable namespace) but
# nothing else: filesystem `Directory` opens files; chat
# `ChatDirectory` returns join handles for chats. The
# names are deliberately disambiguated.
interface ChatDirectory {
  search @0 (query :ChatDirectoryQuery)
      -> (page :ChatDirectoryPage);
  describe @1 () -> (info :ChatDirectoryScopeInfo);
}

struct ChatDirectoryQuery {
  namePattern @0 :Text;            # optional substring/glob
  chatKind @1 :ChatKind;     # optional kind filter
  ownerKind @2 :PrincipalKind;     # optional principal-kind filter
  limit @3 :UInt32;                # bounded page size; chat-server
                                   # may further clamp
  cursor @4 :Data;                 # opaque pagination cursor
                                   # returned by a previous search
}

struct ChatDirectoryPage {
  entries @0 :List(ChatDirectoryEntry);
  nextCursor @1 :Data;             # empty when no more pages
}

struct ChatDirectoryEntry {
  chatInfo @0 :ChatInfo;
  # Each entry carries a kind-specific join cap. The interface IS
  # the permission: a Group entry hands you a DiscoverableGroupJoin
  # whose .join() returns GroupMember, a Channel entry hands you
  # one of the per-kind subscribe caps whose .subscribe() returns
  # the matching subscriber. A caller never has to downcast.
  union {
    groupJoin                  @1 :DiscoverableGroupJoin;
    channelTextSubscribe       @2 :DiscoverableChannelTextSubscribe;
    channelAudioSubscribe      @3 :DiscoverableChannelAudioSubscribe;
    channelVideoSubscribe      @4 :DiscoverableChannelVideoSubscribe;
  }
}

# Possessing one of these caps IS the policy gate. Calling the
# join/subscribe method mints a fresh role cap parented to the
# per-call join event (a fresh chain root in chat-server's lineage
# tree) -- not parented to this discoverable cap itself. So
# revoking one joiner's branch leaves siblings intact, and closing
# the discoverable route epochs the discoverable cap class without
# touching existing members.
interface DiscoverableGroupJoin {
  join @0 () -> (member :GroupMember);
}

# Each Channel directory entry yields a per-kind subscribe cap so
# the result type tells the caller exactly which media they may
# read. chat-server config decides which variant fits the calling
# session's policy.
interface DiscoverableChannelTextSubscribe {
  subscribe @0 () -> (subscriber :ChannelTextSubscriber);
}
interface DiscoverableChannelAudioSubscribe {
  subscribe @0 () -> (subscriber :ChannelAudioSubscriber);
}
interface DiscoverableChannelVideoSubscribe {
  subscribe @0 () -> (subscriber :ChannelVideoSubscriber);
}

# ============================================================
# DM (host plaintext-aware text; host-blind A/V) and E2E DM
# (host-blind everything).
#
# DmPeer extends only ChatTextWriter, NOT full ChatWriter. The
# plaintext audio/video write methods (openAudioOut /
# openVideoOut) and the plaintext audio/video subscribe methods
# (subscribeAudio / subscribeVideo from ChatAudioReader /
# ChatVideoReader) are absent at the type level. All DM media
# flows through `callSurface() -> E2ECallSurface` only -- the
# SFU-forward-only end-to-end-encrypted call surface. A
# plaintext-text DM cannot accidentally route media through a
# host-readable plaintext path because no method to do so
# exists on the cap.
# ============================================================
interface DmPeer extends(ChatTextWriter) {
  remoteFingerprint @0 () -> (info :PeerFingerprint);
  # DM calls are ALWAYS end-to-end encrypted, even when the DM
  # text is not. chat-server forwards encrypted media; key
  # exchange (DTLS-SRTP or equivalent) runs between the two peers
  # at call start.
  callSurface       @1 () -> (calls :E2ECallSurface);
  closeDm           @2 () -> ();
}

# Each principal holds a Self cap that lets them produce a contact
# cap, accept incoming invites, accept incoming DMs, revoke contact
# caps they issued, and start new groups (subject to chat-server
# config-gated quota per principal class).
interface Self {
  # Cap-form contact issuance. Returns BOTH the bearer
  # `ContactCap` (handed via chat-server-mediated cap delivery to
  # whoever should be able to DM the issuer) AND a stable
  # `ContactCapRef` -- a non-secret, issuer-side handle the issuer
  # keeps so they can later call `revokeContact(ref)`. Without a
  # separate handle the issuer would have to retain the bearer
  # cap itself to revoke it, and bearer caps go to the recipient.
  contact       @0 (lifetime :UInt64)
                    -> (contact :ContactCap, ref :ContactCapRef);
  # Code-form contact issuance. Returns BOTH the BEARER-SECRET
  # `code` bytes (suitable for paper / QR / out-of-band handoff;
  # any holder plus a Self cap can redeem via openDmFromCode /
  # openE2EDmFromCode) AND a stable `codeId` -- the non-secret
  # issuer-side handle for `revokeContactCode(codeId)`. The
  # `code` bytes embed the codeId so chat-server can find the
  # issuance lineage node without exposing the secret in the
  # revocation API. Treat the `code` with bearer-secret hygiene:
  # do not log, do not include in transcripts, prefer bounded
  # lifetimes, rate-limit redemption attempts. The codeId is a
  # plain identifier safe to store in audit logs.
  contactCode   @1 (lifetime :UInt64)
                    -> (code :Data, codeId :Data);

  revokeContact     @2 (ref :ContactCapRef) -> ();
  revokeContactCode @3 (codeId :Data) -> ();

  openDm        @4 (contact :ContactCap) -> (peer :DmPeer);
  openE2EDm     @5 (contact :ContactCap) -> (peer :E2EDmPeer);
  # Out-of-band redemption paths. Take Data, not a cap, because
  # paper/QR handoff cannot produce a cap when raw bearer
  # transfer is forbidden by `transfer_policy`. The bytes are
  # *bearer secrets* that name a distinct lineage node in
  # chat-server's tree (the issuance entry created by
  # `Self.contactCode` / `GroupMember.inviteCode`). chat-server
  # consumes the code byte-for-byte, validates it against that
  # lineage node, and mints the derived role/peer cap with
  # `parent = the code's lineage node` -- NOT directly with
  # parent = the issuer's role cap. So `Self.revokeContactCode`
  # and the invite-code's `InviteRevoker` epoch only that
  # specific code's descendants.
  openDmFromCode    @6 (code :Data) -> (peer :DmPeer);
  openE2EDmFromCode @7 (code :Data) -> (peer :E2EDmPeer);

  acceptInvite     @8 (token :InviteToken) -> (member :GroupMember);
  acceptInviteCode @9 (code :Data) -> (member :GroupMember);

  startGroup   @10 (config :GroupCreateConfig) -> (owner :GroupOwner);
  describe     @11 () -> (info :SelfInfo);

  # Inbound-DM notification surface. When some other principal
  # opens a DM to this Self via `openDm` / `openDmFromCode` /
  # `openE2EDm` / `openE2EDmFromCode`, chat-server delivers the
  # other side's peer cap (`DmPeer(self->other)` /
  # `E2EDmPeer(self->other)`) here so the receiving principal
  # can subscribe and reply. Listener is minted by the receiver
  # and carries the same lifetime as any other listener cap
  # (drop / Subscription.cancel revokes locally). The listener
  # also fires for redeemed code-form DMs (so the issuer learns
  # who claimed a `contactCode` they handed out) and for new
  # group invites accepted via `Self.acceptInvite` /
  # `acceptInviteCode` if the issuer subscribes -- the typed
  # event lets the issuer attribute incoming chains to the
  # specific contact / invite they issued.
  subscribeIncoming @12 (listener :SelfIncomingListener,
                         options :SubscribeOptions)
                         -> (sub :Subscription);
}

# Listener for chat-server-mediated cap deliveries TO a Self.
# Chat-server fires `delivered` once per inbound peer / member
# cap; the listener's owning principal extracts the cap and
# decides what to do with it (subscribe, archive, ignore, etc.).
interface SelfIncomingListener {
  delivered @0 (event :SelfIncomingEvent) -> ();
}

# Tagged union of inbound chat-server-mediated deliveries.
# `kind` discriminates the delivery flavour; `source` identifies
# WHICH issuance the delivery is parented under so the issuer
# can attribute the event to a specific contact / code / invite
# they handed out, drive a UI ("Bob just opened a DM via the
# contactCode I posted last week"), or call the matching
# revoke method.
#
# Cross-principal cap delivery rule: dmOpened / e2eDmOpened
# carry the *receiver's* peer cap (the listener owner is the
# contact issuer; the chat-server-minted cap belongs to that
# same principal, so this is NOT cross-principal delivery).
# inviteAccepted is the inviter notification arm. It carries
# *no live cap*: the issuance is identified by the envelope's
# `source.inviteRef :GroupCapRef` (the inviter already holds
# this from their original `GroupMember.invite(...)` tuple),
# and the redeemed branch is identified by
# `InviteAcceptedNotice.acceptedRef :GroupCapRef` (a NEW ref
# naming the redeemed `GroupMember` lineage node, distinct
# from the issuance node). Keeping the two refs distinct lets
# the inviter both attribute the event to its issuance entry
# AND drive `GroupAdmin.describeBranch(acceptedRef)` /
# `revokeBranch(acceptedRef)` on the specific redeemed member
# without conflating it with the issuance node.
# inviteOffered is the *invitee* notification arm and carries
# the InviteToken cap chat-server re-mints for the invitee
# under the original issuance node (same lineage rule as the
# chat-event delivery path), so the invitee can call
# Self.acceptInvite(token) -> GroupMember.
struct SelfIncomingEvent {
  receivedAtMs @0 :UInt64;
  source       @1 :IssuanceSource;    # which issuance the
                                      # delivery is parented
                                      # under
  union {
    dmOpened             @2 :DmPeer;
    e2eDmOpened          @3 :E2EDmPeer;
    inviteOffered        @4 :InviteToken;
    inviteAccepted       @5 :InviteAcceptedNotice;
    # Role-promotion delivery arms. Chat-server fires one of
    # these on the promoted principal's Self listener after
    # `GroupOwner.makeAdmin` / `ChannelAdmin.makePublisher` /
    # `ChannelOwner.makeAdmin`. The cap is parented under the
    # promotion issuance node (a chat-server-owned lineage
    # entry); revoking via the issuer's
    # `RolePromotionRevoker` epochs the cap delivered here.
    groupAdminGranted        @6 :GroupAdmin;
    channelPublisherGranted  @7 :ChannelPublisher;
    channelAdminGranted      @8 :ChannelAdmin;
    # Listener-bound delivery of a fresh GroupMember cap to a
    # principal auto-grafted into a group by mergeIntoGroupAsTopic
    # / moveTopicHere / extractTopicAsGroup. The cap is parented
    # under the transformation operation node; revoking via the
    # entire-graft path (`revokeBranch(transformationRef)`)
    # epochs every grafted cap.
    transformationGrafted    @9 :GroupMember;
  }
}

# Typed identifier for the issuance an incoming delivery is
# parented under. Lets a listener match an event to the
# specific issuance call that produced the delivery (contact /
# code / invite / role promotion). capOS sends the variant
# that fits the delivery flavour: contact-cap deliveries carry
# `contactRef`, code redemptions carry `codeId`, invite
# deliveries carry `inviteRef`, group role-promotion
# deliveries carry `groupPromotionRef`, channel role-promotion
# deliveries carry `channelPromotionRef`.
struct IssuanceSource {
  union {
    contactRef          @0 :ContactCapRef;
    codeId              @1 :Data;
    inviteRef           @2 :GroupCapRef;
    groupPromotionRef   @3 :GroupCapRef;
    channelPromotionRef @4 :ChannelCapRef;
    transformationRef   @5 :GroupCapRef;   # mergeIntoGroupAsTopic /
                                           # moveTopicHere /
                                           # extractTopicAsGroup
                                           # operation node
  }
}

# Inviter-side notification when the invitee redeems a
# previously-issued InviteToken / inviteCode. Carries no live
# bearer cap (the redeemed `GroupMember` belongs to the
# invitee, and `transfer_policy` forbids handing it to the
# inviter); instead carries the issuance ref the inviter
# already holds (`source.inviteRef` on the enclosing
# `SelfIncomingEvent`) plus the redeemed branch's
# `acceptedRef :GroupCapRef` so the inviter can call
# `GroupAdmin.describeBranch(acceptedRef)` /
# `revokeBranch(acceptedRef)` if needed.
struct InviteAcceptedNotice {
  invitee     @0 :PrincipalRef;
  acceptedRef @1 :GroupCapRef;        # the redeemed GroupMember
                                      # branch root in the
                                      # group's lineage tree
}

# Issuer-held, non-secret revocation handle returned alongside a
# bearer `ContactCap` from `Self.contact()`. Opaque to the
# caller; chat-server uses it to look up the contact's issuance
# lineage node so `Self.revokeContact(ref)` can epoch that node
# and any DmPeer / E2EDmPeer chains parented under it. Unlike
# the bearer `code` returned by `Self.contactCode`, this handle
# is safe to log in audit, persist in the issuer's "contacts I
# issued" UI list, etc. Distinct from `GroupCapRef` to avoid
# accidentally reusing the same opaque ref across different
# substrates' revocation surfaces.
struct ContactCapRef {
  refId @0 :Data;     # chat-server-internal handle id
}

# ============================================================
# Group lifetime policy + creation config. A Group is persistent
# by default; ephemeral variants auto-disband when their lifetime
# trigger fires. The substrate exposes lifetime as a Group-level
# property; topics and rooms inherit the parent group's lifetime.
# ============================================================

struct GroupLifetime {
  union {
    persistent       @0 :Void;
    ephemeralOnEmpty @1 :Void;        # auto-disband when no member is
                                       # present in any room of the
                                       # group (text idle + voice idle
                                       # + stage idle), not just when
                                       # the roster goes empty
    deadline         @2 :UInt64;      # absolute disband time, ms since epoch
    ephemeralOnIdle  @3 :UInt64;      # disband after N ms with no activity
  }
}

struct GroupCreateConfig {
  displayName    @0 :Text;
  lifetimePolicy @1 :GroupLifetime;
  initialInvites @2 :List(ContactCap);   # ocap-clean: must already
                                         # have ContactCap for each
                                         # invitee. NO cold-call admit.
}

# ============================================================
# Chat-graph transformations. Every transformation that crosses
# group boundaries is a TWO-CAP operation: caller proves authority
# on one side, receiver-of-method on the other. chat-server
# validates both before mutating its internal lineage tree.
# ============================================================

enum MergeMemberPolicy {
  autoInvite     @0;   # mint fresh GroupMember(target) for source
                       # members not already in target; deliver
                       # listener-bound to each principal via
                       # `Self.subscribeIncoming`
                       # (`transformationGrafted :GroupMember`
                       # arm, `source.transformationRef` carrying
                       # the operation node's `GroupCapRef`,
                       # whichever transformation invoked the
                       # policy: mergeIntoGroupAsTopic /
                       # moveTopicHere / extractTopicAsGroup).
                       # The source-group event stream only
                       # carries non-cap "you have been grafted"
                       # presence; cap delivery stays
                       # per-recipient.
  dropNonMembers @1;   # source members not in target lose access
}

# Methods added to Group role caps for lifetime + transformations.
# Real capnp doesn't have `extend X { add methods }` syntax; these
# methods are appended to the existing GroupOwner / GroupAdmin
# interfaces declared earlier in this schema sketch. Shown here in
# their own block for readability.
#
# GroupOwner (in addition to its existing methods) gains:
#
#   setLifetimePolicy @100 (policy :GroupLifetime) -> ();
#   # Promote ephemeral -> persistent or set a new ephemeral
#   # trigger. Same group identity, same caps stay valid; only
#   # the auto-disband watcher changes.
#
#   mergeIntoGroupAsTopic
#       @101 (target       :GroupAdmin,
#             topicId      :Text,
#             memberPolicy :MergeMemberPolicy)
#               -> (topic :ChatWriter);
#   # `this` group becomes a topic under `target` group. The caller
#   # must hold both the source GroupOwner cap (this) and the
#   # target GroupAdmin cap (passed as argument). Source members
#   # not already in target are handled per `memberPolicy`. Source
#   # role caps go stale (or transparently re-bind; see Open
#   # Question).
#
# GroupAdmin (in addition to its existing methods) gains:
#
#   moveTopicHere
#       @100 (sourceGroupAdmin   :GroupAdmin,
#             sourceTopicId      :Text,
#             destinationTopicId :Text,
#             memberPolicy       :MergeMemberPolicy) -> ();
#   # Move topic from source to destination (this) group. Caller
#   # holds destination admin via `this`; sourceGroupAdmin proves
#   # authority on the source group.
#
#   extractTopicAsGroup
#       @101 (topicId     :Text,
#             lifetime    :GroupLifetime,
#             displayName :Text,
#             creator     :Self)
#               -> (owner :GroupOwner);
#   # Inverse: pull a topic out of `this` group into a brand-new
#   # standalone Group. The `creator` Self cap proves the calling
#   # principal has group-creation authority; chat-server's
#   # `Self.startGroup` policy applies here too (so a guest who
#   # cannot create groups cannot bypass the quota by extracting
#   # a topic). Caller becomes Owner of the new group; topic
#   # members auto-migrate as Members, parented to the extract
#   # operation.

# A contact cap is a chat-server-issued cap that says "any holder
# may open a DM to the issuing principal." The issuer can revoke at
# any time. Contact caps may be public (broadly shared) or narrow
# (handed to one specific principal); both shapes are the same cap
# kind, the difference is in how the issuer chose to share it.
interface ContactCap {
  describe @0 () -> (info :ContactInfo);
}

# Listener-side. Held by the receiver; minted locally.
interface Subscription { cancel @0 () -> (); }
interface TextListener { post @0 (event :ChatInboundEvent) -> (); }
interface AudioSink    { frame @0 (meta :AudioFrameMeta) -> (); }
interface VideoSink    { frame @0 (meta :VideoFrameMeta) -> (); }

# Outgoing media. Flow-controlled via `-> stream`.
interface AudioOut {
  writeFrame @0 (meta :AudioFrameMeta) -> stream;
  close @1 ();
}
interface VideoOut {
  writeFrame @0 (meta :VideoFrameMeta) -> stream;
  close @1 ();
}

enum ChatPayloadKind {
  text @0;
  presence @1;            # joined / left / typing / status
  reactionRef @2;         # reference to another event id
  approvalRef @3;         # reference to an ApprovalGrant; payload is the
                          #   grant's audit-safe descriptor, not the grant
  attachment @4;          # see AttachmentDescriptor
  custom @5;              # service-defined; opaque to the substrate
}

struct ChatOutboundEvent {
  kind @0 :ChatPayloadKind;
  text @1 :Text;          # optional, for kind=text and convenience
  data @2 :Data;          # optional structured payload
  inReplyTo @3 :Data;     # optional event id
  redactionClass @4 :Text;# audit redaction class
}

struct ChatInboundEvent {
  eventId @0 :Data;
  chatId @1 :Text;        # opaque per-chat identifier; renamed
                          # from the earlier `channel` field
                          # because "channel" is reserved for
                          # Telegram-style broadcast Channels.
                          # Holds equally for Groups, broadcast
                          # Channels, and DMs.
  sender @2 :Text;        # disclosure-policy-redacted display name
  kind @3 :ChatPayloadKind;
  text @4 :Text;
  data @5 :Data;
  inReplyTo @6 :Data;
  receivedAtMs @7 :UInt64;
}

Notes:

  • ChatEvent (the existing struct in capos.capnp) becomes ChatInboundEvent. Listener caps replace poll, but poll may stay as a deprecated, transport-stopgap method during the capnp-rpc migration.
  • AudioFrameMeta / VideoFrameMeta carry timestamps, codec hints, and a ring-buffer slot reference. Frame bodies live in MemoryObject-backed rings shared between the producer and consumer.
  • approvalRef is the only tie between this proposal and the approval surface: it lets an approval request appear in a chat as a structured message that links to an ApprovalGrant cap. The grant cap travels by capnp-rpc cap reference, not as bytes inside the message data.

WebRTC Mapping

Browser-side participants use WebRTC. The trusted Rust backend (or a capOS-side WebRTC adapter the gateway delegates to) implements the peer at the capOS end. The mapping is symmetric enough that no additional abstraction layer is needed in either direction.

Chat substrateWebRTC equivalentNotes
subscribeText(listener) + send(event)RTCDataChannel (reliable, ordered)Text events are JSON view models on the HTTP path; the WebRTC data channel may carry the same JSON for browser peers that want lower-latency text without HTTP polling.
openAudioOut, subscribeAudio(sink)RTCPeerConnection audio track (addTrack, ontrack)Codec negotiation via SDP; capOS-side adapter exposes the agreed AudioFormat.
openVideoOut, subscribeVideo(sink)RTCPeerConnection video trackSame as audio with codec/resolution negotiation.
postAttachment(descriptor)RTCDataChannel reliable chunk transfer or HTTP file fetchBounded attachments only; large transfers go through a separate File/ContentStore cap.
presence payload kindRTCPeerConnection connectionstatechange events + custom data-channel messagescapOS surfaces presence as ChatInboundEvent kind=presence.
approvalRef payload kinddata channel message with structured payloadThe approval cap stays on the capnp-rpc side; the data channel only carries the audit-safe descriptor.
ICE / SDP negotiationgateway endpoint /api/chat/webrtc/*Browser sends offer; backend produces answer; ICE candidates traded via the same endpoint. The HTTP endpoint runs on whatever userspace TCP listener cap the trusted Rust backend already holds via Networking – chat-server itself never opens a socket. The browser never receives capOS caps through this path – only WebRTC handles.
DTLS / SRTP keysWebRTC defaultDTLS / SRTP key material lives inside the WebRTC peer endpoint and never crosses to chat-server; chat-server forwards already-protected frames. TLS for the browser ↔ backend signalling channel is configured separately, composed from the certificate/trust/TLS-context caps in Certificates and TLS on top of the userspace networking surface above.

The gateway boundary stays the same: the browser receives WebRTC handles and view models. The trusted backend holds the chat cap, the listener caps, the media rings, and the WebRTC peer connection. No capOS authority object crosses to the browser.

Approvals Stay Separate

Approvals are a different surface from “may I write to you”. They already have a designed capability: ApprovalClient / ApprovalGrant (shell-proposal.md:407-427, also referenced in user-identity-and-policy-proposal.md:812). Per-tool permission modes are defined in llm-and-agent-proposal.md:105-114 (auto|consent|stepUp|forbidden). The remote CapSet UI’s “action-approval queue” is the canonical UI surface (remote-session-capset-client-proposal.md § UI Scope And Architecture).

What ApprovalClient is for: a principal that already has authority to attempt some action wants confirmation before exercising it (or the policy engine demands a step-up). Examples: agent runtime asks the operator before invoking a consent-mode tool; a destructive operation needs WebAuthn step-up; a queued write awaits human-in-the-loop sign-off.

What ApprovalClient is not for: cold-call admission. There is no flow where principal A asks the system “may I please write to B”. That request requires a cap A does not have. The substrate’s answer is: B issues a contact cap (via Self.contact()) or invites A to a shared Group via GroupMember.invite(...) (or, if B holds the broadcast Channel role, ChannelAdmin.makePublisher(...)). Without an existing cap from B’s chain, A has no protocol-level path. See “Capability Granting” above.

Chat ties to ApprovalClient in exactly one place: an approvalRef payload kind lets a chat thread display an approval request as a structured message linking to a live ApprovalGrant cap. The grant cap travels by capnp-rpc cap reference; the bytes inside the message data carry only an audit-safe descriptor. The grant state machine, the broker call, the policy check, the step-up mechanics, and the audit trail all remain on the existing ApprovalClient / AuthorityBroker.request path.

Approvals-side gaps that are still open (and tracked separately in docs/tasks/README.md):

  • Detailed ActionPlan and CapRequest schema. Both are referenced in the existing ApprovalClient sketch but not fully specified.
  • Durable approval queue / inbox shape. Today the flow is synchronous (ApprovalClient.request returns a grant cap directly); the remote CapSet UI’s queue surface implies persistence and listing. A queue cap layered on top of ApprovalClient (e.g. ApprovalQueue.list() -> List(Pending), next() -> ApprovalGrant) is a natural follow-up.

These should land in a follow-up update to shell-proposal.md / user-identity-and-policy-proposal.md, not in this Chat proposal.

Chat Categories

Telegram-aligned naming. Three concrete chat categories plus an E2E variant of DMs. Distinct cap types because they have distinct authorities; all of them sit on top of the unified ChatEndpoint / ChatWriter base interfaces.

  • Group – multi-participant, two-way. Has an Owner, zero-or-more Admins, and Members. Supports nested rooms of three kinds: text topics (sub-channels for text), voice rooms (Discord-style persistent always-on voice rooms), stage rooms (Discord-stage / Twitter-Spaces broadcast voice within the group with raise-hand to speak). Per-room permission overrides are out of scope for the first slice; RoomPolicy leaves the door open.
  • Channel (Telegram-strict: BROADCAST) – read-only for subscribers. Owner/Admin/Publisher post; Subscribers receive only. Useful for system announcements, agent status feeds, log streams, one-to-many broadcasts.
  • DM – two-participant chat. No group-level role hierarchy. Each peer holds an asymmetric DmPeer cap.
  • E2E DM – two-participant DM where the chat host carries ciphertext only. Distinct cap layer (E2EDmPeer) because key exchange, AEAD, forward-secrecy ratchets, and out-of-band fingerprint verification are concerns the unencrypted DM does not have. See “End-To-End Encrypted DMs” below.

In addition, both Groups and DMs expose an ephemeral Call surface for voice/video conferences – but with a kind-specific narrowing:

  • Groups use GroupMember.callSurface() -> CallSurface for multi-party calls; CallSurface.startCall allows setRoutingMode (sfu / mesh / mcu) so server-side mixing is available when text/audio aren’t end-to-end-encrypted.
  • DMs (both plain DmPeer and E2EDmPeer) use callSurface() -> E2ECallSurface – the SFU-forward-only surface with no setRoutingMode. Direct calls between two principals are end-to-end-encrypted at the media layer regardless of whether DM text is host-readable.

A Call has explicit start/end, distinct from the persistent VoiceRoom: use Call for “let’s hop on a quick conference”, use VoiceRoom for “Discord voice channel always there”.

There is no special “system room” category. A system-managed chat is just a chat whose Owner principal is a service principal or a designated admin principal. capOS already treats services as principals (PrincipalKind.service in user-identity-and-policy-proposal.md:91-98); a service-owned chat applies the same role/lineage rules as any other.

Naming convention. The unqualified word “channel” in this proposal refers only to the broadcast category (Telegram-style Channel). Anything generic – a stream of events, a subscription target, an A/V flow – is called a chat (the substrate-level term). Base interfaces use the Chat prefix (ChatEndpoint, ChatWriter, ChatDirectory, ChatInfo, ChatKind); concrete roles use the category prefix (Group*, Channel*, Dm*).

Substrate is recording-blind. No protocol-level “start recording” / “consent to recording” / “recording state” surface exists. Server-side recording with consent is consent theater anyway – a phone next to the speakers or a screen recorder on the recipient’s own device defeats it instantly. Recording is purely a client-side concern: whoever holds a participant cap may locally record bytes they receive. A “shared meeting recording” is modeled by inviting a recorder principal into the call – it shows up in the roster like any other participant, the social contract carries the rest.

Lifetime And Transformations

Groups have a lifetime policy chosen at creation, and the chat graph supports a small set of structure-preserving transformations.

Group lifetime

GroupLifetime is one of:

  • persistent (default): the group lives until an owner calls disband() or transforms it into something else. Manifest-created groups default to persistent.
  • ephemeralOnEmpty: chat-server auto-disbands when the last member leaves. “Spin up a quick chat with these three people; it goes away when everyone closes the tab.”
  • deadline: chat-server auto-disbands at an absolute time. “This pickup-call thread auto-archives Friday at 17:00.”
  • ephemeralOnIdle: chat-server auto-disbands after N ms with no message activity. “Self-cleanup if nobody says anything for an hour.”

Owners can change the policy at runtime via setLifetimePolicy. Going from ephemeral to persistent is “promote this ephemeral chat to a permanent one”; the same group identity persists, no caps rotate, no auto-invite happens. Going the other way (persistent -> ephemeral) is also valid – the auto-disband watcher just starts.

Lifetime applies at the Group level. Topics and rooms inherit the parent group’s lifetime; they don’t have separate auto-disband clocks. This is the right scope: rooms are sub-spaces of a group, not independent chats.

For DMs the same GroupLifetime shape can be reused (an ephemeralOnIdle DM is the natural shape for “self-destructing chat” if you ever want it), via a lifetime field on the Self.openDm config. Out of scope for this slice; the schema leaves room.

Ad-hoc group creation

Self.startGroup(config :GroupCreateConfig) -> (owner :GroupOwner) lets any principal whose Self cap permits it create a new group. chat-server policy gates this per principal class – operators typically have a creation quota; guests/anonymous don’t have Self.startGroup at all (cap absent from their bundle).

Initial invitees are passed as a List(ContactCap). This is the ocap-clean rule: you can only invite people you already have a ContactCap for. No cold-call admit. Want to spin up a Group with strangers? You can’t; you have to first arrange contact via existing channels (someone vouches by sharing your contact card, you publish a public ContactCap, etc.).

Each initial invite is delivered through the existing Self notification surface of the invitee, who can Self.acceptInvite to join. If invites are declined, the group still exists with just the creator as Owner.

Transformations

Three structural mutations of the chat graph, each a two-cap operation: the caller proves authority on one side; the receiver of the method (i.e. the cap-self) proves authority on the other. chat-server validates both before mutating its lineage tree.

Promote ephemeral to persistent. GroupOwner.setLifetimePolicy({persistent}). Single-cap (just the Owner of the ephemeral group). No member migration; same caps stay valid.

Merge a group into another as a topic. GroupOwner of the source calls mergeIntoGroupAsTopic(target :GroupAdmin, topicId, memberPolicy). After success:

  • Source group ceases to exist as a top-level group; its identity becomes a topic under target.
  • Source members not already members of target are handled per memberPolicy: autoInvite mints fresh GroupMember(target) caps for them (parented to the merge operation), and chat-server delivers each cap LISTENER-BOUND to the recipient principal via that principal’s Self.subscribeIncoming – the transformationGrafted :GroupMember arm, with source.transformationRef carrying the merge-op GroupCapRef. The fan-out source-group event stream only carries non-cap presence (a “you have been grafted into target via merge” notice) so cap delivery stays on the listener-bound surface required by transfer_policy. The alternative dropNonMembers lets the source caps go stale without minting new ones.
  • The merge operation is a node in chat-server’s lineage tree; every cap minted as part of it is parented to that node, so “revoke everything that came in via this merge” is one operation.

Move a topic between groups. GroupAdmin.moveTopicHere(sourceGroupAdmin, sourceTopicId, destinationTopicId, memberPolicy). Same two-cap shape: caller’s this is the destination admin; sourceGroupAdmin is the source. Topic members not in the destination are handled per memberPolicy. The topic-as-namespace identity moves; the topic’s history (text events, attachments) carries over.

Extract a topic into a standalone group. GroupAdmin.extractTopicAsGroup(topicId, lifetime, displayName, creator :Self). Inverse of merge – but unlike the single-extract-cap shape that would let any group admin mint a top-level Group regardless of group-creation authority, this method takes a creator :Self cap as a second argument. chat-server applies the same policy it applies to Self.startGroup (per principal class quota, ban-list checks, etc.) to the calling principal before minting the new GroupOwner. A guest or admin who is not allowed to create groups cannot bypass the quota by extracting a topic. Caller becomes Owner of the new group; topic members auto-migrate as Members; their caps are parented to the extract operation.

Authority rules

All three cross-group operations share these invariants:

  • Two-cap proof. Methods that move structure across groups take the other authority as an argument. For mergeIntoGroupAsTopic / moveTopicHere that’s the other group’s admin role cap (the partnerAdmin arm of OperationConsent in lineage queries). For extractTopicAsGroup there is no other-side group yet, so the second authority is the initiator’s own Self cap proving group-creation quota (the selfCreation arm of OperationConsent); chat-server applies the same per-principal quota / ban-list checks it applies to Self.startGroup before minting the new GroupOwner. chat-server rejects with incompatibleChatKind if the cross-group caps reference chats with incompatible kind/policy (e.g. you can’t merge an E2E DM into a non-E2E group).
  • Lineage continuity. The transformation operation is itself a node in chat-server’s tree (OperationNodeInfo arm of LineageNode returned by describeBranch); new caps minted as part of it record parent = the operation (the transformation arm of BranchParentage). Both entire-graft revocation (revokeBranch(operationNodeRef)) and per-member revocation (revokeBranch(memberCapRef)) work, and either ref kind passes through the same GroupCapRef envelope.
  • No cold-call sneak path. autoInvite looks like it might be a way to drag people into a group they didn’t agree to, but it requires both the source-group owner (who has authority over those members because they’re already in the source group) AND the target-group admin (who has authority to admit) to consent in the same call. A single party can never drag people into a group on their own; the two-cap pattern is the consent.

Lifetime interaction with conferencing

A subtle thing worth flagging: ephemeralOnEmpty interacts oddly with VoiceRooms. If a Group has a VoiceRoom and the last text-chat member leaves but two people are still connected to the voice room, the group should not auto-disband. Definition: “empty” means “no member is present in any room of the group” – text idle, voice idle, stage idle. Detail for the implementation iteration.

A merged-into-topic source group’s lifetime policy does not survive the merge. The topic now lives under the target group’s lifetime; if the source was on ephemeralOnIdle and the target is persistent, the topic becomes persistent. Worth surfacing in the merge confirmation UX. Substrate behavior: lifetimePolicy is a Group-level field; topics inherit.

Cap continuity at the holder (Open Question)

When a group merges into another as a topic, members hold caps that used to mean “send to the top of source group” and now mean “send to topic X under target group”. Three viable strategies; the substrate proposal does not lock one in:

  • Transparent redirect. Old caps keep working; chat-server’s dispatch routes calls to the new topic. describe() reveals the new identity. Pros: zero client code change. Cons: leaks “this used to be a separate group” history; may surprise users.
  • Forwarding denial. Old caps go stale with a chatMerged denial that includes a forwarding hint (event id and a reference the client can fetch to obtain the new topic cap). Pros: clean break; auditable. Cons: every client across every member needs to handle the forwarded-redirect at the call site.
  • Holder-driven re-bind. chat-server delivers a presence event to every affected member carrying the new cap; the old cap stays usable for a grace window after the merge, then goes stale. Lets clients re-bind without disruption; the eventual stale flip ensures no permanent dual identity.

The third strategy reads cleanest to me, but it benefits from prototyping. Implementation iteration will pick one.

Capability Granting

The current Chat interface in schema/capos.capnp is open-by-default: holding the system Chat cap lets a process join any channel by name and send to any channel. That is the wrong model. This section defines an ocap-disciplined replacement: every Chat capability is granted explicitly by a holder that already has it, every derived cap has a recorded parent, and revocation cascades through the derivation tree.

Cap flavours

The substrate defines four kinds of caps. The exact schema is part of the implementation iteration; the shape is what matters.

  1. Chat service root cap. Held by chat-server itself, never handed to user code. The root authority from which every other chat cap ultimately derives. Manifest configuration tells chat-server which groups and channels to materialize at startup; chat-server uses its root cap to do so. The root cap is the lineage root; it is not “ambient authority handed out by the broker” – it is service authority held by the service that runs Chat.

  2. Role caps. A role on a specific chat is a cap. Roles inherit upward; concrete role caps embed the unified ChatEndpoint / ChatWriter base interfaces.

    • GroupOwner(group) extends GroupAdmin extends GroupMember extends ChatWriter. Full authority on the group: appoint admins, create/remove rooms (text topics + voice rooms + stage rooms), change group settings, kick members, issue invites, open public-join routes, disband.
    • GroupAdmin(group) adds member/branch/room moderation and invite-policy management. Per-permission DSL (can-pin, can-invite, can-create-room, …) is future work; first slice ships a single Admin role.
    • GroupMember(group) – read and write all rooms under the group’s default policy. Members may invite others if the group’s policy allows. Members access voice/stage rooms via voiceRoom(id) / stageRoom(id) and ephemeral conferences via callSurface().
    • ChannelOwner(channel) extends ChannelAdmin extends ChannelPublisher extends ChatWriter. Full broadcast authority. Per-kind subscribers – ChannelTextSubscriber(channel) extends ChatTextReader only, ChannelAudioSubscriber(channel) extends ChatTextReader + ChatAudioReader, ChannelVideoSubscriber(channel) extends all three readers – are read-only at the type level. Promotion to publisher goes through ChannelAdmin.makePublisher.
    • DmPeer(dmId, direction) extends only ChatTextWriter (NOT full ChatWriter). DM text is host-readable; DM media is NOT – audio/video flows only through DmPeer.callSurface() -> E2ECallSurface, where chat-server forwards already-encrypted frames between peers. A→B peer cap gives A the right to push text to B; it is not symmetric. E2EDmPeer is the analogous cap for end-to-end-encrypted DMs (does not extend ChatWriter because its payloads are CipherEnvelope, not ChatOutboundEvent).
    • CallParticipant / CallHost – ephemeral conference participation; held while a Call is live, parented to the joiner’s chat role cap. Voice/stage variants have their own concrete role caps (StageListener, StageSpeaker). StageListener is parented to the joiner’s GroupMember role cap (joinAsListener is a normal accessor on the member’s stage facet); StageSpeaker is the exception — see below.
    • SpeakerToken / SpeakerRevoker – a stage-room admin’s grant of speak authority for a specific listener. Holding SpeakerToken lets that listener call StageRoom.joinAsSpeaker(token) -> StageSpeaker, and chat-server mints the resulting StageSpeaker with parent = the SpeakerToken's lineage node. The admin holds the companion SpeakerRevoker (parented to the admin’s StageRoomAdmin cap); revoker.revoke() epochs both the unredeemed token and any active StageSpeaker redeemed from it, so pulling the floor back actually kills the live speaker cap rather than just blocking future redemptions.
  3. Listener-side caps. Held by the receiver. Minted locally; never issued by anyone else. The receiver hands a listener cap to a chat role cap (Group, broadcast Channel, DM, voice/stage room) when subscribing; that role cap calls back per event. Dropping the listener (or cancelling the returned Subscription) is the receiver’s instant revocation tool.

    • TextListener
    • AudioSink
    • VideoSink
  4. Discovery / join caps.

    • ChatDirectory(scope) – read-only access to the discoverable chats (Groups and broadcast Channels) chat-server’s configuration exposes for this scope. Bundled to sessions per chat-server config (e.g. operator-class sessions get ChatDirectory(operator-scope)). Holding it lets the session call ChatDirectory.search(query) -> ChatDirectoryPage and filter by chat-server-defined criteria. Not a global index – each scope is whatever chat-server’s config carves out.
    • DiscoverableGroupJoin(group) – “you are allowed to join this group”. Returned by ChatDirectory.search(query) entries that the scope’s policy says the caller may join, or bundled directly to a session by chat-server config. Possessing it is the authority; calling DiscoverableGroupJoin.join() -> GroupMember mints a fresh role cap. There is no separate “redeem” step; possession is authority, the method just produces the derived cap.
    • DiscoverableChannelTextSubscribe(channel) / DiscoverableChannelAudioSubscribe(channel) / DiscoverableChannelVideoSubscribe(channel) – analogous for broadcast Channels. Each returns the matching per-kind ChannelTextSubscriber / ChannelAudioSubscriber / ChannelVideoSubscriber cap; the result type tells the caller exactly which media facets they hold.
    • InviteToken – a one-shot or n-shot bearer token an admin or policy-permitted member produces via GroupMember.invite(forSubject, lifetime) -> (token, revoker, inviteRef). The invitee calls Self.acceptInvite(token) -> GroupMember. The token interface has NO revoke method; revocation lives on the issuer-held companion InviteRevoker cap, parented to the issuer’s role cap in chat-server’s lineage tree. The issuer also keeps the non-secret inviteRef :GroupCapRef for the cap-clean GroupAdmin.describeBranch / revokeBranch path. (For paper / QR / out-of-band handoff where the recipient cannot receive a cap, the issuer uses GroupMember.inviteCode(lifetime) -> (code :Data, revoker, inviteRef) instead, and the recipient calls Self.acceptInviteCode(code). The bytes are bearer secrets that name a distinct lineage node in chat-server’s tree – the issuance entry created by inviteCode. On redemption chat-server mints the resulting GroupMember cap with parent = the inviteCode lineage node, NOT directly with parent = the inviter's role cap. Revoking via the companion InviteRevoker therefore epochs only that code’s descendants. See How bearer caps cross principal boundaries below for the full redemption-parent contract, and treat the bytes with bearer-secret hygiene – do not log, prefer bounded lifetimes and rate-limited redemption.)
    • SpeakerToken / SpeakerRevoker – analogous shape for stage-room speak grants. Bearer holds SpeakerToken (no revoke method); admin holds SpeakerRevoker minted via StageRoomAdmin.promoteToSpeaker(listenerRef).
    • Self.contact() – a cap a principal produces to advertise “you may DM me”. The method returns BOTH the bearer ContactCap (handed to whoever should be able to DM the issuer) AND a non-secret ContactCapRef the issuer keeps for Self.revokeContact(ref). A holder of the bearer cap calls Self.openDm(contactCap) -> DmPeer (or Self.openE2EDm(contactCap) -> E2EDmPeer). The contact-issuing principal sees the resulting DM via their own Self cap’s notification surface. Equivalent to a Telegram contact card or a published @handle; the substrate’s only guarantee is that you needed a contact cap (or its bytes form via Self.contactCode, which similarly returns both the bearer-secret code and a non-secret codeId revocation handle) to initiate.

There is no IntroCap primitive. What I formerly called “redeem an intro” is just calling a method on a DiscoverableGroupJoin / DiscoverableChannel*Subscribe, InviteToken, or contact cap that returns a derived role cap.

How bearer caps cross principal boundaries

The substrate forbids raw bearer transfer of chat caps via kernel-enforced transfer_policy. But a flow like “Alice creates an InviteToken and gives it to Bob” inherently means a cap moves from Alice’s process to Bob’s. The same applies to ContactCap sharing.

These chat-class cap transfers go through chat-server itself, never through raw IPC IPC_TRANSFER_CAP. Two paths:

  • Cap reference inside a chat event. ChatOutboundEvent.data may carry chat-server-recognized chat-class cap references (an InviteToken, a ContactCap). When a holder sends such an event with ChatTextWriter.send, chat-server inspects the payload, sees the cap reference, and on delivery to each recipient re-mints a fresh derived cap. The lineage parent for the re-minted recipient cap is the original issuance node, NOT the sender’s chat cap, so that the issuer-held revoker (e.g. ContactCapRef from Self.contact, InviteRevoker from GroupMember.invite) reaches every recipient copy and every downstream descendant when the issuer revokes. If chat-server instead parented under the sender’s chat cap, only the sender’s branch would be killed on revoke; recipient copies and the DmPeer / GroupMember caps minted from them would survive, defeating the issuer-side revocation contract. The original bearer cap stays in the sender’s table; the recipient receives a fresh cap of the same kind, parented under the issuance node. Lineage is preserved; raw bearer transfer never happens.

  • Out-of-band delivery + recipient redeem. Bytes can be exchanged through a non-chat path (paper handoff, QR code, manifest entry in a test fixture). Issuers produce the bytes through Self.contactCode / GroupMember.inviteCode; recipients redeem them via Self.openDmFromCode(code), Self.openE2EDmFromCode(code), or Self.acceptInviteCode(code).

    The bytes are bearer secrets – any holder who also has a Self cap can redeem them – so chat-server treats each issued code as a distinct lineage node in its tree, not as a transparent identifier collapsed onto the issuer’s cap. When the issuer mints a code via inviteCode / contactCode, the code’s lineage entry has parent = the issuing role/Self cap and the issuer holds the matching InviteRevoker (for inviteCode) or revokes via Self.revokeContactCode(codeId) (for contactCode). When a recipient redeems, chat-server mints the derived cap with parent = the code's lineage node, NOT directly with parent = the issuer’s cap. So:

    • Revoking a single contactCode epochs only that code’s descendants; other contact caps and codes the same issuer has handed out are unaffected.
    • Revoking an InviteToken’s revoker (or its companion inviteCode) kills the redeemed Member cap and any sub-invitees that Member produced, without affecting other invites the same admin issued.
    • The issuer-held revoker / revokeContactCode is the only way to revoke that specific handoff. Bearer copies that have not yet redeemed simply fail closed once revoked.

    Bearer-secret hygiene applies: codes have lifetimes, are bound to a single issuance entry, and chat-server may rate-limit redemption attempts per code to bound brute-force guessing.

The kernel’s transfer_policy rejection of raw IPC-cap-transfer is what closes the loophole. chat-server’s typed delivery methods (or the byte-form code paths above) are the only ways a chat-class cap reaches a new principal; lineage is recorded at chat-server side in either case.

Approval grants are NOT chat caps and are not re-minted through chat lineage. approvalRef is a payload kind that lets a chat event display an approval request, but the live ApprovalGrant cap travels by ordinary capnp-rpc cap reference between the approval service and its caller – the same way it would without chat. chat-server only forwards the audit-safe descriptor for display; if the recipient needs the actual ApprovalGrant cap, it comes from AuthorityBroker.request / ApprovalClient, not from a chat-server re-mint. Approvals stay separate (see the “Approvals Stay Separate” section).

Per-principal ban list

Rotating a member’s branch (revokeBranch(memberCap)) kicks their current chain. But if the principal still holds a DiscoverableGroupJoin (or DiscoverableChannel*Subscribe) cap, or has a session bundle hook that hands one out at login, they can call .join() / .subscribe() and mint a fresh chain. For real ban semantics, chat-server tracks a per-chat ban list:

  • Group ban. GroupAdmin.banPrincipal(principalRef) adds the principal to the group’s ban list; chat-server checks it on every Group-side mint path that could attach a fresh role cap to that principal:

    • public-join redemption: DiscoverableGroupJoin.join;
    • cap-form invite redemption from outside the group: Self.acceptInvite(token);
    • cap-form invite redemption from inside an existing group context: GroupMember.acceptInvite(token) (same wire as the Self-form, but invokable when the invitee already holds a member cap in another group and chat-server forwarded the InviteToken through that group’s chat event);
    • byte-form invite redemption: Self.acceptInviteCode(code);
    • admin-mint paths on the Group role hierarchy: GroupOwner.makeAdmin, plus any other future role-promotion methods chat-server adds to GroupOwner / GroupAdmin (Channel-side methods like ChannelAdmin.makePublisher are NOT in this list – those belong to the Channel ban below);
    • every manifest-driven session bundle hook that attaches a Group role cap at login (GroupOwner / GroupAdmin / GroupMember); and
    • every transformation-driven auto-mint path (mergeIntoGroupAsTopic / moveTopicHere with memberPolicy=autoInvite, and the per-topic-member auto-migration step inside extractTopicAsGroup).

    Without the transformation check, a source-owner plus target-admin pair could graft a banned principal back into a group via merge or move; without the login-bundle check, a banned operator who has the lobby group attached by their session profile would receive a fresh GroupMember(lobby) (or GroupAdmin(lobby)) cap on their next login and bypass the ban. Banned principals caught in a transformation are dropped from the autoInvite set with a principalBanned audit event; the transformation itself still completes for non-banned members.

  • Channel ban. ChannelAdmin.banPrincipal(principalRef) adds the principal to the broadcast Channel’s ban list; chat-server checks it when minting via DiscoverableChannelTextSubscribe.subscribe / Audio / Video, on ChannelAdmin.makePublisher, on ChannelOwner.makeAdmin, and on any Channel role cap (ChannelOwner / ChannelAdmin / ChannelPublisher / Channel{Text,Audio,Video}Subscriber) attached by manifest-driven session bundles at login (same reason as the Group case).

  • Self-creation ban via Self.startGroup. A globally banned principal whose chat-server policy disallows new groups (e.g. manifest sets Self.startGroup per principal class) cannot bypass by including a banned ContactCap in initialInvites; chat-server validates each contact against its issuer’s bans before minting auto-invites.

Banned principals get a typed principalBanned denial. unbanPrincipal removes the entry. Banning is independent of revokeBranch: revoke kicks the active chain; ban prevents new chains; an admin typically does both as a single workflow (“kick

  • ban“).

Where caps come from

The chain always terminates at chat-server’s own root cap. There is no broker-side ambient minting; the broker’s role is to hand out chat-server-issued caps that chat-server’s config has already authored for sessions matching certain profiles.

CapOriginating issuerHow a session first holds it
Selfchat-server, once per session at login from the caller’s authenticated identityparent is chat-server’s root, exactly one Self cap per (principal, session) tuple; chat-server creates it the first time the broker hands a session to chat-server. All ContactCap / contactCode / Self-driven group-creation chains terminate at this Self node, which terminates at chat-server’s root, satisfying the lineage invariant. The Self cap is never delivered cross-principal; its lifetime is the session’s lifetime.
GroupOwner (manifest-bundled)chat-server, when the manifest declares the groupbundled to the configured Owner principal’s session at login; parent is chat-server’s root, the manifest entry is its own chain
GroupOwner (Self.startGroup)chat-server, on Self.startGroup(config)parent is the calling principal’s Self cap; minting is gated by chat-server’s per-principal-class group-creation quota
GroupOwner (extractTopicAsGroup)chat-server, on GroupAdmin.extractTopicAsGroup(..., creator :Self)parent is the extract-operation lineage node (OperationNodeInfo with selfCreation consent); the extract op is itself a child of the source group’s root
GroupAdmin (manifest-bundled)chat-server, when the manifest bundles admin to a profile (e.g. the test fixture’s chat.groups.X.admins entry)parent is chat-server’s root, the manifest entry is its own chain
GroupAdmin (Owner-minted)chat-server, on GroupOwner.makeAdmin(memberRef); delivered to the promoted principal via Self.subscribeIncoming.groupAdminGrantedparent is the promotion issuance lineage node (IssuanceNodeInfo with kind groupAdminGrant); the issuance node parents to the calling GroupOwner cap. Revoking via the issuer-held RolePromotionRevoker epochs the issuance node and the promoted GroupAdmin under it.
GroupMember (manifest-bundled)chat-server, when the manifest bundles membership to a profileparent is chat-server’s root, the join is its own chain
GroupMember (public-joined)chat-server, on DiscoverableGroupJoin.join()parent is the joiner’s own root within the group (each public join is its own distinct chain)
GroupMember (invited, cap form, Self redemption)chat-server, on Self.acceptInvite(token)parent is the InviteToken issuance lineage node, which itself parents to the inviter’s role cap
GroupMember (invited, cap form, in-context redemption)chat-server, on GroupMember.acceptInvite(token) (the in-context redemption used when the invitee already holds a GroupMember cap in another group through which the inviter forwarded the InviteToken)same parent semantics as the Self-form: the InviteToken issuance lineage node, which parents to the inviter’s role cap
GroupMember (invited, code form)chat-server, on Self.acceptInviteCode(code)parent is the inviteCode lineage node, which itself parents to the inviter’s role cap
GroupMember (transformation-grafted, merge/move autoInvite)chat-server, on mergeIntoGroupAsTopic / moveTopicHere with memberPolicy=autoInviteparent is the transformation operation node (OperationNodeInfo arm of LineageNode with partnerAdmin consent); revoking the op node epochs every grafted member
GroupMember (transformation-grafted, extractTopicAsGroup)chat-server, on GroupAdmin.extractTopicAsGroup(..., creator :Self) for each existing topic member auto-migrated into the new groupparent is the extract operation node (OperationNodeInfo arm of LineageNode with selfCreation consent); revoking the op node epochs every auto-migrated member of the extracted group
ChannelOwner (manifest-bundled)chat-server, when the manifest declares the channelbundled to the configured Owner principal’s session at login; parent is chat-server’s root, the manifest entry is its own chain
ChannelTextSubscriber (public)chat-server, on DiscoverableChannelTextSubscribe.subscribe()parent is the subscriber’s own root within the channel
ChannelAudioSubscriber (public)chat-server, on DiscoverableChannelAudioSubscribe.subscribe()parent is the subscriber’s own root within the channel
ChannelVideoSubscriber (public)chat-server, on DiscoverableChannelVideoSubscribe.subscribe()parent is the subscriber’s own root within the channel
ChannelTextSubscriber / ChannelAudioSubscriber / ChannelVideoSubscriber (manifest-bundled)chat-server, when the manifest bundles a per-kind subscriber to a profileparent is chat-server’s root, the manifest entry is its own chain
ChannelPublisher (Admin-minted)chat-server, on ChannelAdmin.makePublisher(subjectRef); delivered to the promoted principal via Self.subscribeIncoming.channelPublisherGrantedparent is the promotion issuance lineage node (kind channelPublisherGrant); the issuance node parents to the calling ChannelAdmin cap. Revoking via RolePromotionRevoker epochs the issuance node and descendants.
ChannelPublisher (manifest-bundled)chat-server, when the manifest bundles publisher to a profileparent is chat-server’s root, the manifest entry is its own chain
ChannelAdmin (manifest-bundled)chat-server, when the manifest bundles admin to a profileparent is chat-server’s root, the manifest entry is its own chain
ChannelAdmin (Owner-minted)chat-server, on ChannelOwner.makeAdmin(...); delivered to the promoted principal via Self.subscribeIncoming.channelAdminGrantedparent is the promotion issuance lineage node (kind channelAdminGrant); the issuance node parents to the calling ChannelOwner cap. Revoking via RolePromotionRevoker epochs the issuance node and descendants.
DmPeer (cap form)chat-server, on Self.openDm(contactCap)parent = the ContactCap lineage node
DmPeer (code form)chat-server, on Self.openDmFromCode(code)parent = the contactCode lineage node
E2EDmPeer (cap form)chat-server, on Self.openE2EDm(contactCap)parent = the ContactCap lineage node
E2EDmPeer (code form)chat-server, on Self.openE2EDmFromCode(code)parent = the contactCode lineage node
ChatDirectory(scope)chat-server, configured per scope in the manifestbundled to sessions matching the scope’s policy
DiscoverableGroupJoin / DiscoverableChannel{Text,Audio,Video}Subscribechat-server, on ChatDirectory.search(query) for entries the scope policy allowsparent is the directory-scope’s policy entry
InviteToken (cap form)chat-server, on GroupMember.invite(...)parent is the issuing role cap (admin or member depending on policy)
inviteCode (code form, lineage node)chat-server, on GroupMember.inviteCode(...)parent is the issuing role cap
ContactCap (cap form)chat-server, on Self.contact(lifetime)parent is the issuing principal’s Self cap
contactCode (code form, lineage node)chat-server, on Self.contactCode(lifetime)parent is the issuing principal’s Self cap
InviteRevoker / SpeakerRevokerchat-server, returned alongside the matching token / promotionparent is the issuing role cap
SpeakerTokenchat-server, on StageRoomAdmin.promoteToSpeaker(listenerRef)delivered to the bound listener via stage roster events; parent is the admin cap
listener caps (TextListener, AudioSink, VideoSink)minted locally by the receivernot in any lineage chain; revocation is local drop

Manifest is Chat service configuration, not kernel or broker configuration. It declares the initial groups/channels, who owns them, who appears in which discovery scope, and which sessions are auto-bundled with which caps. chat-server reads it at boot and acts on its own root cap. The kernel only manages cap epochs and dispatch.

The broker’s role is to bundle initial caps a session needs to use what it already has – e.g. a manifest can configure that “chat-server starts with operator-lobby already created and GroupMember(operator-lobby) bundled to operator-class sessions”. The broker hands those session bundles out at login; chat-server is the issuer.

Granting flows

Operator joins the operator-lobby at boot (manifest bundle). The manifest declares chat-server’s startup config: create operator-lobby with chat-server’s own service principal as Owner; bundle GroupMember(operator-lobby) to every session whose profile is operator. At login, the broker hands the operator session a chat-server-issued GroupMember(operator-lobby) cap. The cap’s parent in chat-server’s lineage tree is “this session’s join entry” – a fresh chain root specific to this session, not shared with other operators. No approval step.

Operator joins a discoverable chat at runtime. Sessions hold a ChatDirectory(operator-scope) cap. Operator calls ChatDirectory.search(query) -> ChatDirectoryPage; chat-server returns entries matching the scope’s policy. Each entry carries a kind-specific discoverable cap depending on the chat’s kind: DiscoverableGroupJoin for a Group, or one of DiscoverableChannelTextSubscribe / DiscoverableChannelAudioSubscribe / DiscoverableChannelVideoSubscribe for a broadcast Channel. Operator picks one and calls the matching method:

  • DiscoverableGroupJoin.join() -> GroupMember(group) for a Group entry.
  • DiscoverableChannelTextSubscribe.subscribe() -> ChannelTextSubscriber(channel) (or the matching audio/video variant) for a broadcast Channel entry.

The new role cap’s parent in chat-server’s lineage is “this session’s join event” – a fresh chain root for this join, not shared with other joiners. Possession of the discoverable cap is the policy gate; calling .join() / .subscribe() mints the role cap. There is no separate “redeem” step.

An admin invites a specific person to a group. Admin holds GroupAdmin(group) (which extends GroupMember). They call GroupMember.invite(forSubject=PrincipalRef, lifetime=...) -> (token, revoker, inviteRef) (cap-form, used when the invitee can receive a chat-server-mediated cap delivery – e.g. via an existing DM) or GroupMember.inviteCode(lifetime=...) -> (code :Data, revoker, inviteRef) (byte-form, used when the invitee can only receive bearer-secret bytes through paper handoff, QR code, or non-chat channels). Both calls now also return the issuance lineage node’s inviteRef :GroupCapRef, which the issuer keeps alongside revoker for cap-clean per-branch revocation later via GroupAdmin.describeBranch / revokeBranch. The byte-form is the issuance entry described under How bearer caps cross principal boundaries: a distinct lineage node, not a transparent identifier collapsed onto the inviter. chat-server records InviteToken.parent = the calling admin role cap (cap form), or inviteCode.parent = the calling admin role cap (byte form, naming the issuance entry). The invitee calls Self.acceptInvite(token) -> GroupMember for the cap-form, or Self.acceptInviteCode(code) -> GroupMember for the byte-form; chat-server mints the member cap with parent = the InviteToken/inviteCode lineage node. Lineage is Member -> InviteToken/inviteCode -> Admin -> ... -> chat-server root. The admin’s InviteRevoker revokes that specific handoff (invalidates pre-redemption bearer copies, epochs the redeemed member’s branch).

A member invites someone (if group policy allows). Same shape as admin-invite, but the invite policy may restrict member-issued invites (single-use, n-shot, or disabled). The invitee’s resulting GroupMember cap is parented to the inviting member’s role cap, not to the admin’s; this is the per-member chain that makes spam-bot recovery work.

Spam-bot recovery (per-branch revoke). A member M used their member cap’s invite authority to admit five spam bots. Owner or admin obtains a GroupCapRef for M’s branch – without holding M’s bearer cap, since transfer_policy forbids raw bearer transfer. Two cap-clean obtain paths:

  • GroupAdmin.lookupByPrincipal(M.principal) -> List(GroupCapRef) if the admin is starting from M’s PrincipalRef. The ChatInboundEvent.sender field is a disclosure-redacted display name (text), not a PrincipalRef, so the admin gets M.principal from one of the typed surfaces that actually carry a PrincipalRef: an audit-log entry, a user-search / identity-broker UI, or by inspecting a known lineage node via describeBranch – the returned BranchInfo.root is a LineageNode, and when its union arm is capNode, the capNode.principal :PrincipalRef is the unredacted owner (issuance / operation arms have no principal of their own; walk to a capNode descendant). The redacted sender field is for display only.
  • GroupAdmin.describeRoot() -> BranchInfo(GroupCapRef) for a full top-down walk when starting from “show me the whole group’s lineage tree” (recurse into LineageNode.children; each capNode arm carries the unredacted principal :PrincipalRef for admins).

Then optionally GroupAdmin.describeBranch(node) -> BranchInfo(GroupCapRef) to render “this is who would be revoked” UI before pulling the trigger, and GroupAdmin.revokeBranch(node) to commit. chat-server rotates the kernel-level cap epoch on M’s role cap and every descendant of it – the five bots’ caps and any further sub-invitees. Subsequent dispatch through any of them fails closed. Other members of the same group, including operators who joined via the same public DiscoverableGroupJoin(group) route, are untouched because each public join produced its own distinct chain rooted at that joiner’s join event.

Closing a public-join route without kicking existing members. Two parallel cases by chat kind:

  • Group. Owner calls GroupOwner.closePublicJoin(entry) with the entry handle minted by publishDiscoverable. chat-server marks the public-join entry inactive and rotates the epoch on the shared DiscoverableGroupJoin cap class that every directory result handed out. Subsequent DiscoverableGroupJoin.join() calls fail closed; existing GroupMember(group) caps are unaffected because the discoverable cap is not in their lineage (the route is the policy that minted them, not their parent). To later re-open, owner publishes a fresh DiscoverableGroupJoin – a new cap with a fresh epoch.
  • Channel (broadcast). Owner calls ChannelOwner.closePublicJoin(entry). chat-server rotates the epoch on whichever DiscoverableChannel{Text,Audio,Video}Subscribe cap class was associated with the entry (one epoch rotation can cover all three kinds for a single Channel route or carve them separately; chat-server config decides). Existing Channel{Text,Audio,Video}Subscriber caps are unaffected – the discoverable cap is again not in their lineage.

Two principals open a DM (contact-cap path). Alice wants to be reachable. She has two options depending on how the recipient will receive the contact:

  • Cap form – `Self.contact(lifetime=…) -> (contact
    ContactCap, ref :ContactCapRef). The bearer ContactCapis shared via chat-server-mediated cap delivery (e.g. attached to a chat event in an existing group Alice is in, where chat-server re-mints it for each recipient). TheContactCapRefis Alice's non-secret revocation handle; she keeps it locally (alongside whatever metadata her UI shows in a "contacts I've issued" list) and later callsSelf.revokeContact(ref)` if she wants to retract this contact. Use this form when the recipient already has a cap-bearing channel to Alice.
  • Code formSelf.contactCode(lifetime=...) -> (code :Data, codeId :Data). The bearer-secret code bytes are shared out-of-band (pinned in Alice’s public-profile post, printed on a business card, encoded as a QR, sent over an unrelated channel); the codeId is the non-secret revocation handle Alice keeps and later passes to Self.revokeContactCode(codeId). Use this form when the recipient cannot receive a cap (no shared chat yet, or out-of-band handoff).

Bob, holding Self for his own session, calls one of the recipient methods: Self.openDm(contactCap) -> DmPeer for cap form, or Self.openDmFromCode(code) -> DmPeer for code form. chat-server mints Bob’s DmPeer(B->A) with parent = the ContactCap or contactCode lineage node, and delivers Alice’s side DmPeer(A->B) to Alice via Self.subscribeIncoming – specifically, the dmOpened :DmPeer arm of the SelfIncomingEvent union, with source :IssuanceSource carrying the contactRef :ContactCapRef Alice retained from her earlier Self.contact(...) issuance (cap-form path) or the codeId :Data from Self.contactCode(...) (code-form path). Alice’s UI matches the event to its issuance entry through that ref. Either party drops their listener subscription to stop receiving (instant); Alice may call Self.revokeContact(ref) (cap form) or Self.revokeContactCode(codeId) (code form), passing the issuer-side handle she retained from the earlier issuance call, to revoke just that contact’s branch and any DM chains derived from it, without affecting DMs Alice established via different contact caps.

Sending to an agent the operator owns. Manifest configures: when operator session starts an agent, chat-server creates a fresh agent-prompt group with operator as Owner and the agent runner’s session as a Member. Operator already holds GroupOwner(agent-prompt) because chat-server made them Owner at group creation time. No approval step. Tool consent inside the agent runner remains a separate concern handled by ApprovalClient.

Sending to an agent the operator does not own. The agent’s owner controls reachability. They publish a DiscoverableGroupJoin (or per-kind channel-subscribe) in their scope’s directory, or hand out a contact cap to a specific operator, or invite to a specific group. There is no protocol-level way to write the agent without already holding such a cap.

Listener-side filter (soft mute). Subscribers may pass options on subscribeText/Audio/Video that filter inbound events by sender lineage, e.g. muteSenderBranch(parentCapId). Sender caps may have been validly minted; filter is a soft mute, not a revocation. For hard revocation, the owner must call revokeBranch.

Worked examples

These ground the abstract granting flows in concrete scenarios that will appear in implementation iterations.

Public/system channel: making lobby reachable to all operators.

Two valid paths, both expressed as Chat-service configuration:

  • Manifest-bundled membership. The chat-server manifest declares the group and the auto-bundle policy:

    chat:
      groups:
        lobby:
          owner: principal:chat-server   # the service runs as Owner
      bundles:
        - profile: operator
          attach: GroupMember(lobby)
    

    At startup, chat-server creates lobby and prepares the attach-on-login behavior. When an operator session logs in, the broker invokes chat-server’s per-session bundle hook; chat-server mints a fresh GroupMember(lobby) cap for that session with parent = chat-server root (specifically: a per-session chain root). No two operators share the same chain. To remove one operator the admin runs the deny-list-only ban semantic as a pair of calls: GroupAdmin.revokeBranch(theirMemberRef) to epoch the current chain (the operator’s active session fails closed on the next dispatch) AND GroupAdmin.banPrincipal(theirPrincipal) to add them to the deny-list so the bundle hook does NOT mint a fresh GroupMember(lobby) on their next login. Either step alone is meaningful but incomplete: revokeBranch alone leaves the bundle hook open, and banPrincipal alone leaves the current session running. Other operators’ chains are unaffected by either step.

  • Discoverable join via ChatDirectory. chat-server’s manifest declares the lobby visible in the operator scope:

    chat:
      groups:
        lobby:
          owner: principal:chat-server
      directories:
        operator-scope:
          bundle-to: { profile: operator }
          entries:
            - group: lobby             # the entry references the
                                         # Group above; the manifest
                                         # key uses `group:` rather
                                         # than the reserved
                                         # `channel:` since lobby is
                                         # a Group not a broadcast
                                         # Channel.
              join-policy: any-holder    # anyone holding the
                                         # DiscoverableGroupJoin(lobby) entry
                                         # may call .join()
    

    Operator sessions get ChatDirectory(operator-scope) bundled at login. The operator calls ChatDirectory.search(query), sees the lobby entry with a DiscoverableGroupJoin(lobby) cap, and calls DiscoverableGroupJoin(lobby).join() -> GroupMember(lobby). Each public join is its own distinct chain: the new member’s parent is the per-session join event, not the shared per-kind discoverable cap cap. Kicking member M with GroupAdmin.revokeBranch(M) epochs M’s chain (and anyone M invited to the group) but leaves all other public-joined members intact. Because the public-join route is still open, M could re-join through it and mint a fresh chain unless the admin also calls GroupAdmin.banPrincipal(M.principal) – the deny-list-only ban primitive that blocks future mints for that principal. The full “kick + ban M” workflow is therefore the pair revokeBranch(M) + banPrincipal(M.principal); either step alone is meaningful (kick without banning lets a contrite member re-join; banning a not-currently-active principal blocks future mints without epoching anything). To stop accepting new joins from anyone, the owner calls GroupOwner.closePublicJoin(entry) with the ChatDirectoryEntryHandle returned by the matching publishDiscoverable call; chat-server epochs the DiscoverableGroupJoin(lobby) cap class. Existing members are unaffected.

The first path is right for “every operator should be in the lobby the moment they log in”; the second is right for “operators choose whether to join, and we want a single knob to stop accepting new joins without kicking existing members”. Both are configurations of the same Chat service, both produce per-member distinct chains, and neither requires a registry service outside Chat.

Cross-session messaging test (group case).

Iteration 4’s primary cross-session test exercises the default case: two sessions message each other through a shared group, which is how humans actually message each other in a Telegram-shaped system. The DM path is exercised separately because its cap-derivation chain is different.

Test fixture, in pseudo-CUE chat-server config:

chat:
  groups:
    test-lobby:
      owner: principal:chat-server
      # The DM negative-test case (case C below) needs an admin
      # cap to call GroupAdmin.revokeBranch on a misbehaving
      # invitee's chain. Manifest grants the console-tester
      # profile a GroupAdmin cap on test-lobby so that test
      # is implementable without changing the substrate; the
      # default group-test path uses only the GroupMember
      # subset of methods.
      admins: [ principal:console-tester ]
  bundles:
    - profile: console-tester
      attach: GroupAdmin(test-lobby)   # extends GroupMember
    - profile: ui-tester
      attach: GroupMember(test-lobby)

sessions:
  console:
    profile: console-tester
  ui:
    profile: ui-tester

Test flow:

  1. chat-server creates test-lobby at boot and registers the per-session bundle behavior. At login, the broker invokes chat-server’s bundle hook for each session; chat-server mints a fresh GroupAdmin(test-lobby) cap for the console session (which inherits all GroupMember methods so the group-test path below works unchanged) and a fresh GroupMember(test-lobby) cap for the UI session, each its own chain root in chat-server’s lineage tree. The admin cap is what enables Negative case C in the DM flow.
  2. Console session opens its bundled member cap, mints a TextListener, calls groupMemberCap.subscribeText(listener).
  3. UI session does the same through the trusted Rust backend.
  4. Console session calls groupMemberCap.send(event{kind=text, text="hi from console"}).
  5. UI session’s listener receives the inbound event; UI backend surfaces it as a view-model row in the browser’s chat panel.
  6. UI session sends a reply; console session’s listener receives.
  7. Test asserts both directions of the round-trip and asserts that the redacted transcript contains kind=text events from both senders without leaking session-id hex or raw cap handles.

This proves: default capset distribution works; subscribe/send round-trip works; cross-session listener delivery works.

Cross-session messaging test (DM case).

Same fixture extended with a Self cap on each session (the cap that lets a principal produce a contact cap and accept incoming DMs). Both sessions are also members of test-lobby from the group test, which is the substrate “out-of-band” channel through which the contact cap travels.

  1. Console session calls console.contact() and binds the tuple result (contactCap, ref). The bearer contactCap is a chat-server-issued cap that says “any holder may open a DM to console”; the ref :ContactCapRef is the issuer-side revocation handle Console retains (Negative case B uses it). Console session sends ONLY the bearer contactCap to the UI session through the existing test-lobby group chat (the group’s send() accepts cap references in events for exactly this purpose); it does NOT send the ref. The contact cap’s parent in chat-server’s lineage is “console session’s contact-issuance event” – a fresh chain root.

  2. UI session receives the chat event carrying the contact cap, extracts it, and calls ui.openDm(contactCap) -> DmPeer(UI->Console). chat-server mints both directions: UI’s DmPeer(UI->Console) with parent = contactCap, and Console’s own DmPeer(Console->UI) delivered via Console’s Self notification surface, with the same parent.

  3. Both sides subscribeText, exchange messages, assert round-trip.

  4. Negative case A: a third session that did not receive the contact cap cannot construct one (it has no Self.contact() path bound to the console principal). The test does not even need a denial assertion – the third session has no cap to call.

  5. Negative case B: console calls Self.revokeContact(ref), passing the ContactCapRef it retained from the earlier Self.contact(...) call. chat-server epochs the contact cap and the DmPeer chains derived from it. UI’s subsequent DmPeer.send fails closed with staleCap. The test asserts the typed denial.

  6. Negative case C: console invites a hostile third party to test-lobby via GroupMember.invite(forSubject=hostilePrincipal, lifetime=...), binding the result tuple `(token :InviteToken, revoker
    InviteRevoker, inviteRef :GroupCapRef). console keeps both revoker(issuer-side revocation handle, parented to console's admin role cap) ANDinviteRef(the issuance lineage node, anIssuanceNodeInfowith kindinviteToken); both are non-secret and stored in the fixture's "outstanding invitations" record. console delivers only tokento the hostile party through chat-server's normal cap-delivery path. The hostile party redeems withSelf.acceptInvite(token) -> GroupMember(test-lobby)and uses the cap badly. console (holdingGroupAdmin(test-lobby)` per the fixture) has two cap-clean revocation paths:
    • revoker.revoke() – the simplest path: console already holds the issuer-side handle and does not need any new ref. chat-server epochs the InviteToken’s lineage node and any descendants (the hostile member’s GroupMember cap and any sub-invitees they admitted).
    • General per-branch path: console obtains a GroupCapRef for the hostile branch through one of the typed sources declared on GroupAdmin:
      • the inviteRef returned by the original GroupMember.invite(...) tuple (if console issued the invite itself);
      • GroupAdmin.lookupByPrincipal(hostilePrincipal) if console did NOT issue the invite (e.g. when revoking somebody else’s invitee or a public-join chain);
      • or GroupAdmin.describeRoot() for a full top-down walk. Then GroupAdmin.describeBranch(node) to inspect the subtree before pulling the trigger, and GroupAdmin.revokeBranch(node) to epoch it. Raw transfer of the hostile party’s bearer GroupMember cap is NOT how console gets the ref; transfer_policy forbids that, and chat-server’s lineage queries are the cap-clean substitute.

    The test asserts the third party is gone (staleCap on the next dispatch through the revoked branch) and that UI’s DM with console is not affected (different lineage chain).

This proves: contact-cap-driven DM works; DM peer caps are direction-bound (asymmetric); revoking a contact cap propagates to derived DMs without touching unrelated caps; per-branch revocation isolates spam without cascading to siblings; no cold-call path exists.

Cap lineage and transitive revocation

Each chat host maintains an internal cap-derivation tree:

  • Every cap minted by a derive method has a recorded parent.
  • A cap’s active descendants are reachable by tree walk.
  • revokeBranch(cap) rotates the kernel cap-epoch for the cap and all its active descendants. Subsequent dispatch through any of those caps fails closed.
  • The kernel does not need to know about lineage; it only sees per-cap epochs (already an existing mechanism). Lineage tracking is the chat host’s job. The kernel enforces the cap’s transfer_policy, which forbids raw bearer transfer for chat caps – so the only way for a cap to reach a new principal is through a derive method, which records lineage.

Why service-side bookkeeping rather than kernel-tracked lineage. capOS’s stated principle (docs/capability-model.md, CLAUDE.md) is to “prefer userspace capability wrappers over kernel-side policy checks.” Lineage has a domain-specific shape per service (a chat group vs a file share vs a credential vault all want different revocation semantics), and putting it in the kernel forces every cap to carry lineage overhead even when its service does not need it. The service-side approach lets each host implement the semantics it actually needs, while leaning on existing kernel mechanisms (cap epoch, transfer policy) for enforcement.

Revocation primitives

Three independent revocation paths, all observable as typed denials:

  • Listener-side instant drop. Receiver cancel()s the Subscription cap or drops the listener. No further pushes from anyone reach that listener. This is the receiver’s primary tool for “leave me alone right now”.
  • Branch revocation by lineage. Admin calls GroupAdmin.revokeBranch(node :GroupCapRef) / ChannelAdmin.revokeBranch(node :ChannelCapRef), passing a typed lineage-node ref obtained from describeRoot / lookupByPrincipal / the inviteRef returned by an earlier invite / the various *Ref fields on SelfIncomingEvent – never a raw bearer cap (transfer_policy forbids cross-principal cap transfer; chat-server’s lineage queries are the cap-clean substitute). Issuer-held revoker caps cover the analogous bearer flows: Self.revokeContact(ref) / Self.revokeContactCode(codeId) for contact-driven DMs; InviteRevoker.revoke() for an outstanding invite; SpeakerRevoker.revoke() for stage-room speak grants; RolePromotionRevoker.revoke() for role promotions. In every case chat-server rotates the kernel epoch on the named branch. Used for “remove a misbehaving admin and everything they admitted”, “kill a contact cap that fell into spammer hands”, “shut down a topic and everyone who joined via it”. A separate operation, GroupOwner.closePublicJoin(entry) / ChannelOwner.closePublicJoin(entry), stops new joins through a DiscoverableGroupJoin / DiscoverableChannel*Subscribe route without kicking existing members (the route is the policy that minted them, not their parent in the lineage tree).
  • Chat-wide invalidation. A GroupOwner.disband / ChannelAdmin.closeChannel call invalidates the whole chat (or the room is closed, or the agent shut down). Subsequent calls return staleChannel.

Revocation is not silent. All three paths surface as typed staleCap / staleChannel denials at the next call site, with the remote CapSet UI reflecting them as kind=presence chat events (“you were removed from this group”, “this channel has closed”) or on the next operator action.

Audit

Every derive and every revocation is auditable. The host’s lineage tree is itself the audit substrate: for any cap, “who derived this, when, from which parent, with what method” is a tree query. The audit log records the caller’s session-scoped reference per session-bound-invocation-context-proposal.md. Listener subscribe/unsubscribe is auditable from the receiver’s session.

What this proposal does NOT decide

  • The exact role-permission DSL for GroupAdmin (Telegram allows per-admin granular permissions: can-pin, can-invite, can-edit; capOS’s first slice can ship a single Admin role and refine later). Schema must leave room.
  • Per-topic permission overrides within a group. First slice is group-wide policy; topics are sub-channels under the same membership.
  • Group DMs (multi-recipient DMs). Likely modeled as a Group with Owner=initiator, Members=invited principals; no fan-out DmPeer. Details in a follow-up.
  • The kernel feature for per-cap transfer_policy to forbid raw bearer transfer specifically for chat-cap-classes. capOS’s CapInfo.transfer_policy already exists as a string field; the exact policy values live in a kernel/auth follow-up. Until then, channel-host lineage tracking can still work but with a soft invariant: derive methods are the intended path; raw bearer transfer is not blocked at kernel level. The implementation iteration must close this gap before the substrate is treated as hardened.
  • The exact ActionPlan and CapRequest schemas referenced from ApprovalClient. They are an approvals-side gap, not a chat-side one.

End-To-End Encrypted DMs

End-to-end-encrypted DMs are a distinct cap layer sitting on top of the regular DM substrate, not a flag on DmPeer. Reasons to keep them separate:

  • The chat host carries ciphertext only and never sees plaintext. That is a strong invariant; making it a flag risks a code path where plaintext leaks under “encryption disabled” conditions.
  • Key exchange, authenticated encryption (AEAD), forward-secrecy ratchets (e.g. Signal-style double ratchet), and out-of-band fingerprint verification are concerns the unencrypted DM does not have. They need their own cap surface so the policy can be reasoned about per-DM.
  • Auditing differs: an unencrypted DM’s host can audit message contents per disclosure policy; an encrypted DM’s host audits metadata only (sender, recipient, timestamp, ciphertext size).

Cap shape

The E2E peer cap is routing-only. It carries opaque ciphertext between two endpoints; it never has access to plaintext or to the AEAD ratchet keys. The KeyContext lives strictly in the principal’s own process (held client-side via cryptography-and-key-management-proposal.md primitives), is never serialized into a chat-server-minted cap, and never crosses to chat-server in any method argument or return.

# E2E DM peer cap. Minted by chat-server, but holds NO key state.
# It is a pure routing endpoint: it accepts opaque ciphertext for
# delivery, and routes opaque ciphertext to a listener.
interface E2EDmPeer extends(ChatEndpoint) {
  send             @0 (envelope :CipherEnvelope) -> ();
  subscribeCipher  @1 (listener :CipherListener,
                       options :SubscribeOptions) -> (sub :Subscription);
  # Outgoing media: still flow-controlled, but the bytes have
  # already been encrypted client-side by the holder. The peer cap
  # does not see the plaintext frame, nor does it accept a key
  # context as an argument.
  openCipherOut    @2 (format :CipherStreamFormat) -> (track :CipherOut);
  remoteFingerprint @3 () -> (info :PeerFingerprint);
  callSurface      @4 () -> (calls :E2ECallSurface);
  closeDm          @5 () -> ();
}

# Listener and outgoing-media caps for E2E. Both carry opaque
# bytes; decrypt/encrypt happens in the holder's own process.
interface CipherListener {
  cipher @0 (envelope :CipherEnvelope) -> ();
}
interface CipherOut {
  writeCipherFrame @0 (envelope :CipherEnvelope) -> stream;
  close @1 ();
}

struct CipherEnvelope {
  ciphertext      @0 :Data;     # AEAD output; opaque to chat-server
  associatedData  @1 :Data;     # AEAD AAD (e.g. sequence number,
                                # ratchet header) -- routing
                                # metadata only, no plaintext
  receivedAtMs    @2 :UInt64;
}

# E2E call surface. Narrower than CallSurface: NO setRoutingMode,
# because chat-server cannot mix or transcode (it doesn't have the
# keys), so SFU-forward is the only viable mode. The constraint is
# enforced at the type level -- the method simply doesn't exist.
interface E2ECallSurface {
  current        @0 () -> (info :ActiveCallInfo);
  subscribeState @1 (listener :CallStateListener,
                     options :SubscribeOptions) -> (sub :Subscription);
  startCall      @2 (config :E2ECallStartConfig) -> (host :E2ECallHost);
  joinCall       @3 () -> (participant :E2ECallParticipant);
  # Roster delivery for E2E (DM) calls. Required for
  # `e2eHostGranted :E2ECallHost` delivery on
  # E2ECallHost.promoteHost.
  subscribeRoster @4 (listener :CallRosterListener,
                      options :RosterSubscribeOptions)
                      -> (sub :Subscription);
}

# E2ECallParticipant mirrors CallParticipant but accepts only
# already-encrypted CipherOut tracks; the participant cap does
# not handle key state. Receive is via subscribeCipher: the
# listener gets one fan-out stream of CipherEnvelope frames
# covering all participants' audio and video tracks; the
# receiver's process discriminates kind/track via the envelope's
# associatedData / sequence-id metadata and decrypts locally.
# There is no plaintext-receive method on this cap.
interface E2ECallParticipant extends(ChatEndpoint) {
  publishCipherAudio @0 (format :CipherStreamFormat) -> (track :CipherOut);
  publishCipherVideo @1 (format :CipherStreamFormat,
                         purpose :VideoPurpose) -> (track :CipherOut);
  unpublishAudio     @2 () -> ();
  unpublishVideo     @3 (purpose :VideoPurpose) -> ();
  raiseHand          @4 (raised :Bool) -> ();
  setMyMuteState     @5 (muted :Bool) -> ();
  leave              @6 () -> ();
  subscribeCipher    @7 (listener :CipherListener,
                         options :SubscribeOptions)
                         -> (sub :Subscription);
}

# Note the deliberate absence of setRoutingMode: an E2ECallHost
# cannot select mesh/MCU because chat-server is keyless and can
# only forward.
interface E2ECallHost extends(E2ECallParticipant) {
  mute        @0 (participantRef :Data) -> ();
  unmute      @1 (participantRef :Data) -> ();
  eject       @2 (participantRef :Data) -> ();
  # Same delivery pattern as `CallHost.promoteHost`: the new
  # `E2ECallHost` cap is delivered to the bound participant via
  # CallRosterDelta (`e2eHostGranted :E2ECallHost` arm), not
  # returned to the caller.
  promoteHost @3 (participantRef :Data) -> (revoker :RolePromotionRevoker);
  end         @4 () -> ();
}

Key exchange

E2E DM establishment piggybacks on the contact-cap path. The critical invariant: chat-server only ever sees ciphertext.

  1. Alice’s Self.contact() produces a contact cap whose ContactInfo includes Alice’s long-term identity public key (or a fingerprint resolvable through her published profile). Where the contact cap is shared is out-of-band relative to chat-server.
  2. Bob, holding Alice’s contact cap, calls Self.openE2EDm(contact). chat-server mints E2EDmPeer(B->A) for Bob (a routing cap with NO key state) and delivers Alice’s side E2EDmPeer(A->B) to Alice via Self.subscribeIncoming (e2eDmOpened :E2EDmPeer arm of SelfIncomingEvent).
  3. Bob and Alice run a key-exchange handshake (X3DH or similar) in their own processes. The handshake ciphertexts travel over the E2E DM channel itself; chat-server is an opaque carrier. Bob’s KeyContext is built in Bob’s process from his identity PrivateKey and Alice’s identity public key; ditto for Alice. Neither key context is ever passed to a chat-server method or stored in a chat-server-minted cap.
  4. After handshake, each side holds a KeyContext locally. To send: encrypt(plaintext, KeyContext) -> CipherEnvelope, then peer.send(envelope). To receive: peer’s listener delivers CipherEnvelope, the listener’s owning principal calls decrypt(envelope, KeyContext) -> plaintext locally.
  5. Either party may rotate keys by performing a fresh ratchet step in their own process and exchanging the new ratchet header through normal send() – no special method is required because key state never lived on the peer cap.
  6. Out-of-band fingerprint verification compares peer.remoteFingerprint() (a public-key digest, safe to expose; it is NOT the AEAD secret) with what each side knows from their contact cap.

Why this firewalls plaintext from the host

  • E2EDmPeer.send(CipherEnvelope) accepts ciphertext only. chat-server has no method to obtain the plaintext or the key context from the peer cap.
  • subscribeCipher delivers CipherEnvelope to a CipherListener; decryption happens in the listener’s owning process.
  • openCipherOut produces a CipherOut that accepts already- encrypted frames. chat-server forwards them without ever seeing plaintext.
  • The KeyContext cap is held client-side, never serialized into a chat-server-minted cap, never passed as an argument to a chat-server method. (This is enforced by the cryptography-and-key-management-proposal.md KeyContext cap’s transfer policy: not transferable to chat-server.)
  • E2E calls cannot mix/transcode because chat-server has no keys. The E2ECallSurface / E2ECallHost interfaces simply do not have setRoutingMode; the SFU-forward-only constraint is a type-level invariant rather than a runtime check.

What stays in vs out of scope here

In scope: end-to-end-encrypted DM voice/video calls. Both plain DmPeer.callSurface() and E2EDmPeer.callSurface() return E2ECallSurface. Direct calls between two principals are end-to-end-encrypted at the media layer regardless of whether the DM’s text is host-readable: chat-server forwards encrypted RTP frames (via CipherOut-style tracks), and a DTLS-SRTP-style key exchange runs between the peers at call start. The SFU-forward-only constraint is enforced at the type level on E2ECallSurface (no setRoutingMode).

Out of scope:

  • E2E for the text of a regular DmPeer stays plaintext-aware on chat-server. If you want host-blind text, use E2EDmPeer (which is a distinct cap layer with its own CipherEnvelope- shaped send/subscribe).
  • Group E2E (multi-party MLS-style ratcheting). First slice is pairwise only. Group E2E is a future iteration once pairwise is proved.
  • Cross-device synchronization (the “I want my E2E messages on a second device” problem). Out of scope.
  • Server-side recording or transcoding for E2E media. The substrate is recording-blind everywhere; for E2E media, chat-server cannot mix or transcode anyway because it has no keys – this is a direct consequence, not a separate rule.

Backpressure And Quotas

Hot-path media (audio frames at 50 Hz, video frames at 30 Hz) does not fit on a synchronous request/response model.

  • Outgoing audio/video uses -> stream so the caller can pipeline frame writes without each one waiting for an ACK; the framework applies backpressure when the buffer fills.
  • Incoming audio/video listener caps publish a bounded ring; when the consumer falls behind, the substrate drops oldest frames and reports drop count via AudioFrameMeta.dropsSinceLast (or equivalent) so the consumer can detect liveness gaps without reconstructing full frame history.
  • Per-chat quotas live in the chat cap itself (constructed by the hosting service). Per-session quotas live in the broker bundle. Two natural axes: max concurrent subscriptions per kind, max outgoing bandwidth per chat.
  • Text history buffering is bounded by the trusted Rust backend’s AppState; browser view models receive at most the last N events. The chat-cap holder may also subscribeText with a since(eventId) option to fetch a bounded backlog.

Privacy And Disclosure

Senders are surfaced through ChatInboundEvent.sender. Per session-bound-invocation-context-proposal.md, the channel server sees the caller’s opaque session-scoped reference plus freshness; it does not see raw principal/profile/account fields by default. The chat-server-side disclosure policy decides whether a sender’s display name, principal class, or profile class is included in events visible to other subscribers; default is “display name only”.

The remote CapSet UI’s redacted-transcript export rule applies here too: audio/video metadata (codec, timestamps, frame counts) may appear in transcripts; frame bodies do not.

Migration From The Existing Chat Schema

The current Chat interface (text, poll-based, single struct) stays callable during the migration. Steps in approximate order:

  1. Add the listener-cap surface (subscribeText, TextListener, the new ChatInboundEvent struct) alongside poll. Keep poll working.
  2. Migrate the chat-server demo and the per-session chat worker to push events through the listener cap. Mark poll deprecated for capnp-rpc clients but keep it for DTO clients during the remote-session transport migration (Remote Session CapSet Clients and docs/backlog/remote-session-capset-client.md Task 1).
  3. Add the audio surface (subscribeAudio, AudioSink, openAudioOut, AudioOut) once MemoryObject-backed media rings exist. The realtime voice proposal’s VoiceSession becomes the browser-side adapter that maps WebRTC tracks into Chat audio subscriptions.
  4. Add the video surface analogously. Video is feasible only after audio is proved end-to-end and the gateway-side WebRTC adapter exists.
  5. Once all subscribers are listener-cap-driven, remove poll from the substrate-level interface; service-specific shims may keep it.

Each step is a separate iteration with its own QEMU smoke and host-side proof. The first iteration on top of this proposal is the text-only listener-cap rebuild, which is also iteration 4 of the remote-session plan (real Chat panel + cross-session messaging test).

Open Questions

  • Per-cap transfer_policy enforcement at kernel level. Today CapInfo.transfer_policy is a string field on every cap (values like "stable", "session-proxy"); it is descriptive, not enforced. Cap transfer between processes happens via the SQE IPC_TRANSFER_CAP flag, which the kernel implements by copying the cap entry from sender’s CapTable into receiver’s. Today that copy succeeds regardless of transfer_policy. The substrate’s lineage invariant relies on: the only path for a chat cap to reach a new principal is through chat-server’s invite/acceptInvite/Self.openDm/etc. methods (which record lineage). But if a principal holds GroupMember(lobby) and passes that cap as a payload in an SQE to any other service via raw IPC_TRANSFER_CAP, the kernel hands a copy to that service – bypassing chat-server entirely. The lineage tree silently grows a copy with no recorded parent, and chat-server cannot revoke it. The kernel enforcement gap to close: extend SQE cap-transfer dispatch to consult transfer_policy and reject transfers whose policy class forbids cross-principal copy (chat-class caps would carry such a policy). Sharing then must go through chat-server’s typed methods, which is where lineage gets recorded. Until this gap is closed, the substrate’s lineage invariant is enforced only by convention; no implementation iteration should treat the substrate as hardened without it.
  • Cross-channel reference of contact caps. This proposal has contact caps travel “through some channel the principals already share” – e.g. a contact cap is delivered via a group chat the giver and recipient both belong to. Chat events therefore need a way to carry cap references inline (the data field on ChatOutboundEvent plus a typed payload kind, or a separate cap-attachment field on the event). The first iteration may use the existing capnp cap-passing on the outbound event; details belong with iteration 1 schema refinement.
  • Multi-modal AI agents. When the agent runtime is a Chat peer, it receives audio frames and emits audio frames. The agent runner bridges RealtimeModelSession to the relevant per-kind chat facets – typically GroupMember for an agent-prompt group, or a DmPeer / E2EDmPeer if the agent is a DM peer. Should the bridge live in the agent runner (clean) or be a generic adapter cap (RealtimeChatBridge)? The realtime-voice proposal already has the agent runner doing the bridging; this proposal preserves that.
  • Cross-session media sharing. A chat may have subscribers from multiple sessions. Does each subscription have its own session-scoped reference (yes, per session-bound-invocation-context-proposal.md), and does the chat cap retain owner-session metadata for moderation / kick? Likely yes; details in a follow-up.
  • Approval queue cap shape. Whether the queue lives on AuthorityBroker, on a new ApprovalQueue cap, or on a Notifications cap that carries approvals as one of its event kinds. Out of scope here; tracked in the approvals follow-up note above.
  • Voice barge-in semantics with WebRTC. Existing realtime-voice-agent-shell-proposal.md defines barge-in within RealtimeModelSession; mapping that onto the Chat substrate (interrupt the outgoing audio track when a presence typing event or a fresh inbound audio frame arrives) needs design before the voice iteration.

Relationship To Existing Proposals

  • realtime-voice-agent-shell-proposal.mdVoiceSession becomes the browser-side adapter into the Chat audio surface. RealtimeModelSession stays unchanged (agent runtime ↔ provider). The agent runner bridges the two when the agent is part of a chat.
  • llm-and-agent-proposal.md — “operator sends a prompt to a running agent” is a Chat text event over a channel the operator already holds (e.g. GroupOwner of an agent-prompt group the operator created, or a contact cap the agent’s owner shared). “Agent emits a partial response” is a Chat text event with inReplyTo. “Agent requests a tool with consent required” emits an approvalRef event referencing an ApprovalGrant from the existing ApprovalClient surface; ApprovalClient is not used to grant cross-principal write authority – that is always invite- or contact-cap-driven.
  • user-identity-and-policy-proposal.md — the principal model (PrincipalKind including service) is the basis for service principals owning system channels and for chat-server’s bundle and directory-scope predicates that test principal kind/profile.
  • Remote Session CapSet Clients — the remote CapSet UI’s “real Chat panel” target (iteration 4 of the plan) consumes the text-only slice of this substrate first; audio/video panels are follow-up iterations on the same backend boundary. The trusted Rust backend in that proposal is also where the WebRTC peer endpoint and the /api/chat/webrtc/* signalling endpoint described under “WebRTC Mapping” terminate; chat-server itself never holds a WebRTC handle, a TCP listener, or a TLS context.
  • Networking — the /api/chat/webrtc/* signalling endpoint, the redacted-transcript HTTP path, and any future native (non-WebRTC) Chat transport all run on userspace networking caps (NetworkManager / TcpListener / TcpSocket / UdpSocket) handed to the trusted Rust backend, not on chat-server itself. Phase C userspace decomposition of the smoltcp stack is the gating dependency: until that lands, the kernel-resident TCP listener and accepted-socket state described in the networking proposal still front any TCP-shaped Chat transport, including the WebRTC signalling endpoint.
  • Certificates and TLS — the browser ↔ backend signalling channel and any future native Chat-over-TLS transport build their TLS context from the Certificate / TrustStore / TlsServerContext / TlsClientContext caps defined there, composed on top of a PrivateKey cap from Cryptography and Key Management. Chat carries reference handles or audit-safe descriptors only; certificate material and TLS keys never reach chat-server.
  • shell-proposal.mdApprovalClient / ApprovalGrant stay as defined; this proposal references them via approvalRef.
  • session-bound-invocation-context-proposal.md — subscription identity is the session-scoped reference; Chat servers honour disclosure scopes.
  • interactive-command-surface-proposal.md — typed command palettes remain a separate concern; a chat may surface a command-palette proposal as a structured message, but the command surface itself is not Chat.
  • browser-capability-proposal.md — if a future browser tab sits inside a Chat-served pane (screen-share scenario), the browser cap rules still apply; Chat carries reference handles, not browser authority.

References

  • WebRTC API specifications: RTCPeerConnection, RTCDataChannel, audio and video tracks, SDP, ICE candidates, DTLS/SRTP. See https://webrtc.org/.
  • Cap’n Proto streaming RPC (-> stream method annotation) and listener-cap patterns: https://capnproto.org/news/2020-04-23-capnproto-0.8.html (introduces flow control), and the capnp Rust crate at v0.25 used in this repository.
  • Existing capOS proposals as cross-referenced above.

Proposal: Realtime Voice Agent Shell

How capOS should support web-shell and native-shell voice interaction when modern multimodal models can consume realtime audio and emit both audio streams and structured tool calls.

Problem

The existing language-model proposal defines a text-oriented agent runner: messages, streamed text, structured tool calls, and per-tool permission policy. That model still works, but it is incomplete for modern voice agents. Current provider APIs can run stateful realtime sessions where the model directly listens to audio, speaks audio, performs VAD/barge-in handling, and emits function calls in the same interaction.

If capOS models voice as only “ASR into text shell, then TTS the answer,” it will miss the better latency and interaction model of native realtime audio. If capOS lets provider-native sessions execute tools directly, it breaks the capability model. The design needs a middle path.

Goals

  • Support native realtime audio model sessions alongside chained ASR/text/TTS pipelines.
  • Preserve the existing agent-shell security rule: the model never holds session caps or tool caps.
  • Let WebShellGateway host terminal and voice transport without becoming an authority sink.
  • Keep microphone/speaker media out of TerminalSession text APIs.
  • Minimize and guarantee media stack latency for admitted capOS-controlled realtime islands, preferring enforceable bounds over optimistic nominal latency.
  • Support provider adapters for OpenAI Realtime, Gemini Live API, Vertex AI Live API, local ASR/TTS, and future local realtime multimodal models.
  • Carry timestamps, deadlines, transcripts, interruptions, and tool-call ids as first-class session data.
  • Make direct browser-to-provider media an optional optimization guarded by broker-minted ephemeral credentials.
  • Allow a browser agent to be the web-shell UI and orchestrate the realtime provider loop, while keeping capOS tool execution gateway-enforced.

Non-Goals

  • Implementing provider SDKs in the kernel.
  • Giving a browser any capOS capability handle.
  • Treating voice recognition, wake words, or VAD as authorization.
  • Making a realtime model’s free-form speech or text executable.
  • Guaranteeing full-path realtime behavior for browser, network, or remote provider segments. Native local media can enter guaranteed realtime islands only after scheduling contexts and device isolation mature.

Architecture

flowchart LR
    Browser[Browser UI] -->|terminal frames| Gateway[WebShellGateway]
    Browser -->|mic/playback frames| Gateway

    Gateway --> Terminal[TerminalSession]
    Gateway --> Voice[VoiceSession]

    Shell[capos-shell agent mode] --> Terminal
    Shell --> Voice
    Shell --> Runner[Agent Runner]

    Runner --> RT[RealtimeModelSession]
    Runner --> Broker[AuthorityBroker]
    Runner --> Audit[AuditLog]
    Runner --> Tools[Session tool caps]

    RT --> Provider[Realtime provider adapter]
    Provider --> Remote[OpenAI / Gemini / Vertex]
    Provider --> Local[Local model backend]

Principal split:

  • WebShellGateway authenticates browser sessions, owns browser transport, creates terminal and voice session objects, and tears down resources.
  • capos-shell in agent mode owns the session bundle and acts as the trusted runner for capOS-side agent sessions.
  • A browser agent UI may own the web conversation and provider session loop, but only as an untrusted client of WebShellGateway’s tool proxy.
  • RealtimeModelSession is a model I/O object. It carries audio, text, transcripts, tool calls, and tool results. It has no authority over capOS tools.
  • Provider adapters hold narrow provider credentials or model-runtime caps.
  • The browser holds no capOS session caps, no tool caps, no provider long-lived API keys, and no bearer tokens other than short-lived provider-scoped tokens when a direct-media optimization is explicitly enabled.

Interfaces

The exact schema belongs to the implementation milestone. The shape should be:

interface RealtimeModel {
  info @0 () -> (info :RealtimeModelInfo);
  open @1 (config :RealtimeSessionConfig)
      -> (session :RealtimeModelSession);
}

interface RealtimeModelSession {
  send @0 (event :RealtimeInputEvent) -> ();
  next @1 () -> (event :RealtimeOutputEvent, done :Bool);
  sendToolResult @2 (result :RealtimeToolResult) -> ();
  cancel @3 (reason :CancelReason) -> ();
  close @4 () -> ();
}

RealtimeInputEvent should cover:

  • audio frame reference;
  • text input;
  • image/video frame reference;
  • push-to-talk start/end;
  • playback-position feedback;
  • tool result;
  • cancel, truncate, close.

RealtimeOutputEvent should cover:

  • audio frame reference;
  • text delta;
  • partial and final transcript;
  • tool call delta and complete tool call;
  • interruption/barge-in;
  • session warning/error;
  • provider usage/cost metadata;
  • close/go-away/reconnect notice.

Audio frames should not be copied through Cap’n Proto payloads in the hot path. Use MemoryObject-backed media rings or provider-owned stream handles. Cap’n Proto remains the control plane.

Tool Calls

Realtime tool calls use the same policy as text agent calls.

sequenceDiagram
    participant Model as RealtimeModelSession
    participant Runner as Agent Runner
    participant Broker as AuthorityBroker
    participant Tool as Typed Tool Cap
    participant Audit as AuditLog

    Model->>Runner: tool_call(name, args, provider_call_id)
    Runner->>Runner: validate ToolDescriptor
    Runner->>Broker: authorize tool call
    Broker-->>Runner: auto / consent / stepUp / forbidden
    Runner->>Tool: invoke if allowed
    Tool-->>Runner: typed result
    Runner->>Audit: record decision and outcome
    Runner->>Model: tool result

The runner owns the mapping from provider call ids to capOS audit/tool-call ids in capOS-side mode. In browser-agent UI mode, WebShellGateway’s tool proxy owns that mapping. Provider ids are useful correlation metadata, but they are not authority.

Tool execution must be time-boxed. If a tool blocks too long, the runner or gateway tool proxy sends a typed timeout result back to the realtime model and continues or ends the turn according to policy.

Voice Session

VoiceSession is the shell-facing media session object created by WebShellGateway or a native terminal host.

interface VoiceSession {
  describe @0 () -> (info :VoiceSessionInfo);
  openCapture @1 (format :AudioFormat) -> (stream :AudioInputStream);
  openPlayback @2 (format :AudioFormat) -> (stream :AudioOutputStream);
  event @3 () -> (event :VoiceSessionEvent);
  close @4 () -> ();
}

For web shell, VoiceSession is backed by browser media APIs. For native capOS it can be backed by an audio device service. Either way, it is separate from TerminalSession:

  • terminal input/output remains text and presentation;
  • voice capture/playback is timestamped binary media;
  • transcripts can be rendered into the terminal, but they are not terminal input until the runner accepts them as a user turn.

Media Graph

The local media graph is a userspace service/library layer, not a kernel feature. Its latency goal is the lowest guaranteed-stable operating point for the selected device, graph, and policy: a fixed quantum with admitted CPU, memory, device, and wakeup budgets, not the smallest buffer value that can be configured.

flowchart LR
    Capture[Capture source] --> Convert[format converter / resampler]
    Convert --> Gate[VAD or push-to-talk gate]
    Gate --> Input[realtime provider adapter or local ASR]
    Input --> Runner[agent runner]
    Runner --> Output[realtime provider adapter or local TTS]
    Output --> Playback[playback sink]

For browser voice, the graph may partly live in browser JavaScript and partly in capOS services. For native hardware, the graph eventually uses audio driver services that hold DeviceMmio, DMAPool, and Interrupt capabilities.

Graph control operations are ordinary endpoint calls:

  • create node;
  • connect port;
  • set format;
  • allocate buffer pool;
  • start/stop stream;
  • set deadline and latency policy.

Graph data uses MemoryObject pools and notification/futex wakeups. Audio frames carry:

sequence
capture_time_ns
playback_time_ns
deadline_ns
format
offset
length
flags

The realtime data path should not perform allocation, blocking IPC, logging, permission checks, provider credential work, or graph mutation. Those remain control-plane operations. Any bridge that crosses process, clock, network, provider, or browser boundaries must declare its extra latency so the graph can report the full stack rather than burying delay in queues. A non-guaranteed bridge must not backpressure a guaranteed island; it must drop, silence, bypass, stop, or renegotiate.

WebShellGateway Modes

Gateway-Mediated Provider Session

flowchart LR
    Browser[Browser] <--> Gateway[WebShellGateway]
    Gateway <--> Adapter[ProviderAdapter]
    Adapter <--> Provider[Provider API]

Properties:

  • provider long-lived credentials remain server-side;
  • tool-call events remain server-side unless explicitly proxied to a browser agent UI under broker policy;
  • gateway can record/drop/rate-limit media;
  • easier audit and teardown;
  • higher latency because audio crosses the gateway.

This is the baseline mode.

Direct Browser Provider Media

flowchart LR
    Browser[Browser] <--> Provider[Provider API]
    Browser <--> Gateway[WebShellGateway control/audit path]

Properties:

  • lower media latency;
  • browser receives provider-specific ephemeral credential;
  • gateway may not see every media frame or provider control event;
  • allowed only when broker policy says direct media is acceptable;
  • provider tool declarations are disabled unless either a trusted server-side control channel handles tool calls and results, or the session is explicitly in browser-agent UI mode and every tool call is routed through WebShellGateway’s server-side tool proxy.

Direct mode requires:

  • provider token scoped to model/config/session;
  • short expiration;
  • no capOS capability material in the token;
  • provider tools disabled, provider-supported server-side receipt of tool calls plus server-side submission of tool results, or browser-agent UI mode where JavaScript receives provider tool calls but can only send structured ToolRequest values to WebShellGateway;
  • trusted revocation or session close path; if the provider exposes only a browser-held connection, the kill switch is best-effort and must not be described as authoritative;
  • audit that records direct-media mode, token issuance metadata, disabled tool status, and any uninspected media/control scope;
  • fallback to gateway-mediated mode.

Browser Agent UI Direct Provider Session

This mode is distinct from merely moving media off the gateway. The browser agent is the UI: it owns the visible conversation, calls the realtime provider with an ephemeral credential, receives provider tool-call events, and feeds tool results back to the provider. It still does not receive capOS caps.

flowchart LR
    BrowserAgent[Browser Agent UI] <--> Provider[Provider API]
    BrowserAgent -->|ToolRequest| Gateway[WebShellGateway ToolProxy]
    Gateway --> Broker[AuthorityBroker]
    Gateway --> Tools[Session tool caps]
    Gateway --> Audit[AuditLog]
    Gateway -. "ToolResult" .-> BrowserAgent

Rules:

  • the browser credential is scoped to provider, model/config, session, conversation, media mode, and short expiration;
  • the gateway publishes a signed or MACed tool descriptor snapshot for the current turn;
  • browser tool requests must carry the descriptor snapshot id, provider call id, conversation id, turn id, and typed arguments;
  • gateway rejects stale snapshots, replay, unknown tools, schema mismatches, missing consent, missing step-up, and requests after session teardown;
  • gateway performs all real capOS capability invocations server-side and records that the request was browser-agent-proposed;
  • broker policy may deny browser-agent UI mode when prompt, transcript, media, or tool-result confidentiality requires capOS-side provider mediation.

This is lower latency and can use provider-native browser APIs, but it gives up gateway inspection of some media/control frames. Audit must record that fact instead of implying full gateway mediation.

Realtime Provider Adapter

A provider adapter is a normal service process. It should expose RealtimeModel, not provider-specific credentials.

OpenAI adapter:

  • uses WebRTC for browser direct mode or WebSocket for server-side mode;
  • maps provider function-call events either to server-side capOS RealtimeToolCall values or to browser-agent ToolRequest forwarding;
  • maps function_call_output to RealtimeToolResult;
  • handles response cancellation and output-audio truncation.

Gemini developer adapter:

  • uses Live API WebSocket;
  • supports ephemeral-token direct mode when broker policy allows;
  • maps FunctionResponse to RealtimeToolResult;
  • models synchronous and non-blocking function-call behavior explicitly.

Vertex adapter:

  • uses cloud auth and Vertex AI Live API;
  • exposes deployment metadata such as project/location/model id;
  • respects enterprise logging, quota, and provisioned-throughput policy;
  • should not leak Google credentials to browser or shell.

Local adapter:

  • may start as ASR plus text model plus TTS;
  • can later become native realtime audio if a local model supports it;
  • keeps all media on-device and is the correct anonymous/guest fallback.

Scheduling And Deadlines

Web shell and remote-provider voice need bounded soft realtime. Native local voice can use guaranteed realtime islands once scheduling contexts exist:

  • Capture frames older than their deadline should be dropped.
  • Playback frames that miss the output deadline should be skipped or replaced with silence.
  • Barge-in should cancel model output promptly.
  • Tool calls should not block capture/playback loops.
  • The terminal path must remain responsive under model or provider stalls.

Future scheduling contexts should represent:

voice-capture budget/period
provider-adapter budget/period
agent-runner interactive priority
playback budget/period

SQE-level deadlines are useful metadata for stale request handling, but they do not create CPU budget. A provider adapter may reject or drop stale media frames using deadlines before the scheduler grows true budget enforcement. Native media graph scheduling should eventually map graph quantum to scheduling period and per-node CPU budget. Web shell and remote providers cannot provide a capOS guarantee across the full path, so their jitter must be measured and surfaced separately from the local guaranteed island latency.

The general realtime scheduling model is tracked in Tickless and Realtime Scheduling: SQE.deadline_ns is request freshness metadata for stale frame/tool handling, while SchedulingContext carries CPU-time authority and RealtimeIsland admits the local media graph. Voice paths must not treat deadline metadata as a budget reservation.

Voice can participate in consent UX, but it is not sufficient for strong authorization.

Rules:

  • Read-only tools may run automatically if broker policy allows.
  • Mutating tools need explicit consent; spoken “yes” can satisfy only low-risk consent when the user is already authenticated and the prompt context is active.
  • Destructive tools require stepUp; WebAuthn/passkey is the likely web-shell path.
  • Wake words, speaker identity estimates, VAD, and ASR confidence are never authentication factors.
  • The spoken confirmation transcript and confidence are audit data.

Security Invariants

  • Browser never receives capOS caps.
  • Model services never receive session caps.
  • Provider adapters never receive broad process-spawn or terminal authority.
  • Free-form model text and speech are never parsed as commands.
  • Tool calls are structured values and must match advertised descriptors.
  • Provider credentials are caps or service-private secrets, never transcript text or terminal output.
  • Browser-held provider credentials are short-lived, provider-scoped, and contain no capOS capability material.
  • Voice transcripts are untrusted user input until the runner or gateway accepts them.
  • Prompt-injection rules from the text agent apply unchanged to transcripts, web results, tool results, and model-generated speech.
  • On logout, tab close, timeout, shell exit, or failed auth, the gateway closes terminal, voice, pending tool consent, and server-side model streams. For browser-held provider sessions, gateway teardown authoritatively ends capOS tool execution and rejects future tool requests; provider session revocation is authoritative only when the provider exposes a server-side close API, otherwise it is best-effort and must be audited as such.

Interaction Examples

Low-Risk Read

user speaks: "what services are running?"
model emits tool_call(systemStatus.list, {})
runner policy: auto
runner executes status cap
runner sends tool result
model speaks summary and emits text transcript

Mutating Action

user speaks: "restart the network stack"
model emits tool_call(service.restart, {"name":"net-stack"})
runner policy: consent
gateway renders and speaks confirmation prompt
user says: "yes"
runner executes restart
runner audits transcript, consent, tool args, result
model speaks outcome

Barge-In

model speaking long answer
user starts speaking
VoiceSession emits bargeIn
runner cancels provider output
provider adapter truncates unplayed audio if supported
new user audio starts a new turn

Implementation Sequence

  1. Document and freeze RealtimeModelSession and VoiceSession schemas.
  2. Add a fake local provider adapter using text-only model responses and synthetic audio events so the shell/gateway state machine can be tested without provider credentials.
  3. Extend WebShellGateway protocol with a voice side channel and lifecycle events, still with no direct provider media.
  4. Implement chained local ASR/text/TTS adapter or browser-ASR demo shim for the first visible voice shell proof.
  5. Add provider adapter for one remote realtime API behind broker-issued model caps and server-side credentials.
  6. Add direct browser provider media only after ephemeral-token minting, teardown, and audit are proven in gateway-mediated mode.
  7. Add browser-agent UI mode after the WebShellGateway tool proxy can bind descriptor snapshots, enforce consent/step-up server-side, reject replay, and audit browser-agent-proposed tool requests.
  8. Add media-ring deadlines and underrun/drop telemetry.
  9. Later, bind media and provider loops to scheduling contexts once scheduler policy exists.

Open Questions

  • Does VoiceSession belong to the terminal host family or the media graph service family?
  • Should provider adapters expose raw provider events for diagnostics behind a privileged debug cap?
  • Should a model be allowed to continue speaking while a non-blocking tool is pending, or should capOS pause speech at every tool-call boundary by default?
  • How should cross-provider tool-call deltas be normalized when providers emit partial arguments differently?
  • Which mode is acceptable for operator web shell by default: gateway-mediated, direct provider media, browser-agent UI, or broker policy dependent?
  • Should model-output audio be stored in audit, summarized, or only referenced by transcript and provider event ids?
  • How should media graph buffer quotas interact with session quotas and future resource donation?

Relationship To Existing Proposals

  • Language Models and Agent Runtime: this proposal is the realtime multimodal sibling of the text LanguageModel / AgentSession interfaces defined there. RealtimeModelSession plugs into the same agent runner, reuses the same ToolDescriptor / AuthorityBroker / AuditLog boundary, and follows the same browser-agent UI versus gateway-enforced tool execution split. The per-tool permission modes (auto / consent / stepUp / forbidden) defined for the text agent apply unchanged here; voice does not introduce a new authority layer.
  • Native Shell and POSIX Shell: capos-shell in agent mode is the trusted runner referenced throughout this proposal. It holds the session caps, exposes typed ToolDescriptor values to RealtimeModelSession, executes admitted tool calls, and stays the authority surface for capOS-side voice agents. Browser-agent UI mode does not replace it; it proxies through WebShellGateway back into the same shell-owned authority.
  • Chat As Multimedia Substrate: the operator-facing voice surface (operator talks to a running agent; agent speaks back) is a Chat channel with audio subscriptions. VoiceSession becomes the browser-side adapter that maps WebRTC audio tracks into Chat audio subscriptions; RealtimeModelSession stays as defined here for the agent-runtime ↔ provider link; the agent runner bridges the two. Chat is the operator-visible transport; this proposal defines the model-side session that consumes/produces media through it.
  • Multimedia Pipeline Latency: gives the local media graph its guaranteed-stable latency goal, realtime-island admission model, PipeWire/JACK grounding, and telemetry requirements.
  • Boot to Shell: WebShellGateway remains the web entry point and session authority boundary.
  • Interactive Command Surfaces: voice transcripts can invoke command sessions only through typed command descriptors, not free-form shell text.
  • Browser/WASM: direct browser media and browser-agent UI resemble the existing host-backed capability pattern, but real capOS tool execution must remain gateway-mediated.
  • GPU Capability: local realtime models may later need GPU/NPU sessions, but the interface should not expose accelerator details to agent-shell.
  • Formal MAC/MIC: remote realtime provider use must be denied when session confidentiality labels forbid off-device media.

References

Proposal: Aurelian Frontier

Design for the Aurelian Frontier game: a capability-native, persistent-world RPG set on the imperial frontier of an original late-imperial fantasy setting. The current shell-spawned adventure-client artifact remains the deterministic proof slice for the capability system; this proposal describes the game it grows into. Both purposes coexist: the QEMU smoke transcript stays stable, and the design supports a long-running campaign with authoritative shared world state, multiplayer parties, durable profiles, and audited public history.

Current State

The existing artifact proves useful plumbing:

  • capos-shell launches adventure-client as an ordinary child process.
  • The shell grants explicit StdIO, Adventure, and Chat endpoint clients.
  • adventure-server owns per-player room and inventory state keyed by the endpoint caller-session scoped reference and epoch; normal shell launch syntax omits legacy receiver selectors, and remaining explicit selector fixtures are not the adventure player identity model.
  • chat-server carries room-local events and simple NPC process output.
  • Focused make run-adventure transcript coverage proves launch, movement, item pickup, inventory, chat, and process exit; the resident adventure-scenario-test process covers complex custody logic through direct Adventure cap calls.

As a game, the playable surface is still narrow: one mission, one party of named actors, one tactical encounter. The Aurelian frontier map has replaced the four-room cellar prototype. The first mission recovers eagle-standard from signal_tower, uses Maro route evidence, Livia ward delegation, survivor evacuation, Iunia witness-certified custody, and gate sealing, and keeps the read-only site graph, mission text, aliases, objectives, metadata, and proof path in CUE-authored content with checked-in generated Rust output and freshness verification. Inventory and status now distinguish physical items, writs, relic custody, marks, evidence, generated calendar/regional/construction metadata, and disabled-by-default optional fake-agent NPC budget metadata. Commit 4045576 at 2026-04-30 08:56 UTC added generated calendar event metadata for an active lantern-vigil festival and a later road-muster military event, surfaced as status metadata only; actor movement, event-driven shop mutation, witness blocking, route mutation, debrief branching, quests, gifts, and affection remain future work. Direct NPC game authority remains future work. Commit 64933131 at 2026-04-30 13:09 UTC added the first bounded seasonal shop-stock mutation: post-debrief quartermaster field-rations buys spend audited Aurelian standing, record service-owned per-expedition seasonal stock usage, add the ration to inventory, and stay bounded by pure active-stock, standing-gate, remaining-stock, and depletion checks. Room chat history is service persistence rather than player-owned adventure state. The agent NPC foundation is deterministic quota/refusal metadata and pure fake-model logic only. Commit c6d887 at 2026-04-30 08:22 UTC extended that fake-agent surface to personal routines, nonbinding shop negotiation flavor, and festival reactions as dialogue/proposed-action data with no authority mutation. Live LLM calls, hosted-agent services, durable NPC memory, autonomous NPC actions, trade commits, festival rewards, and quest mutation are not implemented. Commit 6605ee6a at 2026-04-30 13:39 UTC added the bounded regional market delivery proof: fresh committed field-ration receipts deliver the committed quantity into player expedition inventory, while commit replay and errors do not duplicate delivery; NPC stores, outpost stock, currency, durable ledgers, profile balances, and crash recovery remain future work. Commit b1c98eb1 at 2026-04-30 14:15 UTC bounded ordinary inventory admission for room takes, seasonal harvests, quartermaster field-ration purchases, and regional market delivery; regional delivery now fails closed when the full committed quantity cannot fit and stays replayable after ordinary items are dropped. Commit f06aa732 at 2026-04-30 14:51 UTC kept that capacity replay proof on authored/generated resources, set the current ordinary inventory capacity to six slots, kept transfer on the same capacity helper, and proved held regional delivery plus later full replay delivery through the real scenario process. Commit fd432147 at 2026-04-30 15:14 UTC added the bounded player-local currency side of that regional market proof: fresh committed field-ration buys spend two Aurelian chits exactly once, insufficient balances are denied before transaction mutation, inventory shows the player-local chit balance, and held delivery replay does not spend again. NPC stores, outpost stock, durable currency ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work. Commit 7a9a4af5 at 2026-04-30 15:53 UTC added the bounded seller-outpost side of that regional market proof: fresh committed field-ration buys decrement ash_farm stock from six to two exactly once, insufficient seller stock is denied before transaction, currency, delivery, or stock mutation, status shows the stock line, and committed replay plus held delivery replay do not decrement again. NPC stores, broader outpost inventories, durable stock ledgers, durable currency ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work. Commit 00b18598 at 2026-04-30 16:23 UTC added bounded service-owned regional market fee accrual for the same proof: fresh committed field-ration buys accrue the generated buy and sell order fees into a regional-market pool exactly once, status shows the fee line, release/no-cross and non-ration facts do not accrue fees, and committed replay plus held delivery replay do not accrue again. Commit bdcc23ed at 2026-04-30 16:57 UTC added bounded service-owned seller-outpost proceeds for the same proof: fresh committed field-ration buys credit ash_farm two proceeds chits exactly once, status shows the seller proceeds line, release/no-cross, stale, mismatched, and non-ration facts do not credit proceeds, and committed replay plus held delivery replay do not credit again. NPC stores, broader outpost inventories, durable stock and currency ledgers, durable seller-proceeds ledgers, profile balances, durable fee ledgers, expiry advancement, and crash recovery remain future work. Commit 29c065a9 at 2026-04-30 17:41 UTC added bounded regional market order expiry for live matching and reserve. The fixed-smoke day keeps the field-ration proof active, while the real scenario process proves a day-73 expired field-ration reserve releases without status, inventory, currency, outpost stock, fee, seller-proceeds, or delivery mutation. Durable calendar advancement, durable order books, profile ledgers, durable fee ledgers, and crash recovery remain future work. Commit 205fd6a0 at 2026-04-30 18:40 UTC added bounded service-owned regional-market fee withdrawal for the same proof. adventure-content owns the deterministic withdrawal resolver from current fee pool, applied withdrawal ids, and treasury balance; adventure-server owns the live fee pool, applied withdrawal ids, and treasury balance in PlayerState; and the real scenario process proves sell withdraw-fees to regional-market moves the two accrued fee chits once, replays without withdrawing twice, and leaves inventory, currency, outpost stock, seller proceeds, and delivery state untouched. Commit a547db3d at 2026-04-30 19:43 UTC added the bounded regional-market receipt snapshot/restore proof: adventure-content reconstructs RegionalMarketTransactionState from ordered receipt facts with capacity, sequence, malformed reserved fact, missing reservation, mismatched terminal fact, and overlapping open-reservation rejections; adventure-server exposes buy receipt-snapshot from regional-market to clone live receipts, restore a separate state, replay the old field-ration commit, and prove replay success without live market, inventory, fee, treasury, seller-proceeds, stock, or delivery-id mutation. This is not durable restart loading or a general persistence layer. Commit 4b44b32 at 2026-04-30 20:07 UTC added the bounded regional-market settlement snapshot-view proof. adventure-content checks the applied delivery, currency debit, outpost stock decrement, fee accrual, fee withdrawal, and seller proceeds ids plus the settlement balances, rejects over-capacity id snapshots, and replays the committed field-ration fact and fee withdrawal as already applied. adventure-server exposes buy settlement-snapshot from regional-market, and the scenario proof verifies the success text without live status or inventory mutation. This is still a bounded crash-recovery primitive, not durable restart loading or broad economy persistence. The bounded construction-job receipt snapshot branch is scoped to pure Rust construction receipt snapshot semantics plus a size-constrained QEMU no-mutation probe. Pure adventure-content tests reconstruct a separate ConstructionJobState from ordered construction facts, reject over-capacity, out-of-order, malformed reservation, missing-terminal, mismatched-terminal, overlapping-open, and non-closed snapshot shapes, and preserve standalone released facts from failed material reservations as closed replayable outcomes. After the old completed field-repair job, the field engineer’s repair receipt-snapshot command only checks status/inventory stability and confirms live construction state and material stock are unchanged. The runtime command is not a proof that receipts replay into the live construction service, and this is not durable restart loading or a general construction persistence layer. Commit f149119 at 2026-04-29 09:09 UTC landed the pure targeted-combat foundation in adventure-content: deterministic combat zones, damage kinds, attack and mob profiles, bounded zone damage, fatigue, interruption, recognition, and alert propagation. The Aurelian adventure server now consumes generated mob combat profiles, and the client, scenario proof, and QEMU smoke exercise explicit target-zone attack, skill, and cast commands. Commit f4a7fdb at 2026-04-29 18:07 UTC landed the first bounded authority-combat verb: challenge-authority and the text alias challenge authority <target> let an accepted ward-writ attack the ward-wraith’s hostile ward authority instead of hp, with real scenario and shell-smoke coverage for wrong-target, missing-authority, success, and alias paths. Durable alert groups, broader authority-combat verbs beyond that first ward-wraith slice, and broad weapon handling remain open.

Subsequent phases keep the same capability architecture and grow the playable surface: more missions, more sites, durable profiles, persistent shared world state, party multiplayer, lawful PvP, an authoritative ledger of public history, and a richer presentation client. The deterministic transcript stays the proof gate; the game is what those transcripts increasingly exercise.

Setting

Working title: The Ninth Gate of Aurelian.

The game is set in an original late-imperial fantasy frontier: a Roman-like empire holds a fortified border against a hostile magical domain beyond a chain of unstable gates. Its frontier forces are mixed cohorts of shield soldiers, oath-bound magical warriors, field wizards, scouts, engineers, priests, and contracted hunters. Their authority is formal, audited, and revocable: orders, route rights, gate keys, supply access, spell licenses, and relic custody are all concrete grants.

The setting should emphasize duty, command culture, dangerous expeditionary work, practical battlefield magic, and rivalry between imperial command, temple witnesses, guild contractors, and licensed war mages. Those are broad genre ingredients. The capOS version must keep its empire, factions, locations, artifacts, NPCs, magic vocabulary, and plot original to capOS.

Core Fantasy

The player is a junior imperial operator assigned to a frontier gate-fort after a failed expedition. They are not a chosen-one archmage. They are useful because they can hold and delegate unusual authority safely: command passes, ward keys, evidence seals, squad orders, relic handles, and restricted gate routes.

The fantasy is not “collect three objects in four rooms” and it is not permission management with fantasy labels. The target fantasy is:

I gain rare authority, use it to enter forbidden places, command trusted
agents, bind dangerous relics, expose corrupt actors, and reshape the
frontier.

That means:

  • receive a narrow mission grant,
  • choose writs, companions, relics, and supplies,
  • inspect a dangerous frontier site,
  • discover conflicts between military, temple, guild, rebel, and wizard authorities,
  • fight, negotiate, delegate, expose, or revoke instead of only attacking,
  • route capabilities to allies who can act on them,
  • survive magical incidents with limited tools,
  • return with proof, prisoners, recovered relics, or a sealed gate.

That maps cleanly to capOS. Every interesting magical or political permission can be represented as a capability, and the player experience can expose the same idea through in-world language.

The game shape is a compact expedition RPG first:

accept mission
choose writs / companions / relics
enter dangerous site
discover authority conflicts
fight / negotiate / delegate / revoke
extract with loot, survivors, evidence, or consequences
upgrade rank, base, companions, and future authority

Every major RPG system should answer one question: how does this change what the player is lawfully, socially, or supernaturally able to do? If a feature only adds generic RPG numbers, cut it or make it authority-native.

Design Goals

  • Make the first ten minutes of any session engaging, and leave room for a long-running campaign across many sessions, profiles, and parties.
  • Keep all game verbs backed by typed service calls.
  • Use NPCs as processes with explicit caps, not scripted text pasted into the client.
  • Make authority visible: the player should understand what they are allowed to do and why.
  • Make revocation and delegation part of play without turning the UI into an OS lecture.
  • Make the next useful command discoverable from room text, status text, NPC advice, or command completions instead of requiring source reading.
  • Keep exact object and actor ids player-facing and stable, with aliases only as convenience paths to the canonical ids.
  • Preserve deterministic QEMU transcript coverage for proof slices, even as the wider game grows seeded variation, persistent world state, and multiplayer.
  • Demonstrate that the capability model is usable from more than one application language. Rust and Lua game code should both operate through typed caps; neither language should receive ambient authority.
  • Treat the persistent shared world as a real product surface: profiles, ledgers, expedition checkpoints, faction history, market state, and contributor evidence all live in capability-bounded services with authoritative server ownership.

Non-Goals

  • A parser-combinator text adventure.
  • Random combat outcomes inside the deterministic QEMU smoke proof. Variance in normal play is fine; transcript-critical paths must remain reproducible under a fixed mission seed.
  • Copyright-compatible retelling of any source novel.
  • Kernel-side game state. The kernel enforces capability authority; mission, profile, ledger, and world state belong in userspace services.
  • A user-owned save blob as authority over public world facts. User-owned Drive/Firebase capsules may back up private profile and explicit expedition state, but ledger records, multiplayer outcomes, market receipts, and contributor rewards remain server-authoritative.

Intuitiveness Layer

The current artifact exposes the right command categories but makes the player do too much vocabulary discovery. The next phase should treat every room view, inspection result, and failure message as part of the control surface.

Principles:

  • Always print canonical ids for objects, actors, mobs, writs, and exits.
  • Accept forgiving aliases such as livia, Livia, and magister only after resolving them to one canonical id in the response text.
  • When a command nearly matches a known id, return a suggestion: No broker offers ward here. Did you mean request ward-writ?
  • When an actor cannot execute an order, list one or two valid tasks if the player has enough information: Livia cannot execute guard. With ward-writ delegated, try order livia to dispel-sigil.
  • look should show the current objective, visible interactables, present actors, hostile mobs, and one short “lead” line when the player is stuck.
  • status should separate survival state from mission state so players can scan it quickly:
Status: hp 12/14, guard 5, fatigue 0
Mission: expose the tower sigil, defeat ward-wraith, seal gate
Held: ward-writ accepted
Delegated: ward-writ -> livia
Lead: Livia needs line of sight; stand in the signal tower and order dispel-sigil.

Failure text should preserve causality. A refusal should say whether the missing piece is location, knowledge, authority, inventory, rank, cooldown, or target state. That makes the game feel fair and also teaches the capability model without naming kernel concepts.

Denial should usually reward the player with a lead. A blocked action can reveal a missing witness, unknown jurisdiction, forged seal, rival grant, corrupt actor, unsafe state, rank gate, or alternate route:

The gate refuses your route grant. It names the tower road, not the aqueduct.
Mira notices an old witness mark beside the lock.

Secrets and incomplete jurisdiction knowledge are core RPG fuel. The player should often discover that they are blocked because they do not yet understand who has authority here, not because they lack a generic key.

The current text parser can implement the first half of this through canonical ids, aliases, and suggestions. The later CommandSession path should expose the same hints as dynamic completions rather than duplicating a parser.

Scripting And NPC Brains

The kernel capability model should enforce authority. The adventure service, Lua scripts, NPC processes, and any later agent runners are ordinary userspace clients of that model. They should hold only the caps their role needs and exercise world mutation through typed service calls.

Rust should remain the default for core service code, bounded simulation, and proof-critical state transitions. Once Lua Scripting exists, Lua is a good fit for deterministic scenario glue: mission beats, dialogue state machines, quest-board text, debrief variants, and scripted NPC reactions. Lua scripts should receive narrow host APIs and object caps; they should not receive raw cap IDs, broad spawn authority, or a way to bypass Rust service validation.

This is useful for the game because it proves the capability model is not a Rust-only convention. A transcript can show a Rust service and a Lua-scripted NPC both using typed authority correctly, including one denied ungranted path.

For NPC behavior that does not need deterministic transcript output, the Language Models and Agent Runtime design is the better fit. LLM-backed NPCs can provide tavern chatter, optional hints, flavor summaries, or reactive dialogue, but their output should be treated as data. They must not decide mission-critical authority, relic custody, combat damage, or policy denials, and they should not sit on the main QEMU proof path unless served by a deterministic stub. The model/embedder/agent-runner capabilities, per-tool permission modes (auto/consent/stepUp/forbidden), and budget plumbing described in Language Models and Agent Runtime are the upstream surface the Phase 11d fake-agent budget metadata foreshadows; the live-agent path must attach those typed caps to an AdventureNpc facet rather than handing ambient LLM authority to a chat process.

Long-lived or agent-controlled NPCs should also inherit the hosted-agent harness constraints from capOS-Hosted Agent Swarms. An NPC agent is a task-like process with a workspace/memory scope, advertised tools, audit, and budget, not an ambient identity. The adventure-specific budget should include per-NPC, per-session, and per-game-day token quotas, tool-call quotas, cooldowns, and model profiles. When quota, fatigue, sleep schedule, or policy blocks an answer, the NPC should refuse in-world, for example: I'm tired. Going to sleep. That refusal is part of gameplay state and audit, not a transport failure. Any memory/reflection output from the agent remains low-authority data until compiled into deterministic content or service-owned state through reviewed rules.

Player Loop

The 30-second loop is one meaningful command and one consequence:

  1. Explore, inspect, fight, negotiate, unlock, delegate, or extract.
  2. Receive at least one result: loot, map knowledge, danger, faction change, clue, shortcut, companion reaction, or new authority state.
  3. Re-evaluate the site with better or worse information.

The 10-minute loop is a writ-backed expedition:

  1. Briefing: an officer grants a mission writ and one restricted gate route.
  2. Preparation: the player chooses route, companion, relic loadout, and helper authorities such as ward-writ, scout-order, medic-token, or relic-seal.
  3. Expedition: the player enters two or three connected frontier locations.
  4. Encounter: a surprise reveals an authority conflict, enemy trick, secret jurisdiction fact, companion risk, or faction demand.
  5. Consequence: NPCs respond, a route opens or closes, and logs/evidence update.
  6. Extraction: the player returns with survivors, relics, evidence, scars, or consequences.

The multi-session loop is frontier reshaping:

  1. Increase rank and unlock new jurisdictions.
  2. Upgrade the base, temple, archive, court, and other authority modules.
  3. Build faction reputation and expose larger conspiracies.
  4. Gain companions, alter their trust and doctrine, and decide who can safely hold delegated power.
  5. Unlock future missions by expanding legal reach, not just damage and health.

Each loop should contain one deliberate choice, one reversible mistake, and one visible consequence. For example, the player can spend coin to buy a safe route, persuade the scout and keep the coin, or skip the route and risk an ambush. The transcript remains deterministic, but the player sees that the world is not a single locked command sequence.

The first implementation can cover one mission:

Mission: recover the missing eagle standard from the ruined signal tower.

Complication:
- the tower gate is unstable,
- a wounded legionary is trapped behind a ward,
- a guild scout wants payment before sharing a safe route,
- a temple witness refuses to certify the relic unless the player has not used
  a forbidden oath rite.

Good outcomes:
- standard recovered,
- survivor evacuated,
- gate sealed,
- scout paid or persuaded,
- temple witness records clean custody.

Narrative Model

The game should read like a compact expedition report unfolding through play, not like a list of room fixtures. The server owns mission state; NPC processes own voice, advice, rumors, and local reactions.

Use mission beats:

  • briefed: Varro states the objective and offers two optional authorities.
  • crossed-gate: the gate opens, chat switches from fort traffic to field traffic, and old fort chatter becomes history.
  • first-contact: the first hostile sign teaches inspect and intent text.
  • complication: a survivor, relic, or blocked route forces a tradeoff.
  • turning-point: the player delegates, spends, steals, persuades, or fights.
  • extraction: the player returns with relics, witnesses, prisoners, or sealed-route proof.
  • debrief: rank marks, faction opinion, and audit records update.

Narrative should be stateful and short. A room description can have variants before and after key facts are known, but it should not become walls of prose:

Ashen Road
Signal ash drifts across the old paving stones. Maro has marked a narrow
ditch-route east.
Exits: west tower-east ditch-east
Actors: maro
Lead: ask maro about route, or inspect ash-tracks.

NPCs should surface stakes. Livia cares about unstable wards, Varro cares about orders and casualties, Iunia cares about custody and forbidden rites, and Maro cares about payment, favors, and survival. Their objections create interesting command choices rather than static lore.

World Model

Replace the current static room list with a small graph of Site records:

site_id
title
description
region
threat_level
exits
visible_items
actors
active_wards
required_route_cap

Keep the first map small:

  • fort_aurelian: command room, quartermaster, temple annex.
  • gate_yard: portal control, squad muster, unstable gate.
  • ashen_road: contested approach with scout and ambush traces.
  • signal_tower: relic objective, wounded soldier, ward puzzle.
  • under_vault: optional dangerous route with an oath-echo hook.

The wider game should grow this into multiple settlements and outposts rather than a single hub with more rooms. fort_aurelian stays the first proof settlement. Later content can add a civilian city, a temple-administered site, a guild waystation, and resource-producing outposts such as mines, farms, timber camps, shrines, salvage yards, gate-yards, and repair yards. Routes between them carry distance, hazard, faction control, seasonal closure, cargo-limit, and authority metadata. Outposts produce bounded resources and consume supplies through service-owned state; user save capsules can back up private profile data but cannot invent public production or market facts.

Items should become capabilities or evidence, not inert nouns:

  • eagle_standard: relic custody cap; proves mission objective.
  • ward_writ: authority to request ward changes, but logs every use.
  • scout_marker: grants access to hidden route hints.
  • oath_echo: one-use inherited rite; powerful but politically risky.
  • temple_seal: certifies clean custody if conditions are met.

Actors

NPCs should be separate processes where possible:

  • Centurion Varro: mission issuer, grants route and squad authority.
  • Magister Livia: battlefield wizard, can identify ward failures.
  • Acolyte Iunia: temple witness, audits relic custody and forbidden magic.
  • Maro the Guild Scout: knows routes, trades in favors, can withhold help.
  • Wounded Legionary: rescue objective and source of battlefield facts.
  • Gate Echo: hostile magical presence that can corrupt routes or chat.

Each actor should own only the caps that fit its role. For example, the scout does not get relic custody, and the temple witness does not get squad command.

Mechanics

Authority As Inventory

The current inventory is just strings. Split the player-facing inventory view into:

  • items: physical objects visible in rooms,
  • writs: mission and faction permissions,
  • relics: dangerous objects with custody rules,
  • marks: progression/rank state,
  • evidence: facts or signed observations.

Player-facing commands can still say inventory, but output should show why each entry matters:

Writs:
  gate-route: tower approach, expires after return
  ward-writ: request imperial ward changes, audited
Relics:
  none
Evidence:
  broken sigil: tower ward failed from inside

This view must not imply that all entries are picked up with take. Physical objects use take and drop. Authorities use explicit grant, delegation, or custody verbs:

  • request <writ> asks an actor or broker for a grant.
  • accept <writ> receives a grant already offered by an actor.
  • delegate <writ> to <actor> grants a scoped child or NPC authority.
  • revoke <writ> withdraws authority when the holder allows it.

Failure text should distinguish object state from authority state:

You can see the southern ward, but your mission writ names only the tower gate.
Centurion Varro has not offered squad command authority.
The relic can be carried only after a temple witness seals custody.

Writs As Loot

Writs are RPG loot, not boring quest permissions. A writ is gear, skill tree, access key, and social status at the same time:

RPG conceptAurelian version
Weaponcombat writ, relic mandate, dueling license
Armorward writ, sanctuary bond, witness shield
Skilldelegation pattern, seal-breaking rite, custody transfer
Keyroute grant, gate mark, archive token
Reputationrank seal, faction trust, lawful standing
Cursecorrupted grant, forged writ, hostile obligation
Legendary itemancient relic with dangerous authority

A good writ should feel like “this changes what I can do,” not “this allows the next quest step.” Writs can carry bounded affixes and drawbacks:

Route Grant of Urgency
- Allows passage through old aqueduct
- Expires after 40 turns
- Cannot be delegated
- +1 faction trust if survivor extracted

Custody Writ of Burden
- Allows carrying sealed relic
- Reduces combat initiative
- Requires witness before transfer
- Breaking custody causes temple penalty

The modifier set must remain deterministic under the mission seed. A writ must make issuer, scope, expiry, allowed verbs, delegation rules, drawbacks, and revocation conditions inspectable.

Authority Archetypes

Classes are authority archetypes, not generic stat packages:

ArchetypeLegal, social, and supernatural power
Wardenprotects people, escorts survivors, creates safe routes, specializes in wards and evacuation
Marshalenforces law, arrests hostile agents, handles bounties, duels, raids, and frontier justice
Archivistfinds hidden evidence, preserves witness chains, decodes old grants, detects forged authority
Custodianhandles relics, dangerous artifacts, sealed rooms, containment failures, and temple politics
Factorcontrols logistics, markets, supply lines, caravan permissions, construction, and regional influence
Heretic/Renegadeuses forbidden authority faster while risking corruption, exile, unreliable witnesses, and hostile audits

Two archetypes may share combat numbers, but they should solve blocked situations with different verbs. A Warden invokes sanctuary, a Factor proves a supply right, an Archivist exposes the old seal, and a Heretic breaks the seal at a cost.

Delegation Buildcraft

Delegation is buildcraft. Companions are fallible agents, not portable stat bonuses:

TraitGameplay effect
Loyaltyobeys the spirit versus the letter of a writ
Ambitionmay exploit broad authority for personal goals
Competencehandles dangerous grants safely
Reputationaffects faction trust when holding delegated power
Fearmay abandon delegated duty under pressure
Doctrineinterprets ambiguous orders through law, temple rule, guild practice, or renegade code

The player’s build is partly deciding which powers to keep, which to delegate, and whom to trust. Carrying a relic personally may block entry to a polluted shrine. Delegating custody may free the player to act, but the companion can be bribed, frightened, corrupted, or forced to testify later. The service owns the deterministic result and prints the cause.

Item Use

Add explicit verbs:

  • use <thing>
  • give <thing> to <actor>
  • ask <actor> about <topic>
  • inspect <thing>
  • seal <site|gate|relic>
  • request <writ>
  • accept <writ>
  • delegate <writ> to <actor>
  • revoke <writ>
  • order <actor> to <task>

The service validates authority and state. Invalid actions should return specific text, not just the unchanged room.

Progression

Use a small rank model:

  • tiro: recruit/operator, tutorial grants only.
  • signifer: can carry relic custody.
  • centurion: can issue squad orders.
  • legate: future high-authority profile.

This is not a stats grind. Rank changes which capabilities the broker may grant in later missions. Magic progression can mirror this with circles:

  • first circle: detect wards,
  • second circle: reinforce shields,
  • third circle: stabilize a minor gate.

Progression should unlock reach, not only power:

Can issue temporary ward writs
Can enter disputed shrines
Can appoint one field deputy
Can challenge forged grants
Can hold two relics in custody
Can revoke delegated authority remotely
Can negotiate with hostile jurisdictions
Can operate without local witness once per mission

Base modules should also create new verbs, not passive bonuses:

ModuleFunction
Archivestores evidence, unlocks old maps, verifies claims, exposes forged records
Temple vaultstores relics, enables custody upgrades, binds dangerous artifacts
Barrackstrains deputies and companions, improves command delegation
Courtresolves disputes, converts evidence into rank, revokes corrupt grants
Market halltrades supplies and regional favors, supports escrow for ordinary goods
Signal towerextends remote revocation and delegation range, reveals route hazards
Sanctuaryprotects rescued NPCs and creates story consequences

Combat

Combat should exist, but it should stay tactical and bounded. The first interesting version does not need a full roguelike engine, but it does need enemies, danger, skills, spells, and readable outcomes.

Combat is turn-based at the command level. A fight starts when the player enters a hostile site, triggers a ward, fails a negotiation, or chooses to engage. Each turn the player picks one action, allied NPCs act if present, then hostile mobs act. The transcript should stay deterministic for smoke coverage. Combat should attack authority as well as HP. The distinctive tactical question is whether the player can keep lawful control while under pressure.

Enemy types should interact with authority directly:

EnemyThreat
Forgercreates fake writs and causes false accusations
Null-priestdisables local grants or sanctuary bonds
Bandit captainsteals custody tokens or route proofs
Corrupt magistraterevokes or contests authority mid-mission
Wraithignores physical defenses but obeys old seals
Spylearns route grants and ambushes exits
Oathbreakerturns delegated powers against the player

Good tactical verbs include:

inspect seal
challenge authority
bind relic
revoke grant
delegate ward to Mira
force witness
seal exit
expose forgery
claim custody
invoke sanctuary

Later combat can borrow a narrow set of Evil Islands-style tactical mechanics without inheriting its real-time randomness or punitive retreat traps. The grounding is recorded in Game Mechanics Prior Art. The useful ideas are visible preparation, careful fight selection, body-zone targeting, damage-type and armor interaction, stealth openings, and cast-time risk. Aurelian should translate those into deterministic command outcomes: scouting reveals threat and intent, inspected enemies expose vulnerable zones, weapons and spells target bounded zones, and failed positioning has explicit costs.

Player actions:

  • shield a wounded soldier with ward-writ,
  • call a legionary NPC if holding squad-order,
  • use oath_echo to bypass a lock at political cost,
  • attack with a weapon skill,
  • cast a prepared spell,
  • guard, retreat, seal a route, or order an ally.

This gives the player real decisions without turning combat into repeated attack commands.

Targeted attacks should stay small and readable:

attack ward-wraith head with spear
attack imp-scout legs
cast ember-dart at ghoul hands

Zone effects are deterministic and bounded:

  • head: harder to land, can increase critical or disruption outcomes;
  • hands: can reduce attack cadence, weapon use, or casting stability;
  • legs: can slow pursuit, block retreat prevention, or weaken charge intent;
  • core: the default reliable target, lower risk and lower swing.

Damage and mitigation should consider weapon type, spell type, zone armor, ward state, and inspected knowledge. A spear against a lightly armored weak point, a mace against armored limbs, or a ward spell against a revealed sigil should feel different in transcript text and outcome. The service still owns the exact result; clients do not roll hidden dice.

Mobs

Mobs should be small state machines owned by the adventure service or by separate actor processes once that split is useful. Initial mob types:

  • imp-scout: weak, fast, tries to flee and report.
  • ash-ghoul: slow melee enemy, punishes unguarded players.
  • ward-wraith: ignores ordinary weapons until a ward is inspected or broken.
  • gate-hound: blocks retreat unless stunned or distracted.
  • echo-centurion: elite magical-warrior enemy used as a mission boss.

Each mob has:

name
threat_level
hp
armor
zone_armor
ward
attack
morale
traits
intent

intent is visible when the player has scout or wizard support:

The gate-hound lowers its head. Intent: lunge at the weakest target.
The ward-wraith gathers blue fire. Intent: break shield next turn.

This makes fights more interesting than hidden dice rolls.

Unknown enemies should not expose full mechanical truth immediately. A scout, wizard, height advantage, prior codex evidence, or an inspection action can upgrade the view from rough threat to exact armor/ward/intent/counter data. That keeps stealth and observation relevant without forcing real-time mouse precision.

Basic Stats

Use a small stat block:

vigor      physical endurance and wound tolerance
discipline morale, command, resistance to fear
edge       weapon accuracy and quick action
ward       magical defense and shield capacity
focus      spell control and ritual stability

Derived values:

hp         8 + vigor * 2
guard      discipline + ward
initiative edge + focus
load       vigor + discipline

The player should see compact status:

Status: hp 12/14, guard 5, fatigue 1
Ranks: warrior 2 stars, wizard 1 circle
Prepared: shield-bind, ember-dart
Writs: tower route, ward-writ

Stats are not the main reward system. They exist so combat and spell choices are legible.

Leveling And Reputation

Progression should reward mission outcomes rather than enemy grinding. A successful debrief can grant:

  • rank marks: unlocks brokered authorities such as relic custody or squad order;
  • warrior stars: unlocks martial and command skills;
  • wizard circles: unlocks prepared spells and restricted ritual caps;
  • faction standing: changes prices, testimony, help, and PvP legal status;
  • codex entries: records inspected wards, mobs, relics, and route hazards.

Progression inputs should be auditable mission facts:

Recovered eagle-standard: +1 imperial standing
Evacuated wounded-legionary: +1 cohort standing
Used oath-echo: +1 breach power, -1 temple standing
Sealed gate with witness present: unlock relic-custody eligibility

Rank is therefore a policy input, not just a number. A player may be strong enough to win a fight but still unable to receive temple-seal authority after abusing forbidden magic.

Stars And Circles

Use separate progression tracks for martial and magical competence.

Warrior stars:

StarsMeaningUnlocks
0civilian or raw recruitflee, guard, basic strike
1trained legionaryshield wall, steady aim
2proven frontier fightercounter, command ally
3veteran signiferrally, hold line, relic carry
4centurion-gradeissue squad order, tactical stance
5heroic championbreak elite guard, inspire cohort

Wizard circles:

CircleMeaningUnlocks
0no formal spell licenseuse charged relics only
1apprentice field magicember dart, detect ward
2battlefield adeptshield-bind, mend wound
3gate specialiststabilize gate, dispel minor ward
4war magedome shield, bind hostile spirit
5archmage authorityrewrite route, seal major breach

Stars and circles are player-facing rank labels, not copied external lore. Implementation may rename them if a stronger capOS-specific vocabulary emerges. Both are capability policy inputs. A 3-star warrior can receive relic custody that a recruit cannot. A 2-circle wizard can receive a ward-writ but not a gate-rewrite cap. The player-facing fiction is rank and training; the system-facing implementation is brokered authority.

Skills And Spells

Skills are martial or command actions:

  • strike: basic weapon attack.
  • guard: reduce incoming damage and protect one ally.
  • shield-wall: requires 1 star and an allied legionary.
  • counter: requires 2 stars; punish a missed melee attack.
  • rally: requires 3 stars; restore morale and clear fear.
  • order: requires appropriate writ; make an allied NPC act now.

Spells are prepared actions with fatigue or reagent costs:

  • ember-dart: 1 circle; reliable ranged damage.
  • detect-ward: 1 circle; reveals ward traits and mob intent.
  • shield-bind: 2 circles; temporary guard bonus.
  • mend-wound: 2 circles; stabilize or heal a wounded target.
  • stabilize-gate: 3 circles; stops gate hazards or opens safe retreat.
  • dome-shield: 4 circles; protects the whole party for one turn.

Forbidden or risky techniques:

  • oath-echo: one-use inherited rite; strong effect, audit cost.
  • demon-brand: hostile shortcut; should exist as a temptation, not an ordinary optimal play path.

Prepared spells should be visible in status. The first mission can grant only ember-dart, detect-ward, and shield-bind.

Fight Commands

Add combat commands:

  • attack <mob>
  • skill <name> [target]
  • cast <spell> [target]
  • guard [ally]
  • order <ally> to <action>
  • retreat
  • status

Example:

[combat:signal_tower]> cast detect-ward wraith
The ward-wraith is bound to the broken tower sigil.
Intent: break shield next turn.

[combat:signal_tower]> order livia to dispel-sigil
Magister Livia spends the ward-writ grant. The sigil cracks.

[combat:signal_tower]> attack wraith
Your gladius bites through the fading ward. 5 damage.

Loot And Equipment

Loot should be sparse, inspectable, and tied to authority. The game should not become a pile of random nouns.

Item categories:

  • supplies: torches, bandages, reagents, gate-stabilizer parts.
  • equipment: gladius, shield, bow, focus ring, warded cloak.
  • relics: eagle standard, sealed tablets, oath-bound cores.
  • evidence: ash traces, broken sigil sketches, witness statements.
  • trade goods: coin, salvage bronze, guild favors, ration chits.

Every loot entry should have at least one of these uses:

  • opens a route,
  • changes a combat choice,
  • helps an actor,
  • sells for a predictable value,
  • acts as evidence in debrief,
  • is dangerous custody with audit consequences.

Equipment can remain simple:

  • weapon: affects attack damage or skill unlocks.
  • shield: affects guard and ally protection.
  • focus: affects spell fatigue and ward inspection.
  • cloak: affects stealth, ambush, or faction recognition.
  • load: caps carried gear before fatigue penalties.

Later equipment should support blueprint/artifact construction without turning the game into unbounded loot rolling. A construction job names a blueprint, materials, location/facility class, rank/star/circle gates, cost, expected duration, and output bounds. The service reserves materials and currency, validates the job, records it, and completes or releases it through the same transaction discipline used by markets. Item properties are derived from the base blueprint, material choices, crafter skill/rank, facility quality, and paid cost. Enchantment is a constrained post-process: object type, enchanter circle, lawful authority, and remaining enchantment slots determine valid results. Artifact-scale outputs include witness-sealed relic cases, warded cloaks, focus rings, route compasses, golem cores, and gate-stabilizer parts. This construction direction is grounded in Game Mechanics Prior Art, especially the Evil Islands and EVE Online notes.

Relics are not ordinary loot. They require custody authority, may be move-only, and should be visible in audit output. Dropping or trading a relic without the right witness should be a meaningful failure.

Buying, Selling, And Logistics

The shopkeeper should become a small economy service rather than ambient chat. The first version can be deterministic and local:

  • buy <item> from <actor>
  • sell <item> to <actor>
  • quote <item> from <actor>
  • trade <item> to <actor> for <item|favor>
  • repair <item> at <actor>

Markets should have roles:

  • quartermaster: sells supplies for ration chits, requires imperial standing;
  • guild scout: sells route hints and contraband for coin or favors;
  • temple annex: certifies relic custody and sells lawful wards;
  • field engineer: repairs gate parts, golems, and damaged equipment.

Prices should be legible and bounded:

Maro offers ditch-route for 1 coin or scout-favor.
Quartermaster refuses focus-ring: requires wizard circle 1.
Iunia will certify eagle-standard only if oath-echo was not used.

Buying and selling maps naturally to capabilities. A shop can only sell what its actor is authorized to transfer; the player can only receive items or writs permitted by rank, faction standing, and mission state. Trade failures should name the blocked authority, not pretend the item is missing.

The regional market target is closer to a brokered order book than a single shop inventory. Market services should define market-eligible item classes, regional buy orders, sell orders, price/time priority, immediate matching when prices cross, expiry, fees, and ordered ledger receipts. Items that are not market-eligible still move through explicit custody, barter, witness, or quest flows. If several services own profile inventory, expedition cargo, outpost stock, or cloud-backed records, the market coordinator needs reserve/escrow, commit/release, stale-version rejection, idempotency keys, cancellation, retry, and crash-recovery behavior before any player-visible two-party exchange is treated as implemented. This market direction is grounded in Game Mechanics Prior Art, especially the EVE Online notes.

Randomization

Randomness should make repeated play feel alive without making QEMU coverage fragile. Use seeded mission variation, not hidden unbounded dice. The legal model remains deterministic and auditable: under the same seed and discovered facts, authority grants, denials, revocations, custody outcomes, and faction consequences must replay exactly.

The mission seed can choose:

  • one of several mob placements,
  • one optional route hazard,
  • one mission complication,
  • one faction demand,
  • one shop inventory variant,
  • one companion behavior pressure,
  • one relic side effect,
  • one enemy authority trick,
  • one optional objective,
  • one loot or writ modifier,
  • one NPC rumor or personality line,
  • one loot cache location,
  • one debrief complication,
  • a calendar state: season, day, weather/hazard class, seasonal resource table, festival/event hook, and routine variant.

The seed should be visible through debug or transcript mode, and smoke tests should pass a fixed seed through the manifest or mission setup:

Mission seed: 0x0000_aure_0009
Variant: ash-ghoul at ashen_road, focus-ring at under_vault

Combat randomness should be constrained by intent text. If an attack can miss, the player should see why through guard, morale, terrain, fatigue, or a mob trait. Critical swings should be optional flavor unless the seed is fixed.

Calendar variation should be similarly explicit. Four 28-day seasons are a reasonable initial model. Seasonal crops, forage, fish, shops, route hazards, and outpost production have bounded availability tables. Ordinary seasonal crops and fragile goods expire or degrade at season change unless the content declares them as multi-season. Festivals and military events can alter actor routines, witness availability, shop stock, quests, gifts, and affection-style standing records, but those effects must be ledger/profile facts rather than client-local counters. This calendar direction is grounded in Game Mechanics Prior Art, especially the Stardew Valley notes.

The implemented foundation keeps this deterministic for the smoke seed: generated mission content carries a fixed season, day, weather, hazard class, bounded seasonal resource records for the proof categories, and fixed-smoke festival/military-event metadata. Status output prints that calendar state and the active event metadata. Production per-run seed selection, gameplay effects from events, actor routine changes, and gameplay consumption/expiry rules remain future work.

Seeded generation should produce explicit world artifacts, not invisible ambient randomness. The stable base game remains authored content: factions, major sites, named relics, law, core routes, capability interfaces, and proof missions. A production world can then select deterministic overlays from a WorldlineSeed:

  • local room/map variants under authored region constraints;
  • optional hazards, mob placement, loot caches, and rumor/debrief variants;
  • seasonal resource tables, route closures, outpost production, and shop stock;
  • festival or military-event schedules;
  • bounded NPC routine variants and non-critical chatter hooks;
  • regional market starting books, subject to service-owned order-book rules.

Every generated artifact should carry enough provenance to be replayed or rejected: content release id, worldline id, seed epoch, generator version, scope label, and bounded output size. Services should persist selected artifacts once admitted so later patches do not silently rewrite active worlds. Smoke runs keep using fixed authored selections until the generator itself has pure tests and a fixed-seed QEMU proof.

Chat And Room Memory

Room chat should become diegetic:

  • room channels are command, scout, temple, and expedition channels,
  • NPC process messages are radio/runner/magic-slate traffic,
  • history replay is labeled as “recent room record” if intentionally shown,
  • private messages require a separate cap or actor relation.

That turns the current chat persistence quirk into a feature.

Multiplayer

Multiplayer should be a first-class capability demonstration, not just several clients sharing one room. The current local foundation keeps per-player state keyed by live endpoint caller-session metadata and assigns service-local player labels such as player-1 for party commands. Those labels are not caller-chosen badges, global principal ids, or portable identity. The adventure service can add explicit shared expedition objects for parties, duels, trades, and contested sites when the single-service state model stops being sufficient.

Co-op mechanics:

  • party create <name> creates a shared expedition with a leader cap.
  • party invite <player> sends a join offer; accepting grants a party member cap with scoped verbs.
  • party delegate <writ> to <player> gives another player a narrow authority, such as route access, relic carry, or squad order.
  • assist <player> with <task> contributes a skill, spell, item, or witness action to another player’s command.
  • sync-turn or mission turn barriers let QEMU scripts prove deterministic multi-client combat without racing terminal input.
  • Split roles make co-op matter: scout reveals intents, wizard handles wards, warrior protects allies, witness certifies relic custody, quartermaster carries supplies.

Co-op failure should be interesting but bounded. A player can waste a shared turn, drop supplies, or revoke a delegated route, but cannot mutate another player’s private inventory without an accepted trade, custody transfer, or party rule.

Accepted trade and custody transfer must be service-mediated state transitions, not independent edits to two player save blobs. If one service owns both inventories, it should perform a single version-checked mutation and emit one ordered receipt. If ownership is split across profile, expedition, market, or cloud-backed stores, the Trade/Market/Expedition coordinator needs an escrow or saga protocol: reserve the item and consideration with idempotency keys, commit or release both sides, record an append-only ledger receipt, and make stale offers, cancellation, retry, and crash recovery explicit. User-owned Drive/Firebase save capsules cannot authorize these transfers; they may only back up the resulting private state after the authoritative receipt exists.

PvP mechanics should be opt-in and lawful in the fiction:

  • duel challenge <player> creates a temporary arena with agreed rules.
  • duel accept <player> grants a duel-combat cap scoped to the arena.
  • spar <player> allows nonlethal training damage and skill practice.
  • contest <site> lets factions compete over a route, relic, or witness record when the mission explicitly allows it.
  • bounty mark <player> can exist only as a future policy-backed authority, never as ambient attack permission.

PvP must not mean “any client can attack any badge.” Harmful verbs require an arena, duel, faction-war, or bounty capability. The service should reject unauthorized attacks with a policy explanation:

No lawful conflict grants target marcus. Challenge a duel or enter the contested yard.

Useful rich PvP/co-op surfaces:

  • shared threat tables where guarding an ally changes mob target intent;
  • formation commands that require two or more players to hold compatible ranks;
  • witness challenges where one party audits another party’s relic custody;
  • contraband markets where guild standing helps one player but hurts temple reputation;
  • route races where two parties can choose negotiation, sabotage, or legal contest depending on granted authorities;
  • post-mission debriefs that record contribution, friendly fire, revocations, trades, and witness disputes.

Architecture:

  • Expedition service owns shared party/site/combat state.
  • Adventure keeps private player profile and inventory state.
  • Chat carries room, party, duel, and faction channels.
  • Trade or Market service owns two-party item/currency exchange, including reserve/escrow, commit/release, stale-offer cleanup, and replay-safe receipts.
  • Audit records contested custody, PvP consent, and debrief evidence.

Near-term multiplayer should use live caller-session keys now and move to broker-granted service facets or service-created player objects once those are available. Manifest-issued receiver selectors are not a temporary Aurelian identity bridge; user-facing shell syntax must not choose or relabel another service identity. A useful QEMU proof uses two player objects or two distinct live caller sessions, one shared party, one delegated ward-writ, and one deterministic assist:

player1: party create tower
player1: party invite player2
player2: party accept tower
player1: delegate ward-writ to player2
player2: assist player1 with detect-ward
player1: attack ward-wraith

That proof is more valuable than adding network transport first because it exercises authority, shared state, and deterministic turn ordering locally.

Keep multiplayer scoped and desirable. Near-term multiplayer is cooperative expedition pressure: shared sites, dangerous relic custody requiring multiple witnesses, player deputies, faction-controlled regions, contested bounties, asynchronous rescue contracts, and public proof of heroic or criminal actions. Open ambient PvP is not the target; harmful verbs require explicit duel, contest, bounty, or faction-war authority.

MMO-scale open economies, broad player construction seasons, LLM-driven mission-critical NPCs, cross-instance federation, and worldline travel are deferred until the compact expedition loop works as a local and cooperative RPG.

Commit 335a9ee at 2026-04-28 22:22 UTC landed the first bounded Phase 12 foundation: the existing Adventure service now owns local party records for party create, party invite, party accept, party leave, party delegate, and assist. Party membership, pending invites, delegated ward-writ, and assist records are deterministic service state keyed by service-local player labels derived from caller-session keys, with transitions routed through the unit-tested adventure-content party state. The initial assist <player> with detect-ward path requires party membership and a matching delegated ward-writ; it does not transfer items, currency, or private inventory authority. A real one-client cap assertion covers the typed party surface, while the two-client proof remains open until the manifest and launcher/session APIs can run two real Adventure clients with distinct live caller-session keys without faking them inside one process.

Commit ac49375 at 2026-04-29 06:43 UTC landed the next bounded Phase 12 transfer foundation and keeps transfer state inside the existing Adventure service. The new typed Adventure.transfer(item, player) path supports transfer <item> to <player> for physical items only, derives both service-local player labels from live caller-session keys, requires shared party membership, refuses relic custody such as eagle-standard, and mutates source/target inventories atomically through unit-tested adventure-content transfer logic. The scenario process asserts one-client refusal paths without synthesizing a second session. Currency escrow, market-scale two-party exchange, and successful two-client QEMU transfer proof remain open.

Parallel Universes And Cross-Instance Worlds

Parallel universes fit Aurelian better as sovereign worldlines than as one shared mutable map. Each capOS instance can host one or more worldline services that use the same content release but different WorldlineSeed values, calendar epochs, market starts, event schedules, and generated regional overlays. Players should experience those worlds as alternate Aurelian frontiers, while the authority model treats each worldline as its own shard with its own ledger, market, expedition, and profile policy.

This is feasible, but only after the local authority model is solid. Raw capability slots, endpoint generations, session ids, and local player labels cannot be portable authority across kernels. Cross-instance play needs a federation gateway that presents narrow local facade caps backed by remote protocol messages:

  • WorldlineDirectory: lists known remote worlds by content release, worldline id, endpoint, policy, and current ledger head.
  • WorldlineVisit: grants read/observe/chat/travel-preview authority for a remote site without importing inventory or mutation rights.
  • WorldlineExpedition: creates a bounded cross-world expedition object with explicit participants, allowed verbs, timeout, and home-world settlement rules.
  • WorldlineTransfer: coordinates item, currency, custody, or profile-state movement through reserve/escrow, commit/release, replay-safe receipts, and content-version checks.
  • WorldlineAudit: verifies remote receipts, ledger-head continuity, content hashes, generated-artifact provenance, and policy compatibility.

The game should support several integration levels, each with a different authority cost:

  • Echo view: a player can inspect another worldline’s public map state, rumors, market summaries, and public history. This is read-only and can land first.
  • Envoy visit: the local world creates a temporary projected character in a remote worldline. The projection may chat, observe, or perform explicitly granted low-risk actions, but cannot spend home inventory directly.
  • Expedition bridge: two worlds run a shared mission instance with a fixed seed and a coordinator receipt. Contributions are recorded in both ledgers only after commit.
  • Trade or custody transfer: ordinary goods, relics, currencies, or reputation effects move only through a transfer coordinator. Partial failure releases reservations; retries are idempotent.
  • Worldline migration: a profile moves or copies into another worldline under policy. Public achievements and custody claims are imported as verifiable receipts, not trusted client save blobs.

Parallel universes make seeded generation more important. If every worldline is the same authored graph with the same resources and routines, federation is mostly remote chat. Meaningful worldline differences should come from bounded seeded overlays: seasonal economies, market starts, event schedules, route hazards, outpost outputs, NPC routines, optional dungeons, and regional map variants. The generator must not mint authority by accident. It can choose that guild_iron_mine is closed in one worldline and open in another, but the resulting travel, market, custody, and reward effects still flow through the same service-owned capability and ledger paths.

Cross-world compatibility rules:

  • A worldline advertises a content release id and generator version. Peers may reject incompatible worlds or fall back to echo-only mode.
  • Generated artifacts are referenced by stable ids and provenance hashes, not by trusting remote prose.
  • Remote markets expose authenticated order/receipt views; local UI hints are not authority.
  • Cross-world transfers name both ledger heads and both worldline ids in the receipt, so replay into a different universe fails closed.
  • Faction standing, rank, and contributor rewards should import as witnessed claims with local policy gates. A remote honor does not automatically grant a local writ.
  • Clock/calendar drift is part of the design: worlds may have different seasons, festivals, or wars. Shared expeditions must pin an event epoch or name which world’s calendar controls the mission.
  • Failure modes are ordinary gameplay states: remote world unavailable, receipt stale, policy mismatch, content hash unknown, or escrow timeout. Each should have a deterministic denial path.

Near-term proof should not attempt full network-transparent play. A useful first slice can run two worldline services on one capOS instance with different fixed seeds and prove echo view plus a denied transfer:

worldline list
worldline inspect aurelian-mirror
worldline echo aurelian-mirror fort_aurelian
worldline transfer eagle-standard to aurelian-mirror

The expected result is that public state can be observed with content/seed metadata, while relic transfer is rejected until custody escrow, remote policy, and dual-ledger receipts exist.

Capability Mapping

The setting should teach capability ideas through play:

Game conceptcapOS concept
mission writrestricted launcher or mission bundle
gate routeendpoint/router cap with revocation
ward writtyped authority to request ward-state mutation
relic custodymove-only cap with audit trail
temple witnessaudit/log service with policy checks
rank marksession/profile metadata influencing broker grants
oath echosealed inherited state or one-use privileged cap
hostile magicuntrusted service/domain with strict schema boundary

The player does not need to see cap IDs. The game text should make authority concrete: “You cannot open the southern ward; your writ names only the tower gate.”

Service Architecture

Target process split:

flowchart TD
    Shell[capos-shell] --> Client[adventure command client]
    Client --> Adventure[Adventure service]
    Client --> Chat[Chat service]
    Adventure --> Mission[Mission state service]
    Adventure --> Audit[Audit or witness service]
    NPC1[centurion process] --> Chat
    NPC2[scout process] --> Chat
    NPC3[temple witness process] --> Chat
    NPC1 --> Adventure
    NPC2 --> Adventure
    NPC3 --> Adventure

Near-term implementation can keep one adventure-server process and separate NPC processes that only hold console and chat, matching the current system-adventure.cue shape. NPCs that affect world state should initially do so indirectly through player-visible offers and chat events. Direct NPC calls into Adventure require session-bound service facets such as AdventureNpc, or an equivalent broker-granted authority that cannot mutate unrelated mission state. Receiver-selector compatibility grants are not NPC mutation authority.

Later, mission state, actor AI, and audit/witness behavior can split into separate services.

Player, World, And Game Persistence

Persistence should be explicit service state, not kernel process checkpoint/restore. The adventure game needs several kinds of state with different durability rules:

  • Session state: foreground client state, prompt mode, transient command context, and chat cursors. This is per client and may disappear when the client exits.
  • Expedition state: current site, active mobs, hp/fatigue, temporary effects, party membership, pending invites, turn ordering, and in-progress objective state. It is resumable only when the player explicitly resumes an expedition. Ordinary run adventure-client should start from the current profile but not silently continue a half-finished mission.
  • Profile state: player id, display handle, rank marks, warrior stars, wizard circles, faction standing, cosmetics, contributor badges, title choices, and settings. This is durable player data and must survive client exit; once a durable store exists it should survive reboot.
  • Ledger state: append-only mission facts, debrief records, relic custody, forbidden-rite use, witness certifications, market/trade receipts, reward mints, and revocations. This is the audit source for profile mutations and should be harder to rewrite than profile summary fields.
  • World/public state: server-authoritative shared world data: persistent room history, public faction standing and consequences, quest-board state, market stock and prices, contested-site outcomes, ledger-derived public history, and shared campaign events. This is owned by game services, not by any one client, and is separate from private profile inventory. Smoke transcripts exercise a bounded slice of this state; production game-world instances grow it under capacity policy and shard boundaries rather than a fixed cap.
  • Content state: mission definitions, generated content blobs, aliases, dialogue, map graph, and validation metadata. This is versioned read-only content selected by content hash or release id, not player-mutable state.
  • User-owned backup state: encrypted save capsules stored through a browser session in the user’s own Google Drive app data folder or Firebase-backed user document space. This is private backup/sync state, not an authoritative source for shared world facts, rewards, or multiplayer outcomes.

Internal service split:

flowchart TD
    Client[Adventure client] --> Adventure[Adventure service]
    Adventure --> Content[AdventureContentCatalog]
    Adventure --> Profile[AdventureProfileService]
    Adventure --> Expedition[AdventureExpeditionService]
    Adventure --> Ledger[AdventureLedger]
    Profile --> Save[AdventureSaveStore]
    Expedition --> Save
    Ledger --> Save
    Save --> Store[Store or CloudGameStore]
    Save --> Vault[UserOwnedSaveVault]
  • AdventureContentCatalog exposes validated read-only mission content by content hash or release id and reports the generator/schema version used to build it.
  • AdventureProfileService owns durable per-player profile summaries. The current pre-ledger substrate may expose direct bounded summary mutation for host-testable save/load behavior, but final reward, title, rank, faction, cosmetic, badge, and similar profile application must be ledger-backed once AdventureLedger exists.
  • AdventureExpeditionService owns active mission/world instances. It can keep short-lived expeditions in memory, but explicit resume requires a checkpoint written through AdventureSaveStore.
  • AdventureLedger is append-only from ordinary game clients. Correction and revocation require separate witness/admin authority and must leave a record rather than rewriting history.
  • AdventureSaveStore serializes bounded Cap’n Proto save records to whichever backing service it was granted. It hides whether the backing is RAM, local disk Store/Namespace, or a cloud bridge.
  • CloudGameStore is an optional bridge service, not a replacement for capOS storage semantics. It exposes the same save/load/append operations as the local store adapter and should be granted only to the profile, expedition, and ledger services that need it.
  • UserOwnedSaveVault is a browser-mediated backup target. The browser receives an encrypted, signed save capsule and writes it using user-granted Drive or Firebase authority. Encryption keys follow the storage domain: local capOS storage uses local capOS-host key material, while GCP-backed game-world data uses Cloud KMS envelope encryption with a per-world or per-shard KEK wrapping service-owned DEKs. capOS and the adventure service do not receive the user’s OAuth access token, Firebase refresh token, Drive file IDs beyond opaque handles, or provider credentials.

Recommended rollout:

  1. Volatile baseline: keep current in-memory state keyed by the live endpoint caller-session scoped ref plus epoch, but define the profile, expedition, ledger, and content records as bounded structs and add host tests for encode/decode and migration rules. Normal shell launch/grant commands now omit legacy badge and receiver-selector syntax; explicit selectors are low-level compatibility or hostile-path fixtures, not the state identity model.
  2. Local store baseline: use RAM-backed then disk-backed Store/Namespace caps to prove profile save/load, explicit expedition checkpoint/resume, and ledger append/replay. This is the offline and QEMU proof path.
  3. GCP-backed bridge: run a narrow CloudGameStore bridge outside capOS or as a capOS service once networking is available. A practical GCP deployment uses Cloud Run for the bridge endpoint, Firestore Native mode for mutable profile/index documents and transactional updates, Cloud Storage with object versioning/lifecycle policy for immutable snapshots and evidence blobs, and Secret Manager for bridge-side service credentials. capOS clients still see only the CloudGameStore capability.
  4. User-owned browser vault: for private player data, a web terminal or browser companion can store encrypted save capsules in Google Drive appDataFolder or a Firebase user document. This is useful before capOS has durable local disk or direct provider SDK support. It must be treated as user-controlled transport for game-world encrypted data: the user can delete, withhold, duplicate, or roll back blobs, but cannot decrypt or forge accepted state without the relevant local capOS key or game-world KMS authority. On restore, the game verifies signatures, schema/content hashes, profile id, monotonic capsule version, previous capsule hash, and policy bounds before accepting any state; decrypted ledger records still validate their own previous-record hash chains.
  5. Hybrid sync: local Store remains the source for QEMU/offline proof paths, while CloudGameStore replicates selected profile/ledger objects. The sync boundary must be explicit: profile summaries may be overwritten through a checked version, ledger records append, and expedition checkpoints resolve conflicts by rejecting stale writes rather than merging combat state.

Minimum save record set:

AdventureProfile {
  profile_id
  display_handle
  version
  ranks
  warrior_stars
  wizard_circles
  faction_standing
  cosmetics
  contributor_badges
  settings
  updated_at
}

AdventureExpeditionCheckpoint {
  expedition_id
  profile_id
  content_hash
  checkpoint_version
  site_id
  objective_state
  player_state
  party_state
  mob_state
  pending_events
  saved_at
}

AdventureLedgerRecord {
  record_id
  profile_id
  expedition_id
  content_hash
  kind
  previous_record_hash
  payload
  witness
  created_at
  revoked_by
}

User-owned save capsules wrap those records rather than replacing them:

UserSaveCapsule {
  schema_version
  capsule_version
  profile_id
  device_id
  content_hash
  migration_policy
  record_kind
  record_version
  previous_capsule_hash
  plaintext_hash
  ciphertext
  aead_algorithm
  signature_algorithm
  signer_public_key_id
  signature
  created_at
}

Capsule encryption follows the same storage-domain rule as the backing store. When state is stored locally on a capOS host, the encryption key is local capOS-host key material and local backup/restore needs an explicit local key recovery story. When state is stored in GCP services, Cloud KMS is the key-encrypting-key service: it wraps or unwraps a capsule DEK, while the game-world service decrypts and validates capsule plaintext internally using the unwrapped DEK as service authority, modeled as a SymmetricKey capability. The browser may transport ciphertext, wrapped DEKs, and opaque Drive/Firebase handles, but it should not receive a plaintext DEK, SymmetricKey cap, KeySource cap, KMS decrypt/unwrap grant, or provider-independent plaintext authority unless a later explicit user-managed key export design adds that mode. For GCP-backed worlds, access to unwrap and use the DEK is game-world service authority mediated by KMS/IAM, not ownership of the Drive or Firebase blob alone.

For the GCP path, each game-world instance or shard gets its own Cloud KMS key ring and symmetric CryptoKey KEK. Runtime grants are scoped to the CryptoKey where possible: encrypt-only writers use roles/cloudkms.cryptoKeyEncrypter to wrap new DEKs, restore/migration readers use roles/cloudkms.cryptoKeyDecrypter to unwrap existing DEKs, and only the small game-world service that must do both uses roles/cloudkms.cryptoKeyEncrypterDecrypter. Rotation affects future DEK wrapping but does not re-encrypt existing capsules or retire old key versions. Re-encryption or rewrapping is a managed service operation: decrypt and validate the capsule inside the game-world service, then write a new capsule with a new DEK or a DEK rewrapped by the current primary KEK version. Old versions stay enabled until no accepted wrapped DEK depends on them. Retiring a world removes decrypt IAM first, may disable key versions to make protected capsules inaccessible, and only schedules destruction after audit/recovery decisions because completed key version destruction is irreversible.

Every persisted record needs a schema version, content hash or release id, size limit, and migration rule. Save/load must fail closed when the content hash is unknown, the record exceeds bounds, a capsule or ledger hash chain does not match, or the caller lacks the profile/expedition authority.

Do not use user-owned Drive/Firebase blobs as authority for public state:

  • contributor rewards still require AdventureLedger witness records;
  • multiplayer outcomes and market trades still require service-side validation;
  • public room history and shared world events should be stored by the game service or cloud bridge, not accepted from a user’s private backup;
  • rollback of a private backup may restore local profile cosmetics or an explicit expedition checkpoint, but it must not erase append-only public ledger facts.

Interface Sequencing

Do not add new gameplay verbs only as ad hoc client text. Every verb that changes world state needs a typed route before it is accepted as implemented.

For Phase 1 verbs, update these surfaces together:

  • schema/capos.capnp: add methods or typed command records for inspect, use, give, ask, order, seal, status, and explicit authority verbs including request, accept, delegate, and revoke.
  • tools/generated/ and canonical generated bindings through the existing generated-code workflow.
  • demos/capos-chat: add request encoders, result decoders, and DTOs for the new adventure methods.
  • demos/adventure-server: validate state, authority, bounds, and failure text in server handlers.
  • demos/adventure-client: keep parsing thin; convert user text to typed calls rather than duplicating game rules.
  • tools/qemu-shell-smoke.sh: assert one success path and one failure path for each new state-changing method.

The future CommandSession interface can replace the text adapter, but it is not a reason to add stringly world mutation in the interim.

Resource Bounds And Determinism

Two distinct bound regimes apply: the deterministic QEMU smoke proof, which must stay small and reproducible, and the production game-world instance, which is bounded by service capacity, shard policy, and quotas rather than a single fixed cap. The numeric limits below are the smoke-instance defaults; production tuning belongs to the game-world deployment runbook and grows with profile/ledger/expedition substrate, multiplayer authority, and persistent world state.

Smoke-instance and per-shard rules:

  • Per-instance MAX_PLAYERS stays explicit; every per-player map entry is removed on leave or process teardown.
  • Cap per-player inventory entries, writs, relics, marks, evidence records, active effects, and remembered chat cursors.
  • Cap per-site mobs, items, active wards, actors, and pending events.
  • Cap combat transcript lines per turn and reject oversized action text before semantic parsing.
  • Smoke transcripts use fixed encounter scripts. Production play may seed variation from mission state, but transcript-critical paths must still force a stable seed.
  • Keep multiplayer parties, duels, trades, pending invites, and contested-site records bounded per mission/shard with explicit overflow behavior, not silent drop.
  • Chat history must either be cursor-based per client or printed under an explicit “recent room record” header with a bounded line count for live views; persistent chat history may grow under retention policy in ledger/world-state services.
  • The current StdIO adapter may accept 256-byte command lines, but typed ids inside service calls remain 64-byte ASCII ids unless a reviewed schema/runtime change raises that limit.
  • Keep free-form text fields separate from ids: say text and future rest-of-line command text may use the command-line limit, while object ids, actor ids, mob ids, writ ids, directions, spell names, and skill names use the id limit.
  • Generated mission content must define explicit bounds for titles, descriptions, lead text, aliases, dialogue, and debrief lines, and branches that check in generated content need a freshness check so generated Rust blobs cannot drift from source mission data.

Smoke-instance suggested limits (production shards may raise these under capacity policy, but transcript-critical paths must run inside these bounds):

players: 64
ordinary inventory entries per player: 6
writs per player: 8
evidence records per player: 16
active effects per player: 8
mobs per site: 8
party members: 4
pending trades per player: 4
pending invites per player: 4
chat history per room: 16 lines
command-line bytes: 256
typed id bytes: 64
room/site title bytes: 80
description bytes: 320
lead/failure-hint bytes: 160
actor dialogue/debrief line bytes: 320

Command Surface

The current StdIO parser can grow the first mission quickly, but the target should be the structured command session described in Interactive Command Surfaces.

Initial text commands:

  • look
  • go <direction>
  • inspect <thing>
  • take <thing>
  • use <thing>
  • give <thing> to <actor>
  • ask <actor> about <topic>
  • request <writ>
  • accept <writ>
  • delegate <writ> to <actor>
  • revoke <writ>
  • order <actor> to <task>
  • seal <target>
  • inventory
  • status
  • say <text>
  • quote <item> from <actor>
  • buy <item> from <actor>
  • sell <item> to <actor>
  • trade <item> to <actor> for <item|favor>
  • repair <item> at <actor>
  • party <create|invite|accept|leave|delegate>
  • assist <player> with <task>
  • duel <challenge|accept|yield>
  • spar <player>
  • contest <site>
  • quit

Dynamic completions should come from room state:

  • exits for go,
  • visible items and held writs for inspect and use,
  • present actors for ask and give,
  • quoted shop inventory for buy and sell,
  • party members and pending invites for party and assist,
  • mission targets for seal.

Rich Browser Client

A later browser client should be a real game presentation layer: pixel-art locations, animated characters, inventory and authority panels, combat affordance buttons, event feeds, and chat surfaces. It should not be a terminal emulator with decorative art around StdIO.

The presentation model should be a 2D tilemap, not prose-only room cards. World data sent to the browser can include maps, tilesets, tile layers, object layers, collision/interaction zones, spawn points, actor paths, region/outpost markers, and event triggers. Tiled JSON is a plausible authoring/export format if the content validator rejects oversized maps, missing tiles, unknown layer types, invalid object references, and presentation data that tries to carry authority. PixiJS plus @pixi/tilemap is a reasonable first rendering candidate because it targets WebGL 2D tile rendering with a canvas fallback. That renderer choice must stay client-side; the game service remains the owner of authoritative location, collision, interaction, market, custody, and combat state.

That client can bypass adventure-client. The text client remains valuable for QEMU proofs, scripted transcripts, and compatibility, but the browser UI should talk to the adventure and chat services through WebShellGateway-held session authority:

Browser pixel-art UI
  -> WebShellGateway / web shell capability-call proxy
  -> session-scoped AdventurePlayer and ChatParticipant caps
  -> adventure-server and chat-server

The browser does not hold capOS capabilities directly. It holds opaque web-session handles and sends typed UI actions such as movement, target selection, inventory use, delegation, order, spell, skill, and chat requests. The gateway maps those requests onto the real session-scoped capabilities and returns structured view state or event records for rendering. Raw capOS CapIds, badge selectors, game-world keys, provider credentials, broad network authority, and shell spawn authority must not cross into browser JavaScript.

The trusted-host transport pattern that the gateway must satisfy already exists for the operator remote-session UI: see Remote Session CapSet Clients for the redaction, view-model, and policy-preflight rules that keep capOS handles, redacted transcript bytes, and provider credentials inside a trusted Rust backend while browser JavaScript receives only typed view models, call results, and denial diagnostics. The adventure browser client should reuse that same authority boundary: a trusted backend owns the session-scoped AdventurePlayer and ChatParticipant caps, applies the same redaction and denial discipline, and ships only view models and event records to the pixel-art renderer.

For the purpose-built adventure UI, a narrow AdventurePlayer / ChatParticipant surface is a better primary ABI than generic terminal text:

AdventurePlayer.look()
AdventurePlayer.go(direction)
AdventurePlayer.status()
AdventurePlayer.inventory()
AdventurePlayer.useItem(item, target)
AdventurePlayer.order(actor, task)
AdventurePlayer.cast(spell, target)
AdventurePlayer.skill(skill, target)
AdventurePlayer.delegate(writ, actor)
AdventurePlayer.pollEvents(cursor, maxEvents)

ChatParticipant.say(text)
ChatParticipant.history(cursor, maxLines)

CommandSession can still exist for terminal-like front ends, command palettes, automation, and compatibility adapters. It is not required for a custom pixel-art client whose UI already knows it is presenting the adventure game. The non-negotiable boundary is that browser presentation never becomes authority. Every action still flows through typed game capabilities, and the server rejects invalid location, stale state, missing authority, bad custody, combat restrictions, and oversized input.

This belongs well after the current game-depth phases. It depends on WebShellGateway authentication/origin policy and teardown, session-bound adventure/chat identity, persistent profile/checkpoint semantics, and a stable core game loop. Asset manifests for sprites, portraits, tiles, VFX, UI sounds, and animation ids should be explicit data. Asset presence or selection must not grant game authority, and missing assets should fail as presentation errors rather than mutating game state.

The browser harness should verify more than successful loading. It should drive one deterministic mission through UI actions and check tilemap layer order, actor placement, viewport/camera bounds, collision affordances, event-feed updates, logout/tab-close teardown, and rejection of browser-side attempts to mutate authoritative state without the typed gateway call.

QEMU Proof Path

Keep a deterministic smoke path similar to make run-adventure, but make it prove game mechanics:

setup/login
run adventure-client
status
ask centurion about mission
request ward-writ
go gate
use ward-writ
go tower
inspect standard
recover eagle-standard
ask legionary about ward
give scout-marker to scout
go under-vault
seal gate
inventory
quit
exit

Assertions should check:

  • launch grants remain explicit,
  • no password leaks into logs,
  • invalid action returns a specific failure,
  • authority grant/delegation uses explicit verbs rather than take,
  • item use changes world state,
  • NPC process reacts to at least one player action,
  • mission completion records an audit/witness line,
  • replayed chat is either suppressed or labeled as history,
  • at least one canonical-id suggestion for a near-miss command,
  • one shop quote or rejected trade explains the authority or price gate,
  • a fixed mission seed prints stable variant and calendar metadata once randomization lands,
  • a two-client co-op proof can delegate one writ or assist one action without leaking private inventory authority.

Current implemented proof coverage is intentionally narrower than the eventual target game, but it now follows the Aurelian mission path. make run-adventure keeps the shell-driven adventure-client transcript focused on representative interactive behavior: typed inspect, status, attack, skill, cast, give, ask, order, request, accept, and delegate calls; room-view mission, lead, actor, mob, writ, item, and canonical exit context; categorized Items, Writs, Relics, Marks, and Evidence output; a rejected invalid inspect input; canonical-id suggestions for near-miss ward and wraith inputs; Maro route evidence on ashen_road; a separate NPC process reaction; a failed attack against a warded mob; delegated order livia to dispel-sigil exposing a ward; a resolved Livia actor alias with an improved task hint; repeated detect-ward idempotence on an already exposed ward; ember-dart spell damage; a 2-star warrior strike; and eagle-standard recovery.

The complex custody path is covered by adventure-scenario-test, a real capOS userspace process with only Console and Adventure caps. It calls AdventureClient methods under QEMU and asserts initial categories, under_vault denial before temple-seal, pre-recovery Iunia denial, ward-writ route authority setup, ward-wraith defeat, relic recovery, non-droppable relic behavior, missing-location custody denial, missing ward-writ authority denial, unsafe-route witness refusal, survivor evacuation, gate sealing, witness-certified temple-seal custody, final evidence tokens, and under_vault access after custody.

The test strategy should stay split by risk. Pure deterministic game logic should live in ordinary Rust unit tests where possible: calendar rollover, seasonal availability, market matching, escrow state machines, blueprint validation, artifact property derivation, enchantment limits, route constraints, and agent quota accounting. Cross-service gameplay scenarios should use a real Rust userspace test client process that calls game caps under QEMU, as adventure-scenario-test already does for custody. The shell-driven adventure-client transcript remains the basic command/client proof for parser behavior, rendering, representative typed calls, and smoke-path integration; it should not become the only coverage for complex market, construction, economy, or agent-NPC state machines.

Implementation Plan

Phase 1: Player-Visible Mission Substrate

  • Implemented so far:
    • typed inspect, use, status, attack, skill, cast, and guard methods across schema, generated bindings, client wrappers, server handlers, terminal parser, and QEMU transcript assertions,
    • typed give, ask, order, seal, request, accept, delegate, and revoke methods across the same schema/client/server/parser/proof path,
    • explicit result text for failed and successful go, take, and drop actions,
    • compact player combat stats in status: hp, guard, fatigue, warrior stars, wizard circles, prepared spells, and active mobs,
    • one deterministic ward-wraith encounter with a warded-mob failure path, spell reveal, spell damage, martial skill damage, guard effect, and mob defeat,
    • one explicit objective and completion condition tied to ward-wraith defeat,
    • minimal per-player authority state for ward-writ request, acceptance, delegation, revocation, and gate sealing,
    • bounded per-player evidence/effect storage surfaced in status,
    • replayed room chat labeled as history for later room joins,
    • bounded object-id validation for typed object inputs,
    • server-side canonical id normalization for common casing and title aliases, plus bounded near-miss suggestions for the current mission ids,
    • typed AdventureRoomView mission, lead, actor, mob, writ, item, and canonical exit context rendered by look,
    • structured status and inventory output split into survival, location, mission, physical items, writs, relic custody, marks, evidence, effects, and lead,
    • idempotent repeated spell behavior in the interactive transcript,
    • a dedicated adventure-scenario-test userspace process that calls the Adventure cap directly to prove relic custody denial, witness refusal, temple-seal certification, categorized evidence, and under_vault access.
  • Current playable slice: the Aurelian gate-fort mission now comes from demos/adventure-content/content/prototype.cue, with checked-in generated Rust output consumed by the server and verified by make generated-code-check. State-changing behavior remains in Rust handlers, and make run-adventure proves the interactive eagle-standard recovery and replay-history path, while adventure-scenario-test proves survivor, gate-seal, and temple custody outcomes through real Adventure cap calls.
  • Typed relic recovery: recover eagle-standard is the dedicated custody verb, with take and drop reserved for physical items.
  • Local party foundation: Adventure owns the first deterministic party state for service-local player labels, pending invites, scoped ward-writ delegation, and detect-ward assist records. PvP consent, transfer escrow, and the two-client QEMU proof remain future work.
  • Physical-item transfer foundation: Adventure.transfer performs same-party local item mutation for ordinary inventory items and leaves currency escrow, cross-service trade, relic custody transfer, and successful two-client proof as future work.

Phase 2: Imperial Frontier Mission

  • Replace current four-room content with the Aurelian gate-fort mission. Complete.
  • Preserve objective/lead text in look and status, plus canonical-id suggestions for common near-miss commands.
  • Add typed inventory categories: items, writs, relics, evidence, marks. Complete for player-facing status and inventory output.
  • Add at least three actor processes with distinct chat/personality behavior; keep them chat-only unless explicit scoped Adventure grants and tests land in the same slice.
  • Add one route requiring a capability-style permission.
  • Add one objective with two acceptable outcomes.
  • Add one narrative debrief that records rank, standing, evidence, and audit consequences.

Phase 3: Persistent Profile And Ledger Substrate

  • Define AdventureProfile, AdventureExpeditionCheckpoint, and AdventureLedgerRecord structs with schema versions, content hashes, size limits, and host migration tests.
  • Add AdventureProfileService, AdventureExpeditionService, AdventureLedger, and AdventureSaveStore interfaces before persisting profile or world state in ad hoc server maps.
  • Prove a local baseline first: profile save/load, ledger append/replay, and explicit expedition checkpoint/resume through RAM-backed or disk-backed Store/Namespace.
  • Keep ordinary client launch fresh by default; require an explicit resume command or profile option before loading an active expedition checkpoint.
  • Add one rejected stale-checkpoint write and one rejected wrong-profile load to QEMU or host-level proof coverage.

Phase 4: User-Owned Browser Save Vault

  • Define UserSaveCapsule and browser transport semantics for private encrypted profile, settings, and explicit expedition checkpoint backups.
  • Use Google Drive appDataFolder or Firebase user documents as opaque capsule transports only; browser-held OAuth/Firebase tokens must not enter capOS game services.
  • Add tamper, wrong-profile, stale-version, replay, unknown-content, and oversized-capsule rejection tests before real provider adapters.
  • Keep public world state, multiplayer outcomes, reward witnesses, and market receipts out of user-owned blobs.

Phase 5: Compact Authority-RPG Loop

The next implementation phase should build on the pure targeted-combat foundation from commit f149119, not reopen broad calendar, market, construction, agent-NPC, federation, or worldline systems. The goal is one excellent expedition loop where authority is RPG power: choose writs and companions, enter a dangerous site, discover authority conflict, fight or negotiate under pressure, delegate or revoke power, extract, and gain reach for future missions.

  • Generate combat profiles from CUE for current mobs and validate malformed zones, damage kinds, alert groups, recognition thresholds, and stealth references through make generated-code-check.
  • Integrate generated combat profiles into adventure-server so inspected attacks use deterministic zone damage, fatigue, interruption, recognition, and alert helpers. Clients must not submit computed damage.
  • Extend parser/proof coverage only as needed for unambiguous authority-RPG commands: attack <mob> [zone] [with gladius], cast <spell> at <mob> [zone], and the first challenge authority <target> authority-combat alias.
  • Add one authority-attacking enemy behavior to the existing expedition: a forged route/custody claim, stolen custody token, seal conflict, corrupt revocation, or old-law wraith claim that can be inspected, exposed, bound, or revoked.
  • Treat writs as loot. Add one fixed-seed or authored writ modifier with a meaningful drawback; print issuer, scope, expiry, delegation rules, revocation conditions, modifier, and drawback in inspect/status output, and enforce the drawback in service logic.
  • Add one delegation-buildcraft proof using an existing companion. A trait such as loyalty, competence, fear, reputation, or doctrine should change how a delegated ward-writ or custody authority behaves and explain the cause.
  • Add one reach-based debrief unlock, such as Archive evidence verification, Temple vault custody upgrade, Signal tower remote revocation, or appointing one field deputy. This should unlock a future verb or jurisdiction, not generic damage or health.
  • Keep denial rewarding: at least one new authority denial should reveal a lead about hidden jurisdiction, forged authority, missing witness, rival claim, or alternate route.
  • Prove the slice with pure Rust tests for deterministic rules and one adventure-scenario-test path covering inspected targeted attack, authority threat/lead, writ drawback, delegation consequence, and reach unlock. The shell transcript should remain representative parser and smoke coverage.

Broad systems remain deliberately demoted until this loop is strong: calendar/season gameplay, regional market order books, construction jobs, artifact/enchantment production, optional agent NPCs, MMO/open economy work, federation, and worldlines should not be treated as next local sequencing truth for implementation agents.

Phase 6: Structured Command Session

  • Move from app-owned StdIO parsing to CommandSession.
  • Expose dynamic command metadata and completions.
  • Keep a text adapter for QEMU scripts.

Phase 7: Multiplayer Authority Proof

  • Do not start this phase until Adventure and chat authority use session-bound caller identity, or future broker-granted service facets, rather than player receiver-selector identity. The first bounded slices key local player labels from live caller-session metadata.
  • Add local multi-client party state keyed by service-created player objects, with explicit invite, accept, leave, and delegation commands.
  • Add one deterministic co-op combat or ward puzzle where one player assists another without receiving unrelated inventory authority.
  • Add one opt-in duel or sparring proof with scoped harmful authority and a rejected unauthorized attack path.
  • Add bounded trade offers for ordinary loot and reject relic transfers unless custody authority permits them. The proof must show the transfer coordinator cannot duplicate, lose, or partially transfer an item when offers go stale, cancellation races with acceptance, or a retry repeats the same request.
  • Extend QEMU scripting to drive two clients or two command sessions through a stable multiplayer transcript.

Future Follow-Up: Golems, Gates, And Infrastructure

Golems should be imperial magotechnical infrastructure before they are enemies. They fit the setting as labor frames, cargo haulers, bridge-builders, sentries, field repair units, siege engines, and rare battlefield assets. Model each golem as a body, a bound core, and an energy source: the body defines role, the core defines identity and obedience, and the energy source defines endurance.

Initial golem types:

  • cargo-golem: moves sealed supplies or heavy relics only when granted a matching route authority.
  • ward-golem: guards a shrine, vault, or gate and recognizes proof tokens rather than passwords.
  • siege-golem: breaks barriers but requires multiple grants, such as engineer approval plus energy access.
  • field-repair-golem: restores damaged ward anchors when supplied with materials and repair authority.
  • corrupted-golem: obeys malformed or stale authority, making revocation and audit behavior visible in play.

A golem should not become ordinary inventory. The player receives scoped command authority over it: inspect, wake, route, repair, bind, delegate, audit, or revoke. A useful rule of thumb: order cargo-golem north-gate succeeds only when the player holds both cargo-command and north-gate-route.

Gates and portals should be imperial route infrastructure: roads made executable. They move authority, troops, messengers, and supplies across the frontier. Standing at a gate is not enough; use requires a physical anchor plus a valid writ, seal, route token, ward key, or alignment state.

Gate components:

  • gate-anchor: fixed legal endpoint.
  • route-writ: temporary authority to open one path.
  • ward-key: faction or office authorization.
  • stabilizer: consumable or repairable part that bounds usage.
  • gate-log: inspectable audit trail for deterministic investigation evidence.

Gate constraints should create missions rather than decoration:

  • gates open only between known anchors,
  • heavy constructs require freight routes rather than personal routes,
  • damaged gates can misroute, refuse cargo, or leak hostile entities,
  • emergency gates can open one-way and revoke the route after use.

Follow-up mission candidates:

  • Gate repair: recover a stabilizer, prove engineer authority, order a repair golem, and open a bounded evacuation route.
  • Golem command: delegate a narrow task to a ward golem after presenting the correct seal.
  • Logistics: move medicine, grain, or signal crystals through gate routes with cargo-size and route-authority limits.
  • Investigation: inspect gate logs, compare seals, and identify which faction abused or forged route authority.
  • Siege: choose between spending rare siege-golem command, negotiating gate access, or repairing an old military road.
  • Containment: seal a corrupted portal while wizard-circle spells stabilize the breach and warrior-star formations defend the site.

Typed verb candidates for these later slices include bind, route, repair, charge, open-gate, seal-gate, attune, stabilize, trace-route, and audit-gate. They should remain scoped and revocable, and status output should expose active seals, bound routes, charged spells, wounded formation members, unstable wards, and delegated golem tasks.

Open Questions

  • Should actor NPCs call the adventure service directly, or should they only communicate through chat in the first interesting version?
  • How much randomized event timing can be allowed in production play before QEMU transcript coverage becomes brittle, given the smoke path runs under a fixed seed?
  • Should shops and trades live inside Adventure at first, or split into a Market service once two-party trade exists?
  • Should parties be mission-local objects, profile-level groups, or future session broker grants?
  • What is the minimum PvP consent record that is useful for the game without overbuilding policy after the profile/ledger substrate exists?
  • How are game-world shards/instances scoped: one shared world, one per campaign, one per party, or one per deployment? This determines whether faction standing, ledger history, and market state are global or per-shard.
  • Where does the boundary sit between server-authoritative public history (visible to all players in a shard) and per-profile audit records that remain private even within the same shard?
  • After the echo-only worldline proof, should the next federation slice be envoy visits, expedition bridges, market/custody transfer, or profile migration?
  • What retention/archival policy applies to ledger records, debrief evidence, and chat history once the game runs long enough to accumulate them?

Follow-Up Proposal

After the Aurelian adventure design is implemented well enough to support stable profiles, ranks, evidence, debriefs, cosmetic items, and deterministic proof coverage, the game can grow a separate contributor-facing layer.

That follow-up is tracked in Contributor Quest Mechanics. It describes maintainer-witnessed “outer-world quests” for real capOS development work, such as fixing a full GitHub issue URL or improving QEMU proofs, and limits rewards to badges, temporary states, decorative items, and bounded game-only perks. It must not grant repository authority, OS authority, or any ability to mutate another player’s profile.

Design Grounding

Grounding files for this proposal:

  • CLAUDE.md
  • README.md
  • docs/proposals/index.md
  • docs/proposals/interactive-command-surface-proposal.md
  • docs/proposals/session-bound-invocation-context-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/boot-to-shell-proposal.md
  • docs/backlog/runtime-network-shell.md
  • docs/proposals/service-object-capabilities-proposal.md
  • docs/backlog/stage-6-capability-semantics.md
  • docs/security/trust-boundaries.md
  • docs/proposals/cryptography-and-key-management-proposal.md
  • docs/proposals/volume-encryption-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/cloud-deployment-proposal.md
  • docs/proposals/contributor-quest-mechanics-proposal.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/proposals/hosted-agent-swarm-proposal.md
  • docs/proposals/remote-session-capset-client-proposal.md
  • docs/proposals/capos-repo-harness-engineering-proposal.md
  • docs/research/game-mechanics-prior-art.md
  • docs/research/plan9-inferno.md
  • docs/research/hosted-agent-harnesses.md
  • docs/backlog/aurelian-frontier.md
  • docs/backlog/hardware-boot-storage.md
  • schema/capos.capnp
  • system-adventure.cue
  • tools/qemu-shell-smoke.sh
  • demos/adventure-client/src/main.rs
  • demos/adventure-server/src/main.rs
  • demos/adventure-npc-wanderer/src/main.rs
  • demos/adventure-npc-shopkeeper/src/main.rs
  • demos/capos-chat/src/lib.rs

Proposal: Contributor Quest Mechanics

How capOS can later use the adventure game as a playful interface for real open-source development work without confusing game rewards with repository authority.

Purpose

The current adventure proposal makes capability ideas playable inside capOS. This follow-up uses the same fiction and authority vocabulary to encourage real-world contributions to capOS itself.

An in-game officer, quartermaster, guild broker, temple witness, or academy scribe can issue “outer-world quests” such as:

Outer-world quest: fix https://github.com/<org>/<repo>/issues/123
Proof: merged PR linked to the issue and passing required checks
Reward: bug-hunter seal, review-lantern cloak clasp, +1 cohort standing

The goal is to make useful project work more visible and more fun:

  • fix bugs,
  • reproduce failures,
  • write tests,
  • improve docs,
  • review security-sensitive changes,
  • reduce flaky QEMU harnesses,
  • triage issues,
  • mentor new contributors,
  • write design notes,
  • run release checklists.

This must not become a substitute for maintainership, code review, or security policy. Real maintainers still decide what merges. The game records recognized contributions after normal project workflow has already accepted them.

This proposal sits between the playful adventure substrate and the longer-term agentic-development substrate that already records how capOS sessions, reviews, and subagent runs are tracked. Read alongside:

  • docs/proposals/aurelian-frontier-proposal.md is the base game that supplies profiles, ranks, evidence, debriefs, decorations, and deterministic proof coverage. Contributor quest mechanics layer on top of those structures and must not run before the base game is stable.
  • docs/proposals/llm-and-agent-proposal.md defines the language-model, embedder, and agent-runner capability surface. When an in-game NPC, quest issuer, or scribe is later voiced or reasoned over by a language model, the authority surface used is the typed LLM/agent capabilities from that proposal, with per-tool consent/stepUp/forbidden gating. Game-side rewards must never widen what an agent can do at the OS or repository level.
  • docs/proposals/agentic-development-experiment-proposal.md covers the longitudinal study of agentic coding sessions, subagent dispatch, review agents, and session-recap tooling. Contributor quest evidence is a human-facing, reward-shaped overlay; the agentic-development experiment is the engineering-evidence overlay. They share the same underlying real-world contribution stream (merged PRs, closed issues, accepted proposals, review outcomes), but they intentionally keep separate ledgers: game-side rewards are cosmetic and reputational; agentic-development records are scientific observation and tooling artifacts. Neither overlay grants authority to the other.

Design Goals

  • Make contribution paths discoverable for people who arrive through the demo.
  • Reward real project progress without giving game systems repository power.
  • Keep rewards mostly cosmetic, narrative, reputational, or convenience-level.
  • Use full issue and PR URLs, commit hashes, and review records as evidence.
  • Let maintainers mint or revoke recognition through explicit authority.
  • Avoid incentives that make people spam issues, rush reviews, or optimize for game points over project quality.
  • Preserve privacy: public contributions can be celebrated; private identity links require explicit consent.

Non-Goals

  • No automatic merge, close, label, or assignment authority from the game.
  • No token handling inside the game client.
  • No paid bounty, token, cryptocurrency, or transferable reward system.
  • No leaderboard that pressures security reviewers or maintainers into rushed public rankings.
  • No reward that grants kernel, shell, broker, or repository authority.
  • No reward that lets a player mutate another player’s profile or inventory.

Core Loop

The player sees an in-game quest board after the ordinary Aurelian campaign has enough profile state to matter.

  1. A trusted quest issuer publishes a bounded list of outer-world quests.
  2. The player claims or follows one quest, such as a GitHub issue.
  3. The player does the work outside the game through normal GitHub and review workflow.
  4. A maintainer or verifier records the accepted proof: merged PR, linked issue closure, accepted docs patch, reproduced bug log, or completed review.
  5. The game mints an in-world mark, badge, decorative item, state, title, or bounded perk.
  6. Debrief text explains what real contribution was recognized and why.

The game should phrase this as imperial frontier logistics rather than as a raw task tracker:

Quest Board: Outer Works

Issue 123: repair the failing run-adventure transcript.
Need: one merged fix and one reviewer witness.
Reward: Lanternwright badge, smoke-runner sash, cohort standing +1.

Reward Types

Rewards should be valuable enough to feel visible, but not strong enough to turn project contribution into a grind.

Badges

Badges are durable profile marks:

  • bug-hunter: fixed a confirmed defect.
  • smoke-runner: repaired or improved a QEMU proof path.
  • doc-cartographer: improved docs, backlog, roadmap, or proposal clarity.
  • review-witness: completed a substantive review accepted by a maintainer.
  • security-sentinel: closed or helped validate a security finding.
  • first-boot-guide: helped a new contributor get a local boot/test flow.
  • release-quartermaster: completed release or dependency audit work.

Badges are evidence-backed, not self-declared. A badge record includes the proof URL or commit hash, issuer, timestamp, and short reason.

States

States are temporary profile or expedition conditions:

  • on outer patrol: player has claimed or followed an issue.
  • awaiting witness: contribution submitted, waiting for review/verification.
  • maintainer witnessed: accepted proof exists, reward can be minted.
  • needs reproduction: player is gathering logs or QEMU transcript evidence.
  • blocked by design: task needs a proposal or maintainer decision first.

States should expire or be explicitly cleared. Stale states must not block the player from ordinary game progress.

Decorative Items

Decorative items are visible in the game world without granting project power:

  • review lantern,
  • smoke-runner sash,
  • warded keyboard,
  • cartographer map case,
  • broken-panic trophy,
  • issue-forged signet,
  • release quartermaster ledger,
  • first-boot camp banner.

Decorative items can appear in status, profile views, housing, tavern dialogue, party banners, or debrief records.

Perks

Perks must stay bounded and low-risk:

  • title choices in chat or debrief text,
  • cosmetic room decorations,
  • additional flavor dialogue from NPCs,
  • small convenience options inside adventure missions,
  • access to optional lore logs or museum rooms,
  • non-combat party banner effects.

Avoid perks that create an optimal gameplay path only available to frequent contributors. The point is recognition and orientation, not a second economy.

Quest Types

First Contribution

Real-world task:

  • make a first accepted contribution to capOS,
  • pass the relevant checks,
  • respond to review without needing maintainer rescue,
  • leave the touched docs, tests, or code in a maintainable state.

Game framing:

  • receive a recruit’s field mark,
  • cross the first gate under supervision,
  • return with a witnessed service record.

Reward examples:

  • first-gate badge,
  • recruit sash,
  • academy standing,
  • optional mentor thank-you record.

This quest should reward the first accepted contribution once, not every small patch. It exists to make the contributor path visible and to recognize the friction of getting a local toolchain, QEMU proof, review loop, and project style working for the first time.

Bug Hunts

Real-world task:

  • fix a confirmed GitHub issue,
  • add regression coverage,
  • preserve or improve relevant QEMU transcript assertions.

Game framing:

  • track a breach,
  • seal a faulty gate,
  • prove the fix with a witness log.

Reward examples:

  • bug-hunter badge,
  • broken-panic trophy,
  • cohort standing.

Smoke Runner Work

Real-world task:

  • improve make run-* smoke stability,
  • add missing transcript assertions,
  • reduce brittle sleeps,
  • preserve password/log redaction.

Game framing:

  • run the frontier signal route,
  • repair a proof beacon,
  • return with a clean gate log.

Reward examples:

  • smoke-runner sash,
  • transcript lantern,
  • scout standing.

Documentation Cartography

Real-world task:

  • improve docs/tasks/README.md, backlog files, roadmap, proposal status, runnable demo docs, or research grounding,
  • remove stale status claims,
  • clarify next-step sequencing.

Game framing:

  • update the imperial route map,
  • reconcile witness records,
  • mark safe roads for new operators.

Reward examples:

  • doc-cartographer badge,
  • map-case decoration,
  • academy scribe title.

This should include small docs corrections, but the higher reward tier should require work that materially improves future contributors’ ability to navigate the project: clearer milestone state, sharper backlog decomposition, better runbook steps, or removal of misleading/stale status text.

Accepted Design Proposal

Real-world task:

  • submit a design proposal that maintainers accept,
  • ground it in existing project docs and relevant research when needed,
  • update proposal indexes and any affected roadmap/backlog status,
  • respond to review by narrowing unsafe scope or documenting tradeoffs.

Game framing:

  • present a plan before the imperial council,
  • have temple witnesses validate the authority chain,
  • receive a sealed charter for future field work.

Reward examples:

  • charter-writer badge,
  • council seal,
  • strategy-table decoration,
  • design-witness title.

“Accepted” means the proposal is merged or explicitly recorded as accepted in the repository. Drafts, brainstorms, and abandoned proposals can still receive ordinary participation flavor, but they should not mint the accepted-design badge.

Security Witnessing

Real-world task:

  • review trust-boundary changes,
  • close a review-finding task record,
  • add proof coverage for hostile input,
  • update security docs when a boundary changes.

Game framing:

  • testify before the temple annex,
  • certify relic custody,
  • expose a forged writ.

Reward examples:

  • security-sentinel badge,
  • temple witness seal,
  • lawful-custody title.

Mentorship And Onboarding

Real-world task:

  • help a contributor get local builds, tests, QEMU, or docs working,
  • improve setup notes based on observed friction,
  • pair on a first small patch.

Game framing:

  • guide a new recruit through the gate yard,
  • issue safe training writs,
  • staff the frontier academy.

Reward examples:

  • first-boot-guide badge,
  • recruit banner,
  • academy standing.

Evidence Model

Each recognized contribution should produce a bounded ContributionEvidence record:

quest_id
kind
full_issue_url
full_pr_url
commit_hash
issuer
subject_profile
summary
accepted_at
reward_ids
revoked

The evidence record is not a legal identity document. It is a project-visible game record that says a maintainer or authorized verifier recognized a public contribution.

Use full URLs for GitHub issues and PRs. Do not rely on shorthand issue numbers without repository identity.

Capability Mapping

Game conceptcapOS concept
quest boardread-only issue feed or maintainer-published mission list
quest claimoptional local profile state, not GitHub assignment
proof URLevidence record input
maintainer witnessauthority to certify accepted contribution evidence
badge mintscoped profile mutation capability
reward revocationaudit-backed correction capability
decorative itemnon-authority profile state
contributor standingbroker input only for game/social features

The game must never turn a decorative badge into OS or repository authority. If a future broker uses contributor standing, it must be for game/social features unless a separate security design explicitly says otherwise.

Service Architecture

Keep this out of the core adventure service until the base game is stable.

Target split:

flowchart TD
    GitHub[GitHub / Forge] --> Importer[Quest Importer]
    Maintainer[Maintainer Session] --> Witness[Contributor Witness]
    Importer --> Board[Quest Board]
    Witness --> Rewards[Reward Mint]
    Rewards --> Profile[AdventureProfileService]
    Rewards --> Ledger[AdventureLedger]
    Client[Adventure Client] --> Board
    Client --> Profile

Initial implementation can be manual:

  • a checked-in or manifest-provided quest list,
  • maintainer-issued proof records,
  • no network calls from capOS to GitHub,
  • no tokens in the demo VM.

Reward records use the adventure persistence split:

  • quest definitions and fixture issue lists are content/catalog data;
  • claims and temporary states are profile state;
  • accepted proof, issuer, timestamp, reward mint, and reward revocation are append-only ledger records;
  • badge, title, decoration, and cosmetic summary fields are applied to AdventureProfileService only after a matching ledger record exists.

Later implementation can add a ForgeConnector service with narrow read-only authority:

  • list selected issues by repository and label,
  • fetch merged PR metadata,
  • verify commit hashes or check statuses,
  • never mutate GitHub state.

Any mutating forge integration, such as labels or comments, requires a separate security proposal and must not be hidden inside the game service.

Abuse And Incentive Controls

The system should discourage low-quality contribution farming.

  • Rewards are maintainer-witnessed, not automatically minted from activity.
  • Repeated trivial fixes should not produce unbounded badges.
  • Security review rewards should avoid public speed rankings.
  • Issue claiming inside the game must not block real contributors on GitHub.
  • Reward descriptions should name quality criteria: tests, docs, review, and accepted project value.
  • Maintainers can revoke or amend mistaken rewards with an audit note.
  • Public profiles should allow hiding or unlinking personal identity details.

Privacy And Identity

Players may want to keep game profiles and GitHub identities separate.

Rules:

  • Linking a game profile to a GitHub account must be explicit.
  • Public GitHub evidence can be recorded as a URL without exposing private tokens or session state.
  • Private email addresses, tokens, and local machine paths must never appear in reward records.
  • The game can show “verified public contribution” without requiring the player to reveal more than the accepted public artifact.
  • If OIDC or passkeys later connect identity, use the user/session/policy proposals rather than adding identity shortcuts to the game.

Command Surface

Candidate commands after the base adventure command surface exists:

quests
quest inspect <quest-id>
quest follow <quest-id>
quest evidence <quest-id>
badges
badge inspect <badge-id>
decorate <slot> with <item-id>
title set <title-id>
profile share <public|private>

Maintainership commands must be separate and authority-gated:

quest publish <quest-id>
quest witness <quest-id> for <profile> with <proof-url>
reward mint <reward-id> for <profile>
reward revoke <reward-id> for <profile>

These are typed service calls, not shell-special strings.

Implementation Phases

Phase A: Manual Recognition

  • Add this proposal to the proposal index and docs summary.
  • Define bounded quest, evidence, badge, state, decoration, and perk data records.
  • Store accepted proof and reward mint/revocation as append-only AdventureLedger records, then derive visible profile badges from those records.
  • Add a small checked-in sample quest list using full GitHub issue URLs.
  • Add manual witness records in test content.
  • Show badges/decorations in profile/status output.
  • Add QEMU proof that a witnessed quest mints a cosmetic badge and that an unwitnessed claim does not.

Phase B: Game Integration

  • Add an in-game quest board location after the Aurelian campaign.
  • Add NPC dialogue that points contributors toward real project workflows: reproduce, test, document, review, and submit.
  • Add debrief text that ties accepted contribution evidence to in-world recognition.
  • Keep all rewards non-authority unless a separate reviewed design grants narrow game-only authority.

Phase C: Forge Read Model

  • Add a read-only forge import path for selected repositories and labels.
  • Verify merged PRs, issue links, check statuses, and commit hashes without storing tokens in game state.
  • Add host tests for malformed URLs, cross-repository ambiguity, oversized metadata, and stale proof records.
  • Add QEMU smoke using fixed fixture data rather than live network calls.

Phase D: Community Events

  • Model release weeks, bug hunts, docs sprints, and review days as bounded seasonal quest boards.
  • Add group recognition for cohorts without ranking individuals by raw activity count.
  • Add opt-in public profile export for contributor showcases.

Open Questions

  • Who can witness rewards before durable maintainer identity exists inside capOS?
  • How should reward revocation be displayed without creating public shaming mechanics?
  • Can issue labels be imported read-only without making GitHub availability a boot or smoke-test dependency?
  • What is the smallest useful “perk” that feels meaningful while remaining non-authoritative?
  • Should a local-only demo use fictional fixture issue URLs, real capOS issue URLs, or both?

Design Grounding

Grounding files for this proposal:

  • docs/proposals/aurelian-frontier-proposal.md
  • docs/backlog/aurelian-frontier.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/proposals/agentic-development-experiment-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/interactive-command-surface-proposal.md
  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/proposals/oidc-and-oauth2-proposal.md
  • docs/proposals/security-and-verification-proposal.md
  • docs/security/trust-boundaries.md
  • docs/tasks/README.md

No docs/research/ report is directly applicable at this stage. This proposal is community workflow and game-design planning layered on existing project proposal documents, not a new OS/runtime architecture claim.

Proposal: Public Release and Maintainer Boundaries

How capOS can become publicly visible without accidentally promising general support, production security, broad feature review, or an always-on community moderation role.

This proposal is the maintainer-load and release-governance layer over the project’s existing security review and verification tracks. The owning trackers for the risk and process surfaces this document references are:

  • docs/design-risks-register.md for the consolidated index of long-horizon design risks and open architectural questions. The hygiene and split gates below do not redefine those risks; they only constrain what may be claimed publicly while they remain open;
  • docs/proposals/security-and-verification-proposal.md for the security review vocabulary, trust-boundary checklist, and verification tracks (host tests, model checks, fuzzing, QEMU smokes, review). The “evidence vs. assurance” wording in SECURITY.md below is the public-facing framing of that proposal’s conclusion that current checks are engineering evidence, not an independent audit;
  • docs/proposals/repository-composition-proposal.md for the core/sibling scope rule the adventure split and history-rewrite gates below enforce.

Purpose

Open-sourcing capOS should be a controlled publication step, not a commitment to operate a public support queue. The project can benefit from public inspection, reproducible demos, and selected outside contributions while still rejecting vague bug reports, low-effort questions, unsupported feature requests, and large drive-by changes.

The first public release should optimize for:

  • accurate public claims,
  • narrow maintainer commitments,
  • security honesty,
  • reproducible build and QEMU evidence,
  • curated contribution paths.

It should not optimize for community growth, support volume, or broad roadmap input.

Release Position

capOS should be described publicly as:

  • experimental research software,
  • x86_64/QEMU-first,
  • capability-system focused,
  • not production-ready,
  • not independently security-audited,
  • not suitable for hostile multi-user deployment,
  • not a supported general-purpose OS.

Top-level public wording should be direct:

capOS is experimental research software.

It has not undergone an independent security audit. Treat the current code as
unsafe for production use, unsafe for hostile multi-user deployment, and
unsuitable for protecting real secrets or availability-sensitive workloads.

Security boundary documents in this repository describe design intent and
current proof coverage, not certification or a guarantee of correctness. QEMU
demos, host tests, model checks, and review notes are engineering evidence.

This statement belongs in README.md and SECURITY.md before the repository is made public.

Non-Goals

  • No public support promise for local build environments.
  • No support guarantee for platforms beyond the documented QEMU path.
  • No expectation that maintainers answer questions already covered by docs.
  • No roadmap voting or feature-request queue.
  • No public chat server at launch.
  • No production-security claim.
  • No response-time SLA for ordinary issues or pull requests.
  • No acceptance of large unplanned subsystem PRs.

Maintainer Load Model

The initial public repository should run in source-visible mode:

  1. The code, docs, and reproducible demo paths are public.
  2. Issues are enabled only through strict templates.
  3. Pull requests are allowed but scoped by contribution rules.
  4. Discussions and chat are disabled at launch.
  5. Maintainers publish a small set of curated tasks they are willing to review.

Public visibility does not imply public support. Maintainers may close issues without extended discussion when they are:

  • usage questions answered by README.md, docs/, or command output;
  • vague bug reports without exact commands, commit hash, host/QEMU versions, or relevant log excerpts;
  • broad feature requests outside docs/roadmap.md;
  • requests to support unrelated platforms, package managers, or deployment environments;
  • debates about project direction without a concrete proposal;
  • large implementation PRs that did not start from an accepted design.

Closed issues should usually receive one short reason and, when possible, a link to the relevant document. Maintainers should not spend review time rewriting low-quality reports into actionable work.

Issue Intake

The public repository should not allow blank issues at launch. Use issue templates for:

  • reproducible bug report,
  • QEMU transcript failure,
  • documentation problem,
  • design proposal,
  • private security report pointer.

Bug reports should require:

  • capOS commit hash,
  • host OS and architecture,
  • QEMU version when QEMU is involved,
  • exact command run,
  • expected result,
  • actual result,
  • relevant bounded log excerpt.

Default labels should include:

  • needs-repro,
  • needs-design,
  • docs,
  • qemu,
  • security,
  • not-planned,
  • support-boundary,
  • maintainer-curated,
  • agent-assisted,
  • needs-human-owner,
  • agent-spam,
  • review-capacity,
  • too-large.

Suggested policy:

  • close needs-repro reports after 14 days without requested reproduction details;
  • close broad roadmap requests as not-planned;
  • convert recurring valid questions into documentation, then close later duplicates by linking the docs;
  • do not promise ordinary issue response times.

Automated issue batches should be closed without triage when they are not attached to a concrete human-owned reproduction, design proposal, or patch. Large generated reports are not useful by themselves; they must be reduced to one actionable issue with exact evidence.

Pull Request Intake

Small fixes, regression tests, docs corrections, and narrow bug fixes may be accepted without prior design discussion when they fit existing architecture.

Large changes should start as an accepted design proposal or maintainer-curated issue. This includes changes to:

  • kernel capability semantics,
  • schema or ABI,
  • userspace runtime behavior,
  • boot flow,
  • security boundaries,
  • dependencies,
  • hardware or architecture support,
  • public command surfaces,
  • persistent storage,
  • networking beyond the selected milestone.

Every non-trivial PR should state:

  • motivation,
  • changed trust boundary or docs-only,
  • commands run,
  • QEMU proof when behavior is user-visible,
  • generated-code and dependency notes when relevant,
  • design proposal link for large changes.

Maintainers may close unplanned large PRs without detailed review. That rule is necessary to avoid turning public visibility into unpaid architecture consulting.

Agent-Assisted Contributors

Public capOS will likely attract contributors using Claude Code, Codex, and similar tools to run long agent loops. That is acceptable only when the human owner remains accountable for the work.

Agent-assisted work must:

  • disclose that automation was used;
  • have one human owner who understands the diff and can answer review questions;
  • stay attached to an accepted issue, proposal, or maintainer-curated task unless it is a narrow obvious fix;
  • include exact verification commands and relevant output summary;
  • preserve the repo’s worktree, review, and security rules;
  • keep generated logs, prompts, transcripts, and local secrets out of the repository;
  • avoid opening batches of speculative issues or PRs from broad scans.

Suggested public policy:

  • one active non-trivial PR per new external contributor by default;
  • no more than two active PRs per established external contributor unless a maintainer explicitly raises the limit;
  • agent-generated drive-by refactors, mass lint churn, dependency churn, or roadmap reshuffles are closed as too-large or not-planned;
  • PRs without a responsive human owner are closed as needs-human-owner;
  • repeated automated noise is labeled agent-spam and may be blocked at the account level.

The merge rule still stays positive: a useful PR with the required review context, relevant verification, and no unresolved blocking findings should be merged. The throttles exist to protect review capacity, not to reject good work because an agent helped produce it.

Review Capacity

Maintainers should publish the current review mode in a pinned issue, project view, or CONTRIBUTING.md section:

  • open: accepting curated external work;
  • limited: reviewing only bugs, security reports, and maintainer-curated tasks;
  • paused: no new external PR review except private security reports.

When review capacity is full:

  • new feature PRs may be closed as review-capacity without technical review;
  • stale needs-repro issues may be closed after the documented timeout;
  • stale needs-changes PRs may be closed after the maintainer-requested changes are not addressed;
  • emergency capacity is reserved for private security reports and regressions in documented release/demo paths;
  • public contributors should be pointed toward existing reviewed tasks instead of creating new backlog entries.

Passing CI is evidence, not entitlement to review. The review bar remains REVIEW.md, useful project value, relevant verification, and no unresolved blocking findings.

Workflow Transition

The current private workflow uses local branches, dedicated worktrees, docs/tasks/README.md, backlog files, and local review loops as the planning and review source of truth. That is appropriate while capOS is mostly private and automation-heavy, but it is not the right public collaboration model forever.

At some point after public release, the project should migrate public work to:

  • GitHub Issues for user-visible bugs, accepted design tasks, support-boundary closures, and curated contributor work;
  • GitHub Pull Requests for review, CI evidence, design-grounding notes, and merge decisions;
  • GitHub Projects for milestone planning, sequencing, and status views.

That transition should replace roadmap/backlog-driven operational planning for public work. docs/roadmap.md should remain a high-level narrative roadmap, not the active task board. docs/backlog/ should become design context and historical decomposition, not the queue maintainers are expected to triage by hand.

The migration should not happen in the first quiet source-visible launch. It needs explicit gates:

  • issue templates and labels are stable enough to reject low-quality reports;
  • PR templates capture trust-boundary, verification, and design-grounding requirements;
  • CI exposes the baseline checks public contributors can run and cite as evidence;
  • REVIEW.md remains the merge bar: a useful PR with the required review context, relevant verification, and no unresolved blocking findings should be merged;
  • maintainers have decided which roadmap/backlog items become public issues;
  • a GitHub Project exists for the selected public milestone;
  • stale local backlog entries have been either converted to issues or marked as historical context;
  • private/security-sensitive planning remains outside public issues until the security policy says where it belongs.

During the transition, avoid dual sources of truth. If a task is public and tracked in GitHub Projects, its active status should live there. Repo docs should link to the project or issue instead of duplicating the live state.

Planning Source Migration Model

The private planning files should migrate by role, not by copying their whole contents into GitHub:

  • docs/roadmap.md remains a narrative of visible outcomes and design order. It should name the current public milestone and link to the GitHub Project view, but it should not list every public task status.
  • Local task records under docs/tasks/ remain the private/operator surface for local agent loops, security-sensitive tasks, and unreleased integration state. For public work, they should contain issue or PR URLs plus the reason the item is active, not a duplicate checklist.
  • docs/backlog/ becomes design context, decomposition history, and migration notes. Actionable public tasks move to issues with one owner, acceptance criteria, verification gates, and links back to the relevant design/backlog section.
  • Private unresolved-review work remains in task records until a finding is safe to disclose. Public findings become GitHub issues only after the security policy says disclosure is acceptable; the local task record should then link to the issue instead of restating live public status.

Issue bodies should preserve the review discipline already used locally: problem statement, affected files or trust boundary, design-grounding links, expected proof command, QEMU or host-test acceptance criteria, and explicit non-goals. GitHub Projects should track only status and sequencing fields such as milestone, area, risk, owner, review mode, and blocked-by. The design rationale belongs in docs or issue discussion, not in project-card metadata.

Before converting a backlog slice, maintainers should de-duplicate it against current docs and closed commits. A converted issue should be either:

  • a visible outcome that maps to a roadmap milestone;
  • a review finding with concrete remediation and verification;
  • a scoped implementation task under an accepted proposal; or
  • a documentation correction tied to a known drift point.

Avoid migrating stale checklists verbatim. If an item has already landed, move the historical note to docs/changelog.md or leave it in commit history. If an item is still speculative design, keep it in the proposal/backlog until a maintainer is ready to review implementation.

Communication Channels

Launch with:

  • GitHub Issues for reproducible bugs and accepted work tracking;
  • pull requests for reviewable scoped changes;
  • private security contact from SECURITY.md.

Do not launch with:

  • Discord,
  • Matrix,
  • Slack,
  • public support email,
  • office-hours promises,
  • GitHub Discussions unless maintainers explicitly budget moderation time.

Chat systems create interrupt-driven support load and reward questions that should become docs or issue templates. They should wait until the project has a known moderation policy and enough maintainers to enforce it.

Security Statement

SECURITY.md should state:

# Security Policy

capOS is experimental research software and has not undergone an independent
security audit. Do not use it to protect real secrets, production workloads, or
hostile multi-user environments.

Security reports are still useful. Report suspected vulnerabilities privately
to <security contact>.

Do not open public issues containing exploit details, private keys, tokens,
credential material, or instructions for attacking third-party systems.

At this stage, maintainers do not provide a security-fix SLA. Accepted reports
are triaged based on project relevance, reproducibility, and impact on
documented capOS security boundaries.

The security page should link to:

  • docs/security/trust-boundaries.md,
  • docs/security/verification-workflow.md,
  • docs/tasks/README.md,
  • docs/trusted-build-inputs.md.

It should also distinguish security evidence from security assurance. Current checks, QEMU smokes, Kani proofs, Loom models, fuzz targets, and reviews are useful engineering evidence. They are not an independent audit. The evidence vs. assurance line is the public-facing framing of docs/proposals/security-and-verification-proposal.md; open risks the public claim must remain silent about are tracked in docs/design-risks-register.md.

Repository Hygiene Gates

Before public visibility:

  • add a license file;
  • add CONTRIBUTING.md;
  • add SECURITY.md;
  • add issue templates and a pull request template;
  • add top-level experimental/no-audit wording;
  • split the adventure game server, client, NPC processes, content generator, and proposal/backlog/demo docs out of this repository into a dedicated adventure repository before public visibility (see “Adventure Repository Split” below);
  • rewrite git history into a curated public-import history before publication (see “Git History Rewrite” below); do not publish the current private agent-driven commit log unchanged;
  • run a secret and history scan against the rewritten history, not just the current tree;
  • scan the rewritten history and the public tree for personal identifiers (maintainer GitHub account names, personal cloud project / bucket names, personal home-directory paths, personal email addresses, personal host or user names) and either remove the artifact or replace those values with neutral placeholders or environment-driven configuration;
  • remove or sanitize local operator infrastructure that is not part of the public OS: maintainer-private CI configs, private cloud project references, maintainer-side automation services and scripts, and any other path that exists only because the current maintainer runs the project that way;
  • remove local-only artifacts from the public tree;
  • run the documented baseline checks;
  • create only maintainer-curated public starter issues.

Local-only and generated artifacts need an explicit policy before publication. Source-controlled generated bindings are acceptable when freshness checks remain documented. Generated adventure content moves with the adventure split and is no longer a capOS-repository hygiene concern after the split lands. Local caches, build output, QEMU images, transient manifests, automation logs, maintainer-private cloud infrastructure configs, and automation service units should not be part of a public source snapshot.

Adventure Repository Split

The Local MUD/adventure prototype, NPC-as-process fleet, Aurelian expedition content, save vault work, and contributor-quest framing exist primarily as a shared-service demo and as motivation for service-object capabilities and agent-shell tool surfaces. They are not part of the capability-OS core claim the first public capOS release should defend.

Keeping adventure in the same public repository would:

  • conflate “experimental research OS” with “experimental research game”, making the public scope statement and security boundaries harder to defend;
  • attract game-feature requests, balance debates, and content contributions through the same maintainer queue as kernel/capability/security issues;
  • pull narrative, world-building, and content-generation review onto the same review capacity that should be focused on capability semantics, schema/ABI, security boundaries, and the documented QEMU proofs;
  • expand the public attack surface and the public claim surface beyond what the OS work is ready to defend.

Before the first public capOS release, adventure-specific code, content, generators, proposals, backlog, and demo docs must move to a dedicated adventure repository that depends on capOS as a downstream consumer. In the capOS repository, only the minimum hooks needed for capOS’s own service-object and shared-service-demo proofs may remain, and they must be defensible without reference to game design.

Concretely the split must, at minimum, relocate or remove:

  • demos/adventure-server/, demos/adventure-client/, demos/adventure-content/, demos/adventure-chat-actors/, demos/adventure-npc-shopkeeper/, demos/adventure-npc-wanderer/, demos/adventure-scenario-test/, and any future adventure-named demo crates;
  • tools/adventure-content-gen/ and any adventure-specific generator fixtures or content blobs;
  • adventure-named manifests (for example system-adventure.cue and any derived manifest-adventure.bin / capos-adventure.iso build artifacts), make run-adventure* targets, and harness scripts that exist only to drive adventure demos;
  • adventure-only modes embedded in shared tooling – for example the drive adventure / assert adventure modes in tools/qemu-shell-smoke.sh and any adventure-shaped output handling inside other shared tools/qemu-*-smoke.sh or tools/qemu-*-harness.sh scripts – which must be removed from the core scripts and re-homed in the adventure repository;
  • adventure-specific build/check tooling such as tools/check-generated-adventure-content.sh, the generated-adventure-content-check Makefile target, and any adventure-named recipe stanzas, MANIFEST_SOURCE/MANIFEST_BIN/ISO overrides, and .PHONY entries (run-adventure, generated-adventure-content-check, etc.) that exist only to build or exercise adventure demos;
  • adventure entries in DEPENDENCY_POLICY_MANIFESTS/LOCKFILES, cargo workspace lists, and any other Makefile or CI list that names tools/adventure-content-gen/ or adventure-named crates;
  • docs/proposals/aurelian-frontier-proposal.md, docs/proposals/contributor-quest-mechanics-proposal.md (when its scope is game-shaped), docs/backlog/aurelian-frontier.md, docs/demos/adventure.md, and any other adventure-shaped narrative or content docs;
  • adventure-specific entries in docs/tasks/README.md, docs/roadmap.md, task records, and changelog narrative — keep only items that describe capability-OS invariants the core repository must continue to defend after the split.

The adventure-files/paths/assets prohibition in the history-rewrite gate applies to all of the above. The split is not complete while adventure-only logic still lives in shared tooling under names that do not say “adventure” – for example a generic-looking drive mode that in practice exists only to script adventure transcripts. A reviewed inventory of these shared-tooling crossings is part of the split task and must be cleared before the curated public-import history is built.

The split must not silently weaken capOS’s own proofs. Any adventure-anchored service-object, IPC, save-store, ledger, or chat-identity invariant currently enforced inside kernel/, capos-config/, capos-rt/, init/, or shell/ and exercised only by an adventure demo needs an equivalent non-adventure proof in capOS before the adventure code leaves, or the invariant moves to the new repository together with its proof. A reviewed inventory of these crossings is a prerequisite of the split task, not a follow-up.

The split is a release-time gate, not a routine refactor. Before it lands:

  • the new adventure repository must build against a tagged or pinned capOS reference so the cross-repo dependency direction is verified;
  • adventure-anchored capOS proofs must either be replaced by non-adventure equivalents or be moved to the adventure repository;
  • the public-release readme, security, and roadmap statements must reflect the narrowed capOS scope;
  • the rewritten public-import history (below) must not contain adventure-specific commits, paths, or assets.

Git History Rewrite

The current commit history was produced under a private automation-heavy workflow with high commit volume, mid-task narrative messages, exploratory branches, intermediate worktree state, and adventure-shaped commits that no longer belong in capOS once the adventure split is enforced. Publishing that history unchanged would:

  • expose private workflow detail (worktree names, automation checkpoints, intermediate planning notes) that is not valuable to public readers;
  • make the public claim surface match every speculative direction explored privately, instead of the actual current capability-OS shape;
  • carry adventure-specific paths, assets, and commit messages into the OS repository the split is meant to leave behind;
  • complicate any future secret/history scan by mixing release-relevant history with disposable automation narrative.

Before the first public release the maintainer must produce a curated public-import history. The acceptable approach is one of:

  • a single squashed initial-public-import commit on a fresh public branch, with a short message that describes the project state at publication and links to docs/changelog.md for historical milestone narrative; or
  • a small number of curated, signed commits that group related capability subsystems and proofs in a way that is reviewable by an outside reader, with the same forward link to the changelog.

The rewritten history must:

  • contain no adventure-specific files, paths, generated content, or commit messages;
  • contain no private automation narrative, private worktree names, internal reviewer-only notes, or local-only artifacts;
  • contain no personal identifiers (maintainer GitHub account names, personal home-directory paths, personal cloud project / bucket names, personal email addresses, personal host or user names) in committed files, commit messages, author/committer fields beyond the maintainer’s intended public attribution, or generated content;
  • pass a secret and history scan and a personal-identifier scan over the rewritten commits, not just the current tree;
  • carry an explicit license and attribution from the first public commit;
  • preserve docs/changelog.md as the narrative record of completed milestones and reviews so historical context is not lost when the raw commit history is collapsed.

Force-pushing a rewritten history over an already-public branch is not the intended use of this gate. The rewrite happens before the repository becomes public. The current private repository may continue to exist as an internal mirror; only the rewritten history is published.

Repository Composition

The adventure split is the first concrete instance of a wider rule: the public capOS repository should defend a narrow, recognizable claim, and non-core tracks should live in downstream repositories that depend on capOS rather than ride along inside it. The detailed scope rule, the full list of split candidates (whitepaper, public website, userspace network stack, production remote-access services, protocol stacks, language runtimes, GPU, agent shell, cloud images, volume encryption), the when-to-split criteria, the cross-repository mechanics, and the intended cap-os-dev GitHub organization placement live in docs/proposals/repository-composition-proposal.md.

For public-release readiness the only repository-composition gates are:

  • the adventure split (above) must be complete;
  • the rewritten public-import history (above) must respect the core/sibling scope rule defined in docs/proposals/repository-composition-proposal.md, so that no sibling-bound paths or assets enter the public capOS history;
  • public-facing READMEs, security pages, and roadmap statements must describe the narrowed core scope rather than the pre-split private workspace;
  • if the GitHub organization move to cap-os-dev happens before public release, the public-import history is published into that organization rather than into a personal account.

Other splits described in the Repository Composition proposal happen on their own readiness timelines and are not public-release prerequisites.

Launch Phases

Phase A: Quiet Source-Visible Launch

  • Repository is public.
  • Issues use templates.
  • Discussions and chat remain disabled.
  • PRs are accepted only for narrow fixes or maintainer-curated work.
  • README and security pages make the experimental/no-audit status explicit.

Phase B: Curated Contribution Phase

  • Publish a small list of tasks maintainers are willing to review.
  • Add good-first-issue only to tasks with enough context and an expected verification command.
  • Close unrelated feature requests instead of expanding the backlog.
  • Move repeated valid questions into docs.

Phase C: Broader Community Phase

Only after Phase A and B produce manageable signal:

  • consider GitHub Discussions;
  • consider a public chat room;
  • broaden accepted issue categories;
  • publish a maintainer rotation or moderation policy if more maintainers exist.

This phase is optional. capOS can remain source-visible and selectively contribution-friendly indefinitely.

Phase D: Hosted Public Demo

A public WebShellGateway or Adventure Game deployment is a separate operational milestone, not a side effect of making the source repository public. A Reddit-scale traffic spike should be assumed before any public link is posted.

The hosted demo must be treated as an untrusted public service:

  • demo sessions use guest-only or demo-only profiles;
  • no public demo path grants operator shell authority;
  • no public demo shell receives raw BootPackage, broad ProcessSpawner, provider-token, model-admin, storage-admin, or unrestricted network authority;
  • each browser session gets isolated caps, bounded resources, and deterministic teardown on logout, tab close, timeout, crash, or quota exhaustion;
  • sessions have maximum wall-clock duration, idle timeout, input/output byte limits, process/cap/resource quotas, and bounded transcript storage;
  • per-IP, per-session, and per-account rate limits exist before launch;
  • queueing and overload pages are preferred to silently starting unbounded VMs or capOS sessions;
  • maintainers have a kill switch that disables new sessions without affecting repository access;
  • logs are redacted and retention is documented;
  • the public page states that the demo is best-effort, may disappear, has no persistence guarantee, and is not a support channel.

Adventure Game traffic adds game-specific gates:

  • anonymous players cannot mutate authoritative public world state;
  • public profiles, rewards, and contributor-quest identity links are opt-in;
  • saved state, if offered, goes through the reviewed AdventureProfileService, AdventureLedger, and AdventureSaveStore boundaries;
  • public multiplayer uses service-created player objects, not user-selected identity badges;
  • NPC or agent-assisted game features hold only narrow per-NPC or demo caps;
  • abuse reports and moderation controls exist before public chat-like features are exposed.

The first hosted demo should be capacity-limited and disposable. It should not share credentials, sessions, storage, or authority with maintainer-operated development environments.

Public Claim Checklist

Public-facing docs should avoid claiming:

  • production readiness,
  • real-hardware support beyond documented experiments,
  • secure remote access,
  • independently audited security,
  • compatibility with ordinary OS workloads,
  • stable ABI,
  • stable contributor API,
  • support for every future roadmap track.

They may claim only what current docs and verification support:

  • x86_64/QEMU-focused research OS;
  • typed capability interfaces;
  • capability-ring transport;
  • current shell/login demos;
  • selected service demos;
  • documented verification commands;
  • known limitations and future tracks.

Design Grounding

Grounding files for this proposal:

  • README.md
  • docs/tasks/README.md
  • REVIEW.md
  • docs/roadmap.md
  • docs/design-risks-register.md
  • docs/proposals/aurelian-frontier-proposal.md
  • docs/proposals/boot-to-shell-proposal.md
  • docs/proposals/contributor-quest-mechanics-proposal.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/proposals/mdbook-docs-site-proposal.md
  • docs/proposals/repository-composition-proposal.md
  • docs/proposals/resource-accounting-proposal.md
  • docs/proposals/security-and-verification-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/user-identity-and-policy-proposal.md
  • docs/backlog/aurelian-frontier.md
  • docs/backlog/runtime-network-shell.md
  • docs/security/trust-boundaries.md
  • docs/security/verification-workflow.md
  • docs/trusted-build-inputs.md

No docs/research/ report is directly applicable. This proposal is release governance and maintainer-load policy layered on existing project docs, not a new OS architecture or runtime design.

Proposal: Repository Composition

How capOS should be split across repositories so that the public capability-OS claim, the kernel review queue, and the security/release cadence stay recognizable as the project grows beyond a single private workspace.

Purpose

capOS currently lives in a single private repository that mixes the kernel, the userspace runtime, the native shell, generic capability/IPC/ring demos, the Aurelian Frontier game, an academic whitepaper draft, the public docs site sources, and proposals for protocol stacks, language runtimes, GPU support, cloud images, and other future tracks.

That packing is acceptable while the project is private and agent-driven: one workspace, one review loop, one history. It is not the right shape once capOS becomes public. A single-repository public capOS would conflate unrelated scopes, drag unrelated tracks through one review queue, attach the OS security posture to product-shaped surfaces, and force unrelated release cadences to share one tag stream.

This proposal defines:

  • what the public capOS core repository should defend (the scope rule);
  • what should ship in sibling repositories that depend on capOS;
  • the criteria for when a track is ready to split;
  • the cross-repository mechanics that keep splits honest.

It generalizes the “Repository Hygiene Gates” of docs/proposals/public-release-boundaries-proposal.md. The adventure split and the curated git-history rewrite remain release gates in that proposal; this proposal explains why those gates exist and how the same rule applies to other tracks over time.

Non-Goals

  • This proposal does not require splitting any track on a deadline beyond the explicit release gates already named in docs/proposals/public-release-boundaries-proposal.md. It defines a rule, not a calendar.
  • It does not redesign the capability model, schema, or kernel/runtime boundary. Those are owned by the relevant subsystem proposals.
  • It does not propose a multi-organization governance model. capOS may remain a single-maintainer or small-team project across multiple repositories.
  • It does not propose mirroring sibling repositories back into capOS. Once a track has split, capOS does not re-vendor it.
  • It does not promise a public chat or coordination forum for cross-repo work; that follows the launch phases in the public-release proposal.

Scope Rule For The Core Repository

The capOS core repository defends a narrow, recognizable claim. A track belongs in the core repository when at least one of the following is true:

  • removing it would weaken a capability-OS invariant the kernel or runtime currently enforces;
  • removing it would delete a proof the documented review process relies on;
  • it is part of the minimum surface required to boot capOS in QEMU and exercise the documented capability/IPC/ring/scheduling/security invariants.

A track does not belong in the core repository when its primary purpose is product, protocol, or language-runtime work that happens to run on capOS, even when it currently shares a workspace with the kernel.

In practice the core repository should contain:

  • the schema definitions and the generated bindings the kernel and runtime rely on (schema/, capos-abi/, capos-lib/, capos-config/);
  • the kernel itself, including arch-specific code under kernel/src/arch/;
  • the userspace runtime contract that consumes the schema (capos-rt/, init/, shell/);
  • the manifest and code-generation tooling needed to boot and build capOS (tools/mkmanifest/, tools/capnp-build/);
  • demos that exist only to exercise core capability/IPC/ring/scheduling/ trust-boundary invariants, not application-shaped product surfaces;
  • the security boundary, verification workflow, trusted-build-input, panic-surface inventory, and authority-accounting/transfer design documents;
  • the core proposals describing the OS itself: capability model, IPC, error handling, scheduling, SMP, networking architecture (high level), storage and naming (high level), service architecture, security and verification, formal MAC/MIC, live upgrade design, threading, key-management abstractions, user identity and policy abstractions;
  • docs/changelog.md, docs/roadmap.md, docs/tasks/README.md, REVIEW.md, and migrated review-finding task records for the narrowed core scope;
  • the documentation site sources that describe the core scope. The deployment of those sources can be a sibling concern (see “Public Website And Hosted Demos” below); the sources themselves stay with the OS they describe.

Tracks That Should Eventually Move Out

The following tracks already exist or are planned, and each one is or will become a candidate for a sibling repository. Each carries its own scope statement, security posture, maintainer load profile, and release cadence that should not be merged into capOS’s core scope statement.

The list is descriptive, not a queue. A track moves only when the split criteria below are satisfied for that track.

Adventure Ecosystem

Server, client, NPC processes, content generator, content blobs, adventure-named manifests/run targets, adventure proposals/backlog/demo docs, contributor-quest mechanics. The split is already a release gate in docs/proposals/public-release-boundaries-proposal.md. The dedicated sibling is capos-adventure. Any capOS invariant currently exercised only by an adventure demo needs a non-adventure equivalent in capOS, or moves into capos-adventure together with its proof.

Whitepaper And Academic Publication

papers/schema-as-abi/ is a Typst project, and docs/paper/plan.md, docs/paper/outline.md, and docs/paper/evidence-gaps.md are paper planning documents. Academic publication has its own review cycle, publication venue, citation cadence, and corrections process that should not share the OS’s tag stream. A capos-paper repository can cite capOS by tag or commit, track evidence-gap closure, and run paper-specific build/CI without expanding the OS repository’s review surface. docs/changelog.md and proof-evidence narrative remain in capOS so the paper has a stable reference target.

Public Website And Hosted Demos

The public landing page, marketing-shaped copy, hosted-demo deployment scripts (Cloudflare Pages glue, container images, CI for the public site, hosted WebShellGateway and adventure-demo deployment) are operational concerns with public-traffic implications. They should not share a release cadence with kernel changes, and their incident response must not pull on kernel review capacity.

mdBook content describing the OS itself stays in capOS. The deployment of that content as a public site can move to a sibling repository (for example capos-site) that depends on the capOS docs sources by tag or commit. The hosted public WebShellGateway or adventure-demo deployment follows Phase D of the public-release proposal and lives outside capOS.

Userspace Network Stack And NIC Drivers

The current QEMU smoke path keeps smoltcp, virtio-net, the line discipline, and the Telnet IAC filter inside kernel/. Once the userspace driver authority gate (docs/dma-isolation-design.md) lands and the userspace TCP/IP stack and NIC drivers leave the kernel (docs/proposals/networking-proposal.md Phase C), the resulting userspace components are large enough and carry enough independent attack surface to live in capos-net. capOS keeps the kernel-side DMA/MMIO/interrupt authority gates and the schema/ABI of the network capabilities; the implementation of the stack is a downstream consumer.

Production Remote-Access Services

The host-local Telnet demo is research evidence for the TerminalSession / SessionManager / AuthorityBroker / RestrictedShellLauncher boundary; it stays in capOS. The host-local SSH Shell Gateway research demo similarly stays as long as it is a host-local research artifact under docs/proposals/ssh-shell-proposal.md.

The production successors – a real OpenSSH-protocol gateway with production host-key management, persistent authorized-key/account storage, channel policy, audit, and remote-traffic threat model, and any production WebShellGateway with browser-side session UI and public moderation policy – are product-shaped services. They belong in dedicated repositories (for example capos-ssh-gateway, capos-web-shell) once they outgrow the host-local research surface.

Protocol Stacks Built On Key-Management Primitives

TLS/X.509, OIDC/OAuth2, ACME, OCSP, CT log handling, DPoP, workload identity federation, and similar large protocol surfaces described in docs/proposals/certificates-and-tls-proposal.md and docs/proposals/oidc-and-oauth2-proposal.md should ship as sibling repositories (for example capos-tls, capos-oidc) consuming the capOS key-management primitives. Their CVE response, dependency surface, and review queue should not be merged into the OS core’s. capOS keeps the abstract SymmetricKey, PrivateKey, KeySource, KeyVault, and audit primitives from docs/proposals/cryptography-and-key-management-proposal.md; the protocol stacks are downstream consumers.

Language Runtimes And Toolchain Ports

Go (GOOS=capos), libc / libcapos, WASI, Lua, and any future language runtime port belong in dedicated repositories (capos-go, capos-libc, capos-wasi, capos-lua, …). Language-runtime releases follow upstream language cadence, and porting work should not block kernel review. The capOS userspace ABI documented in capos-rt/, capos-abi/, and the schema is the contract these ports target.

GPU And CUDA Capability Integration

The GPU capability work in docs/proposals/gpu-capability-proposal.md brings a large external driver and toolkit dependency surface, vendor runtime distribution constraints, and hardware-specific testing needs. When implementation begins it belongs in a dedicated capos-gpu repository. capOS keeps the abstract device-authority gate and the relevant capability schema; vendor-specific glue and toolkit packaging is downstream.

LLM And Agent Runtime

The agent shell tool runner, model bindings, on-ISO local-model packaging, and provider-specific glue from docs/proposals/llm-and-agent-proposal.md and docs/proposals/realtime-voice-agent-shell-proposal.md carry independent supply-chain, content-policy, and operational concerns. Provider TOS, model weight redistribution, and content-safety reviews do not belong on the kernel review queue.

The shell capability and authority model – including how the agent shell’s per-tool consent/step-up/forbidden modes consume broker-issued capabilities – stays in capOS. The agent runner itself, the model bindings, the on-ISO local-model packaging, and the provider glue ship in a dedicated repository when implementation begins (for example capos-agent-shell).

Cloud Images And Instance Bootstrap

Cloud VM image building, AWS/GCP/Azure packaging, NVMe and cloud-NIC integrations, and the cloud-metadata bootstrap from docs/proposals/cloud-deployment-proposal.md and docs/proposals/cloud-metadata-proposal.md are operational image-building concerns with cloud-vendor dependency exposure. They should live in a capos-cloud-images repository that consumes capOS releases as inputs.

Volume Encryption And KMS Integration

The encryption-at-rest work from docs/proposals/volume-encryption-proposal.md will pull in cloud KMS clients, key-rotation policy, and cryptographic dependency exposure that should ship in a dedicated capos-volume-crypto (or similarly named) repository. The abstract key-management contracts and the storage-side authority gates remain in capOS.

Hosted Demo Tooling, Logs, And Operational Glue

Anything that is part of operating a public capOS deployment – session-quota policy, browser-side WebShellGateway UI, public landing copy, hosted log/metric pipelines, abuse-mitigation glue, public moderation tooling – is operational rather than OS work. It should live with the relevant sibling (for example public website or WebShellGateway service repositories) rather than inside capOS.

Tracks That Stay In The Core Repository

These tracks are intrinsic to the OS claim and should not be considered split candidates:

  • the kernel, including arch-specific code under kernel/src/arch/;
  • the schema definitions and generated bindings;
  • the userspace runtime (capos-rt), init, the native shell, and the manifest tools needed to boot capOS;
  • demos that exercise core capability/IPC/ring/scheduling invariants: capset-bootstrap, console-paths, ring-corruption, ring-reserved-opcodes, ring-nop, ring-fairness, endpoint-roundtrip, ipc-server, ipc-client, terminal-session, terminal-stranger, tls-smoke (the TLS userspace runtime smoke, not protocol stack), virtual-memory, timer-smoke, timer-flood, ipc-zerocopy-demo, and any future demo that exists only to exercise a core capability invariant;
  • the chat demo as a generic IPC and service-object example may stay, but only in a form that defends a capability-OS invariant. Game-shaped chat features (named NPC actors, contributor-quest framing, adventure-tied identity flows) follow the adventure split;
  • the security boundary, verification workflow, trusted-build-input, panic-surface inventory, authority-accounting/transfer design, and DMA-isolation design documents;
  • the core proposals listed in the “Scope Rule” section above;
  • docs/changelog.md, docs/roadmap.md, docs/tasks/README.md, REVIEW.md, and migrated review-finding task records for the narrowed core scope;
  • docs/research/, because each research note grounds a current capability-OS design decision; research notes that grow into full proposals follow the relevant subsystem.

When To Split A Track

A track should not be split prematurely. While a track lives only in proposal documents or a small experimental crate, the friction of a sibling repository (separate CI, separate review setup, separate license and security policy, cross-repo version pinning) outweighs the benefit.

The right time to split is when all of the following are true for the track:

  1. Independent product or protocol shape. The track has a recognizable purpose that is not “exercise a capOS invariant”. For example, a TLS stack, a Go port, a hosted public demo, or a game.
  2. Non-trivial implementation surface. The track draws review attention away from kernel review or carries an independent dependency surface large enough to need its own dependency-policy/audit posture.
  3. Defensible cross-repo dependency direction. The sibling can build against a tagged or pinned capOS reference without modifying capOS internals; the inverse direction (capOS depending on the sibling for a core invariant proof) is not required.
  4. Independent release cadence is desirable. The track wants its own tag stream, security advisory channel, or upstream synchronization schedule.

When any of these is missing, the track stays in the core repository or remains a proposal until it is ready.

A useful counter-test: would a public reader looking at the core capOS README, security policy, and release notes be misled by the presence of this track? If yes, that is a sign the scope statement is being stretched and the track is overdue to split. If a reader would not notice, the benefit of splitting is small.

Cross-Repository Mechanics

When a sibling repository is created, the following mechanics apply.

GitHub Organization Placement

The capOS core repository currently lives under a personal GitHub account. Once one or more siblings exist, hosting them all under the same personal account conflates personal projects with the capOS project, makes maintainer-set changes harder, and gives a confusing public landing surface for readers looking for the project.

The intended landing place for capOS and its siblings is a dedicated GitHub organization, cap-os-dev. Concretely:

  • the curated public-import history defined by the history-rewrite gate in docs/proposals/public-release-boundaries-proposal.md is published as a fresh cap-os-dev/capos repository when the organization is used. A GitHub repository transfer or fork from the current private capOS repository is not the intended mechanism, because it would carry the existing private uncurated history, branches, refs, and intermediate automation state into the public organization. The current private repository may continue to exist as an internal mirror after publication, but it is not the same repository as the public one;
  • siblings are created under cap-os-dev/<sibling> rather than under any individual maintainer’s account; for example cap-os-dev/capos-adventure, cap-os-dev/capos-paper, cap-os-dev/capos-site, cap-os-dev/capos-net, cap-os-dev/capos-ssh-gateway, cap-os-dev/capos-web-shell, cap-os-dev/capos-tls, cap-os-dev/capos-oidc, cap-os-dev/capos-go, cap-os-dev/capos-libc, cap-os-dev/capos-wasi, cap-os-dev/capos-lua, cap-os-dev/capos-gpu, cap-os-dev/capos-agent-shell, cap-os-dev/capos-cloud-images, cap-os-dev/capos-volume-crypto;
  • repository names listed in this proposal and in docs/proposals/public-release-boundaries-proposal.md are intent names, not reservations. Final naming happens at the moment a sibling is actually created and may collapse, rename, or skip entries based on what the project actually needs.

Using a dedicated organization also makes the public-release maintainer boundaries easier to enforce: organization-level security policy, issue-template defaults, branch-protection settings, and team membership apply consistently across capOS and its siblings without per-repository drift.

The org adoption is not a blocker for the public-release hygiene gates: the adventure split and history rewrite from docs/proposals/public-release-boundaries-proposal.md are the release-blocking gates, and they can land regardless of whether the public-import history is first published under cap-os-dev/capos or temporarily under another account. cap-os-dev is, however, the recommended public landing surface, and once it is used, public-facing materials should point at the organization rather than at any individual maintainer’s account.

Dependency Direction

  • The sibling depends on capOS by tag, commit, or other pinned reference; it does not depend on capOS by path-dependency into a private workspace.
  • capOS does not depend on the sibling for any core invariant or proof. capOS may declare an optional release artifact from a sibling (for example a packaged adventure demo image) when an end-to-end story requires it, but the artifact must be a declared release input, not a path link.
  • When a sibling demonstrates a capOS invariant by running on it, the sibling records the capOS reference (tag or commit) it was tested against, and the sibling carries the proof, not capOS.

Per-Repository Hygiene

  • Each sibling repository owns its own license, CONTRIBUTING.md, SECURITY.md, issue/PR templates, and review-capacity statement, even when the initial maintainer set overlaps with capOS.
  • Each sibling repository owns its own scope statement and public claim list. Public capOS claims do not extend over sibling content; sibling claims do not extend over capOS.
  • Generated artifacts, content blobs, and large binaries belong with the sibling that owns the source they describe, never with capOS unless capOS itself produced them.

Documentation Location Rule

  • Documentation about a sibling lives in the sibling. capOS may keep a short pointer in docs/proposals/index.md, the README, or a release-notes section so readers can find the sibling, but it does not duplicate sibling-internal proposals, backlog, or roadmap state.
  • Cross-repo planning that is privately coordinated must still respect the public-release rule that “if a task is public, its active status lives in one place”; capOS does not maintain a public mirror of sibling task state.

Security Coordination

  • During a transition phase, security reports affecting capOS and a sibling are coordinated through the capOS SECURITY.md contact, with downstream sibling SECURITY.md files pointing back to that contact until the sibling has its own staffed response.
  • Once a sibling has a staffed security response, its SECURITY.md becomes authoritative for sibling-only issues, and only cross-cutting reports require coordination.
  • Neither capOS nor a sibling promises a security-fix SLA at the research-software stage; the capOS security statement language remains the baseline.

Release And Tagging

  • Each sibling owns its own release cadence and tag stream.
  • A sibling release that requires a specific capOS revision pins it explicitly in the sibling’s release notes.
  • capOS releases do not promise sibling availability or compatibility beyond “the schema and userspace ABI used by sibling X at tag Y are what capOS at tag Z provides”.

History At Split Time

  • A split should not silently remove evidence. Before a sibling becomes the authoritative location for a track, the relevant proofs, demos, and documentation must be present and reviewed in the sibling.
  • The capOS history rewrite specified in docs/proposals/public-release-boundaries-proposal.md does not need to preserve the pre-split track history inside capOS. The sibling’s history begins at split time with whatever curated initial state the sibling chooses to publish.
  • The capOS docs/changelog.md continues to record completed capability-OS milestones; sibling milestones are recorded in the sibling.

Migration Approach

The split is gradual and gated by readiness, not by a release calendar beyond the explicit public-release prerequisites.

The intended order is:

  1. Adventure ecosystem – gated by the public-release adventure-split gate. This is the first concrete instance of the rule and produces a reusable pattern (cross-repo dependency direction, sibling hygiene, documentation pointers) for later splits.
  2. Whitepaper / academic publication – when the paper is ready to accept public review, or when its evidence-gap log starts to drive review cycles independent of the kernel review queue.
  3. Public website and hosted-demo deployment – when a hosted demo becomes a real operational milestone (Phase D of the public-release proposal) rather than a research artifact.
  4. Userspace network stack and NIC drivers – after the userspace driver authority gate lands and the in-kernel networking surface shrinks to the kernel-side authority gates.
  5. Production remote-access services, protocol stacks, language runtimes, GPU, LLM/agent, cloud images, volume encryption – as their implementations begin and meet the split criteria.

Splits earlier in this list set the precedent for splits later in the list. If the adventure split is messy, later splits should learn from it before being attempted.

Anti-Goals

  • Do not split the kernel. The kernel is one repository. Architecture layers (kernel/src/arch/<arch>/) stay inside capOS; aarch64 and other ports stay in-tree. The split rule is about distinguishing the OS from applications, protocols, and language runtimes, not about cutting the kernel into micro-repos.
  • Do not split userspace runtime internals. capos-rt, init, and the native shell stay together because they share the userspace ABI contract.
  • Do not vendor sibling repositories back into capOS. Once a track has split, capOS does not re-import it as a path or vendored copy. Cross-repo coordination uses tags and pinned references, not vendoring.
  • Do not split for marketing reasons alone. The split criteria are about protecting review capacity, security posture, and the public scope statement. Splitting only to project a larger ecosystem surface area without staffed maintenance is not allowed.
  • Do not block on a perfect split plan. A track that meets the split criteria can be moved with the minimum mechanics described above. Cross-repo mechanics will improve incrementally; waiting for an ideal model before any split is its own failure mode.

Open Questions

  • Where should the chat demo end up after the adventure split? It is partly generic IPC scaffolding and partly application-shaped (chat rooms, message history). The current intent is that a generic capability-IPC chat surface stays in capOS as a service-object proof, while game-shaped chat features follow adventure. The exact line is not yet drawn.
  • How should docs/research/ be treated long term? Each note grounds a current design decision, so it stays in capOS. If research notes proliferate after public release, a curated docs/research/index.md may be enough to keep them navigable without splitting them out.
  • Should the mdBook docs sources and the docs site deployment be in the same repository or split? The current intent is that the sources stay in capOS while the deployment can move to a sibling. Whether that split is worth doing before a hosted demo exists is open.
  • How should cross-repo CI evidence be presented when a paper or a service repository wants to cite a capOS proof run? A simple “tested against capOS commit X” record is the baseline; richer attestation can be added later if the project needs it.
  • When is the right moment to publish a sibling’s first release? Sibling-internal readiness criteria belong in the sibling; capOS does not gate sibling releases beyond the cross-repo mechanics described here.

Design Grounding

Grounding files for this proposal:

  • README.md
  • docs/tasks/README.md
  • REVIEW.md
  • docs/roadmap.md
  • docs/changelog.md
  • docs/proposals/public-release-boundaries-proposal.md
  • docs/proposals/aurelian-frontier-proposal.md
  • docs/proposals/contributor-quest-mechanics-proposal.md
  • docs/proposals/networking-proposal.md
  • docs/proposals/ssh-shell-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/boot-to-shell-proposal.md
  • docs/proposals/cloud-deployment-proposal.md
  • docs/proposals/cloud-metadata-proposal.md
  • docs/proposals/cryptography-and-key-management-proposal.md
  • docs/proposals/certificates-and-tls-proposal.md
  • docs/proposals/oidc-and-oauth2-proposal.md
  • docs/proposals/llm-and-agent-proposal.md
  • docs/proposals/realtime-voice-agent-shell-proposal.md
  • docs/proposals/gpu-capability-proposal.md
  • docs/proposals/go-runtime-proposal.md
  • docs/proposals/userspace-binaries-proposal.md
  • docs/proposals/volume-encryption-proposal.md
  • docs/proposals/storage-and-naming-proposal.md
  • docs/proposals/security-and-verification-proposal.md
  • docs/proposals/mdbook-docs-site-proposal.md
  • docs/security/trust-boundaries.md
  • docs/security/verification-workflow.md
  • docs/dma-isolation-design.md
  • docs/trusted-build-inputs.md

No docs/research/ report is directly applicable. This proposal is project-composition policy layered on existing capOS architecture, not a new OS architecture or runtime design.

Proposal Group Archive

This page is retained as a compact grouping aid for older links and sidebar navigation. The canonical status table is Proposal Index; update that page first when a proposal changes role.

The public sidebar now nests proposal documents under the proposal index instead of exposing every long-form design page as a top-level entry.

Active Support

ProposalStatusPurpose
mdBook Documentation SitePartially implementedDefines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages.

Future Runtime And Deployment

ProposalStatusPurpose
Go RuntimeFuture designPlans a custom GOOS=capos userspace port and runtime services for Go programs.
Lua ScriptingPartially implementedDefines Lua as a capability-scoped userspace runner with curated libraries and exact grants. Phase 0 and Phase 1 host bindings are in tree; Phase 2+ remains future work.
Cloud MetadataFuture designDescribes cloud bootstrap inputs and manifest deltas without importing cloud-init.
Cloud DeploymentPartially implementedRecords QEMU boot, ACPI/PCI/MSI-X discovery, the landed cloudboot image/harness, and the first GCP imported-image serial-console boot proof. Provider NIC/storage drivers, cloud clocking, AWS/Azure proofs, and aarch64 deployment remain future work.
Browser/WASMFuture designExplores a browser-hosted capOS model using WebAssembly and workers.

Future Security, Policy, And Lifecycle

ProposalStatusPurpose
User Identity and PolicyPartially implementedDefines user/session identity and policy layers over capability grants. Current implementation covers anonymous/operator/guest UserSession metadata, bootstrap credential/session flows, broker-issued shell bundles, and seed-account configuration; durable accounts, external bindings, session revocation, quotas, and broader ABAC/MAC remain future work.
Cryptography and Key ManagementFuture designDefines key, signing, encryption, and vault capabilities for later security services.
Certificates and TLSFuture designDefines X.509, trust store, ACME, and TLS configuration capabilities.
OIDC and OAuth2Future designDefines federated login, OAuth2 clients, token capabilities, and broker integration.
Volume EncryptionFuture designDefines encryption-at-rest for system and user volumes.
System MonitoringFuture designDefines scoped observability capabilities for logs, metrics, traces, health, status, crash records, and audit.
Formal MAC/MICFuture designDefines a formal access-control and integrity model for later proof work.
Live UpgradeFuture designDesigns service replacement while preserving handles, calls, and authority.
GPU CapabilityFuture designSketches isolated GPU device, memory, and compute authority.

Future Domains

ProposalStatusPurpose
Language Models and Agent RuntimeFuture designDefines model, embedding, and agent-runner capabilities.
Realtime Voice Agent ShellFuture designExtends the agent-shell path for realtime voice and media sessions.
capOS As A Robot BrainFuture designDefines capability-oriented robotics service graphs and actuator boundaries.
Contributor Quest MechanicsFuture designDefines contribution-linked game badges and bounded perks.
Public Release and Maintainer BoundariesFuture designDefines public release posture and maintainer-load boundaries.

Rejected Or Superseded

ProposalStatusPurpose
Endpoint Badges as Service IdentityRejectedPost-mortem for the seL4-style endpoint badge identity model that was superseded by Service Object Capabilities, then by Session-Bound Invocation Context.
Service Object CapabilitiesSupersededHistorical service-minted object capability model; the landed synthetic routing/lifecycle proof remains low-level coverage, but the implemented replacement is Session-Bound Invocation Context.
Cap’n Proto SQE EnvelopeRejectedRecords why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves.
Sleep(INF) Process TerminationRejectedRecords why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work.

Rejected Proposal: Endpoint Badges as Service Identity

Status

Rejected. This was the short-lived seL4-style model where a capability hold edge carried a u64 badge and endpoint servers used that badge as the service-visible caller identity.

The model was superseded by Service Object Capabilities, which reframed the badge field as an opaque receiver selector owned by a service object capability. That proposal is also superseded: the active direction is Session-Bound Invocation Context, where each process has one immutable session context and endpoint calls expose privacy-preserving caller-session metadata instead of caller-selected badges or service-object identity migration.

This document records what badges were, how they were intended to be used, what was implemented, and why the design was rejected.

Proposal

Add a word-sized badge to each capability hold edge and deliver that value to an endpoint server whenever the holder invokes the endpoint. Multiple clients could therefore share one endpoint object while the server still distinguished them:

endpoint object
  client cap hold badge 100 -> chat participant 100
  client cap hold badge 200 -> chat participant 200
  client cap hold badge 302 -> adventure player 302

The model came from seL4’s endpoint badge and mint pattern. A trusted holder of an endpoint owner capability could mint differently badged client facets for children or services. Copy and move transfer preserved the badge, so delegation kept the same service-visible identity unless a trusted mint path created a fresh one.

Intended Use

Badges were intended to solve a real early shared-service problem: chat, adventure, stdio bridges, and endpoint smokes needed more than one logical client on a resident service endpoint. Creating one kernel endpoint per client was unnecessary overhead for the demo stage, and putting a caller name or role in request bytes would have been trivial to spoof.

The intended rules were:

  • a badge is not a generic rights bitmask;
  • a badge is hold-edge metadata, not part of the endpoint object;
  • endpoint CALL delivery reports the invoked hold badge to the server;
  • copy and move transfer preserve the badge;
  • raw spawn grants preserve the source badge;
  • endpoint owners and ProcessSpawner-created parent endpoint result facets may mint a requested child client badge;
  • delegated client facets may be passed on only with the same badge.

Under that model, a chat server could key membership by badge and an adventure server could key per-player room/inventory state by badge. The badge was meant to be server-visible caller identity, not a user-facing permission flag.

Implementation Specifics

The concrete implementation landed in several steps:

  • Commit 3ee5240 (feat: propagate endpoint capability badges, 2026-04-22) added CapRef.badge to the manifest schema, parsed optional CUE badge fields, stored the value in CapHold, and changed endpoint CALL dispatch so call.badge came from the invoked capability slot. The cross-process IPC smoke asserted a nonzero badge on RECV and RETURN completions.
  • Commit df0d140 (feat: add spawn grant badge attenuation, 2026-04-22) added CapGrant.badge to the ProcessSpawner ABI. Raw grants failed closed if the requested badge differed from the source hold. ClientEndpoint grants could mint the requested badge only from an endpoint owner source. The init spawn proof printed [init] Spawn badge attenuation ok. after exercising the path.
  • Commit 2face05 (demos: extract badged endpoint service loop, 2026-04-24) extracted serve_badged_endpoint into demos/service-common/. The helper performed endpoint RECV, released unexpected transferred caps, decoded params, and called service handlers as handle_request(state, badge, method_id, params).
  • Chat and adventure used that helper to route per-client service state by badge. Manifest examples such as system-chat.cue and system-adventure.cue carried explicit badge values for shared-service clients and NPC/client identities.
  • Commit 3e59540 (fix: narrow endpoint result badge minting, 2026-04-25) stopped treating every endpoint ResultCap as trusted badge minting authority. Only endpoint owners and ProcessSpawner-created parent endpoint result facets retained mint authority; ordinary IPC result transfers stayed ResultCap and could not become a badge-mint path.
  • Commit f955cd5 (fix: reject delegated endpoint relabeling, 2026-04-25) fixed the first containment failure: already-delegated client facets could no longer request a different badge through ClientEndpoint spawn grants.
  • Commit a64c216 (spawn: preserve delegated endpoint identities, 2026-04-25) fixed the shell/defaulting case. Omitted shell badge syntax began preserving the source badge via PRESERVE_CLIENT_ENDPOINT_BADGE = u64::MAX, while explicit relabel attempts and low-level legacy badge-zero encodings failed closed for delegated client facets.

The final contained implementation still has a badge field in several ABI and implementation structs. Current docs call it legacy receiver metadata or a receiver selector when it is still needed for low-level tests, service-object history, or non-identity parameters such as scoped TCP listen ports. It is no longer the target identity model.

What Failed

The design gave too much meaning to an untyped number. Even when the kernel preserved badges across copy/move transfer, shell and spawn surfaces could still turn a caller-selected integer into service-visible identity unless every grant path handled mint authority perfectly.

The concrete failure was delegated endpoint relabeling. A shell holding a delegated chat client endpoint could request:

run "chat-client" with { chat: client @chat badge 200 }

Before the containment fixes, that could produce a child client facet whose service-visible identity differed from the delegated source. Omitted badge syntax was also dangerous because the old parser defaulted it to badge 0, which was another relabeling path for a nonzero source client.

The bug was narrow, but it exposed the wrong abstraction. The server was being asked to treat a generic transport field as identity. The kernel could enforce some mint rules, but the meaning of 100, 200, 302, or 0 lived in each service by convention. That made ordinary shell syntax look like an authority selector and made future network-backed shell exposure too easy to get wrong.

Rationale For Rejection

Endpoint badges are a useful low-level routing mechanism, but not a good service identity model for capOS.

Problems:

  • Caller-selected identity pressure. The natural user-facing syntax was client @service badge N, which invited users and tests to select service identity directly.
  • Untyped service semantics. The same u64 field could mean a chat member, an adventure player, an NPC, a stdio bridge, a TCP port, or a test fixture. The kernel could not validate those meanings.
  • Policy by convention. Each service had to remember whether a badge was a participant, a session, a role, a receiver cookie, or just a transport tag.
  • Delegation hazards. Copy/move propagation was straightforward, but spawn minting needed subtle distinctions between endpoint owners, ProcessSpawner-created parent endpoint result facets, ordinary IPC result caps, and delegated client facets.
  • Bad privacy shape. A server-visible endpoint field encouraged exposing stable caller identity by default, while the active model wants privacy-preserving session references and explicit bounded disclosure.
  • Poor long-term composition. Cross-service and network-transparent designs need typed roots/facets, session context, transfer policy, and disclosure policy. A single badge value cannot carry those contracts.

The accepted historical fix was first to contain relabeling, then to stop treating badges as the target architecture. Service Object Capabilities moved identity into service-minted object capabilities and receiver selectors. That was still too much machinery for normal workload identity and was replaced by Session-Bound Invocation Context.

Replacement Direction

The active replacement is:

  • capabilities answer whether the process may invoke a service at all;
  • each process has exactly one immutable session context;
  • endpoint delivery carries privacy-preserving caller-session metadata by default;
  • richer subject disclosure requires an explicit request and a matching broker/service disclosure scope;
  • shared services key user-facing state by broker-granted service capabilities plus service-scoped session references, not by caller-selected badges.

Legacy badge fields may remain as internal receiver metadata, hostile-test fixtures, or non-identity configuration encodings until the corresponding code paths are migrated. They should not appear as normal user-facing service identity syntax.

Design Grounding

Project files read for this post-mortem:

  • docs/capability-model.md
  • docs/architecture/ipc-endpoints.md
  • docs/proposals/service-object-capabilities-proposal.md
  • docs/proposals/session-bound-invocation-context-proposal.md
  • docs/authority-accounting-transfer-design.md
  • docs/research/capability-systems-survey.md
  • docs/security/trust-boundaries.md
  • docs/tasks/README.md
  • docs/tasks/README.md

Relevant research:

  • docs/research/sel4.md

The historical badge model followed the seL4 badge/mint precedent recorded in the repo research notes. The rejection is capOS-specific: schema-typed interfaces, session-bound process identity, broker-issued service authority, and privacy-bounded disclosure fit the project better than making a generic endpoint metadata word carry service identity.

Proposal: Service Object Capabilities

Status: Superseded by Session-Bound Invocation Context. This document remains as historical design context for the already-landed synthetic routing/lifecycle proof. Do not continue the subject/proof root-opening or shared-service service-object migration from this proposal.

Replace caller-selected endpoint identity with service-minted object capabilities.

Problem

Endpoint client metadata currently carries service identity. A client endpoint is a capability plus a caller-visible numeric tag; services can use that tag as a member, session, role, connection, or actor key. That is too close to a permission bitmask or ambient label: the generic IPC substrate accepts an untyped number, and each service has to assign security meaning by convention.

The pre-containment problem became concrete through shell spawn syntax. A shell that held a delegated chat client endpoint could request:

run "chat-client" with { chat: client @chat badge 200 }

The launcher path could then pass the child a client facet with a different value than the shell originally held. Gate 0 now rejects that relabeling for ordinary delegated client facets, but the chat example remains the reason the broader migration exists: service authority should not depend on a caller-selected numeric tag.

capOS needs multi-client services, per-client state, service-created attenuation, audit subject binding, and shell-spawned children. It does not need caller-selected numeric identities.

Goals

  • Make the capability object itself carry the service authority.
  • Keep endpoint transport generic while avoiding generic service roles, permission bits, or caller-selected labels.
  • Let services expose many logical objects through one resident process.
  • Let services bind subject/audit information at object creation time without putting identity policy into the kernel IPC fast path.
  • Preserve explicit transfer semantics: copy and move pass the same object authority unless a trusted minter creates a new one.
  • Provide a staged migration path for current chat, adventure, stdio, and endpoint smokes.

Non-Goals

  • A POSIX credential model.
  • PID, PID@host, UID, role strings, or host names as service authority.
  • Kernel interpretation of chat rooms, moderators, players, sessions, or principals.
  • Generic per-capability permission bitmasks.
  • Full network-transparent object references in the first slice.

Design

An Endpoint remains a transport queue owned by a server process. Ordinary clients should not hold “endpoint plus badge”; they should hold a capability to a service object exported by that server.

Design Grounding

Project files read for this design:

  • docs/capability-model.md
  • docs/architecture/ipc-endpoints.md
  • docs/proposals/service-architecture-proposal.md
  • docs/proposals/shell-proposal.md
  • docs/proposals/interactive-command-surface-proposal.md
  • docs/backlog/stage-6-capability-semantics.md
  • docs/backlog/runtime-network-shell.md
  • docs/backlog/shared-service-demos.md
  • docs/security/trust-boundaries.md
  • docs/authority-accounting-transfer-design.md
  • docs/tasks/README.md

Relevant research:

  • docs/research/sel4.md
  • docs/research/eros-capros-coyotos.md
  • docs/research/genode.md

The design deliberately supersedes the prior seL4-style badge/mint direction for service identity. Genode’s RPC/session object model is a closer fit for capOS services: clients hold capabilities to service-created objects, while delegation passes the same object authority. EROS/CapROS/Coyotos and the authority-accounting design reinforce the rule that authority should remain in the capability graph, not in caller-selected numeric metadata.

Examples by service:

Chat service:
ChatRoot
ChatParticipant
ChatRoom
ChatModerator

Terminal/child I/O service:
StdIO

Adventure service:
AdventurePlayer
AdventureNpc

Each object capability has one interface ID and one server-selected receiver selector. The receiver selector is opaque to the client. It is not a user field, not shell syntax, and not a policy label. It exists only so the kernel can route the call to the resident server and the server can dispatch it to the right object state.

Conceptually:

service object cap = target endpoint + interface id + opaque receiver selector

Only trusted minting paths may create a new receiver selector:

  • the endpoint owner/server,
  • a supervisor or broker that holds explicit mint authority from the server,
  • transitional manifest/init wiring for boot services.

Copying or moving a service object cap preserves the same receiver selector. An ordinary client cannot relabel a delegated cap into a sibling object.

Subject / Proof Binding

A service should be able to learn who or what a service object represents, but that subject must be bound through trusted issuance rather than caller payload claims.

The general shape is:

interface Subject {
    deriveProof @0 (request :DelegationRequest) -> (proof :SubjectProof);
}

interface SubjectProof {
    attest @0 (challenge :Challenge) -> (statement :SubjectStatement);
}

interface ServiceRoot {
    open @0 (proof :SubjectProof, request :OpenRequest)
        -> (object :ServiceObject);
}

interface ServiceObject {
    call @0 (request :Request) -> (response :Response);
}

UserSession is the interactive-user case and can derive a proof scoped to a service root, request digest, audience, and freshness window. A service account, workload identity, broker-issued proof, anonymous session, guest session, or other typed subject cap can fill the same role when that is the right trust boundary. The root/factory validates the proof through a verifier, broker, account, audit, or application policy interface it was granted, stores verified metadata in its own object table, and returns a service object cap.

Later calls on the returned object do not need caller-supplied identity. Possession of that object cap is the authority. The service can still record principal/session audit identifiers, display names, channel memberships, quota/accounting state, moderation state, workload labels, or other policy metadata internally, but those records are service state, not endpoint metadata that the caller can edit.

Example Chat Shape

The current chat service uses one Chat endpoint and maps legacy endpoint metadata to members. The target model is a root/factory plus participant objects.

interface ChatRoot {
    join @0 (channel :Text, handle :Text, session :UserSession)
        -> (participant :ChatParticipant);
}

interface ChatParticipant {
    join @0 (channel :Text) -> (joined :Bool);
    leave @1 (channel :Text) -> (left :Bool);
    send @2 (channel :Text, text :Text) -> (sent :Bool);
    who @3 (channel :Text) -> (members :List(Text));
    poll @4 (maxEvents :UInt16) -> (events :List(ChatEvent));
    close @5 () -> ();
}

interface ChatModerator {
    kick @0 (participant :ChatParticipant, channel :Text) -> (kicked :Bool);
}

ChatParticipant is the participant authority. If a child process receives that cap, it acts as that same participant. It cannot type another receiver selector and become another participant.

Moderator behavior is a separate cap/interface. The service may internally associate participant and moderator state with the same subject, but the kernel does not provide a role field and the client does not choose one.

Chat does not need to know about AdventurePlayer or AdventureNpc. Adventure-specific caps belong to the adventure service. Room speech should cross into chat through ordinary chat object caps such as ChatParticipant or a future room-scoped chat object; the chat service should see chat subjects and channels, not adventure interfaces.

Kernel Contract

The kernel should enforce object-cap invariants and avoid service semantics.

Required invariants:

  • Only endpoint owners or explicit mint-authority holders may create a service object cap for a new receiver selector.
  • Delegating an existing service object cap preserves the receiver selector.
  • Process spawning may copy or move service object caps but may not relabel them.
  • Client-held object caps cannot receive or return endpoint messages unless their interface explicitly grants server authority.
  • Receiver selectors are scoped to the target endpoint object; no global numeric namespace is part of the ABI.
  • Process exit and cap release still drive endpoint cleanup for queued calls, in-flight returns, and server-visible cancellation.

The first compatibility step can keep the current u64 storage field but change the rules: a delegated client endpoint’s numeric identity is preserved on re-delegation. The target step renames and narrows the concept from badge to an opaque receiver selector for service object caps.

Shell And Launcher Contract

Shell users should launch applications, not assign service identities.

Target user shape:

run "chat-client"
run "adventure-client"

Prototype explicit-grant shape while migration is incomplete:

run "chat-client" with { stdio: client @stdio, chat: @chat_participant }

The normal shell must not expose badge N as user-facing authority syntax. If a grant parser keeps legacy badge syntax for manifest or smoke migration, the kernel must still reject any delegated-client relabeling. Omitting a badge in shell syntax preserves the source identity; low-level legacy badge-zero encodings remain hostile-test inputs and must still fail closed for nonzero delegated client facets.

External And Network Boundaries

External identity assertions do not open service objects directly. OIDC ID tokens, passkey assertions, certificate chains, cloud workload tokens, and remote gateway-authenticated claims first pass through an admission service that normalizes provider kind, issuer, tenant, and subject; maps the result to a local or pseudonymous principal when policy allows; and mints a local subject/proof capability. Imported groups, roles, tenants, acr, amr, device posture, source network, and token age are ABAC inputs to that mint decision, not downstream object authority.

Network-transparent capability transport is also out of the first slice. A future bridge should maintain connection-local export/import tables and expose broken-reference semantics on disconnect. It must not serialize local cap-table handles, endpoint generations, receiver selectors, or server cookies as portable authority. Persistent restore, if needed, should go through a capability-bearing naming or persistence service that authorizes and mints a fresh live object.

Migration Plan

The current execution plan lives in docs/backlog/service-object-identity-migration.md and uses four large chunks. Gate 0 containment below is already historical substrate. It does not mean the service-object model is implemented. This proposal records the design sequence; the backlog owns task breakdown and verification gates.

0. Contain delegated-client relabeling, landed

The kernel and shell paths now reject ordinary delegated-client relabeling. This is containment, not the final model.

1. Core service-object routing and lifecycle, landed

Commit a4655f0 at 2026-04-28 14:10 UTC added the synthetic QEMU service proof. It covers trusted serviceObject minting, receiver-cookie routing, copy/move IPC transfer, nested spawn delegation, generation-checked service receiver cookies, close/revoke rejection, and stale-cookie rejection after record reuse.

2. Subject/proof root opening

Validate local subject/proof authority before object mint. External assertions must first normalize through admission into local or pseudonymous subject/proof caps.

3. Convert shared-service demos

Move chat, adventure, and stdio/terminal child bridges from caller-selected endpoint identity to root/factory-opened service object caps.

4. Retire legacy endpoint identity

Remove compatibility syntax and rename internal fields once normal smokes no longer depend on caller-selected endpoint identity.

Security Notes

This design keeps the kernel out of role and identity policy. The kernel only knows whether a caller holds a particular object cap and whether transfer rules allow that cap to move. Services decide what their object records mean.

PID, PID@host, and process names are diagnostics. They are not authority: process IDs recycle, hosts need cryptographic naming for federation, and a single subject can legitimately hold multiple service objects with different authority.

The broker and session services remain the right place to validate subjects and policy before a service object is minted. After minting, the object cap is the authority.

Rejected Proposal: Cap’n Proto SQE Envelope

Proposal

Replace the fixed C-layout CapSqe descriptor with a fixed-size padded Cap’n Proto message. Each SQ slot would contain a serialized single-segment Cap’n Proto struct with a union for call, recv, return, release, and finish, then zero padding to the chosen SQE size.

The live ring currently pins each SQ slot to 64 bytes (SQE_SIZE in capos-config/src/ring.rs), so any Cap’n Proto envelope would either have to fit inside that budget or motivate a slot-size bump. For a hypothetical 128-byte slot, the rough layout would be:

+0x00  u32 segment_count_minus_one
+0x04  u32 segment0_word_count
+0x08  word root pointer
+0x10  RingSqe data words, including union discriminant
+0x??  zero padding to 128 bytes

A compact schema would need to keep fields flat to avoid pointer-heavy nested payload structs:

struct RingSqe {
  userData @0 :UInt64;
  capId @1 :UInt32;
  methodId @2 :UInt16;
  flags @3 :UInt16;
  addr @4 :UInt64;
  len @5 :UInt32;
  resultAddr @6 :UInt64;
  resultLen @7 :UInt32;
  callId @8 :UInt32;

  union {
    call @9 :Void;
    recv @10 :Void;
    return @11 :Void;
    release @12 :Void;
    finish @13 :Void;
  }
}

Potential Benefits

A Cap’n Proto SQE envelope would make the ring operation shape schema-defined instead of Rust-struct-defined. That has some real advantages:

  • The ABI documentation would live in schema/capos.capnp next to the capability interfaces.
  • Future userspace runtimes in Rust, C, Go, or another language could use generated accessors instead of hand-mirroring a packed descriptor layout.
  • The operation choice could be represented as a schema union, making it clear that fields meaningful for CALL are not meaningful for RECV or RETURN.
  • Cap’n Proto defaulting gives a familiar path for adding optional fields while letting older readers ignore fields they do not understand.
  • Ring dumps and traces could be decoded with generic Cap’n Proto tooling.
  • A single “everything crossing this boundary is Cap’n Proto” rule is architecturally simpler to explain.

Those benefits are mostly about schema uniformity, generated bindings, and tooling. They do not remove the need for an operation discriminator; they move it from an explicit fixed descriptor field to a Cap’n Proto union tag.

Rationale For Rejection

The SQE is the fixed control-plane descriptor for a hostile kernel boundary. It should be cheap to classify and validate before any operation-specific payload parsing. A Cap’n Proto SQE envelope would still have a discriminator, but would move it into generated reader state and require Cap’n Proto message validation before the kernel even knows whether the entry is a CALL, RECV, or RETURN.

The current shape concentrates that hostile-input validation in one place: sqe_wire_validation_error in capos-config/src/ring.rs is the single source of truth shared by the kernel dispatch path and the sqe_validation fuzzer under fuzz/fuzz_targets/. Replacing the descriptor with a Cap’n Proto message would push some of that validation into generated reader state and split the fuzz surface across the framing parser and the per-opcode predicates.

Cap’n Proto framing also consumes slot space: a single-segment message needs a segment table and root pointer before the struct data. The live 64-byte slot would not fit a Cap’n Proto envelope without either dropping fields or growing the slot; a 128-byte envelope would spend much of the slot on framing and padding. Nested payload structs are worse because they add pointers inside the ring descriptor.

The accepted split is:

  • fixed #[repr(C)] ring descriptors for SQ/CQ control state;
  • Cap’n Proto for capability method params, results, and higher-level transport payloads where schema evolution is valuable;
  • endpoint delivery metadata in a small fixed EndpointMessageHeader followed by opaque params bytes.

EndpointMessageHeader is concretely 56 bytes today (see the static-size assertion in capos-config/src/ring.rs), which keeps the endpoint delivery header well under one cache line while leaving payload bytes opaque to the kernel.

There is also a layering issue. The capability ring is part of the local Cap’n Proto transport implementation: it is the mechanism that moves capnp calls, returns, and eventually release/finish/promise bookkeeping between a process and the kernel. The SQE itself is therefore below ordinary Cap’n Proto message usage. Making the transport substrate depend on parsing Cap’n Proto messages to discover which transport operation to perform would couple the transport implementation to the protocol it is supposed to carry. Method params and results are proper Cap’n Proto messages; the ring descriptor is the framing/control structure that gets the transport to the point where those messages can be interpreted.

This keeps queue geometry simple, preserves bounded hostile-input handling, and avoids running a Cap’n Proto parser on the hot descriptor path.

  • Ring v2 SMP Proposal – forward path for ring geometry that keeps the fixed-layout descriptor and negotiates sqe_size rather than wrapping each slot in a Cap’n Proto message.
  • ABI Evolution Policy – how non-capnp ring ABIs (including SQE/CQE layouts) evolve alongside the Cap’n Proto schema.
  • Error Handling Proposal – where Cap’n Proto does sit on the dispatch path: CapException payloads carried in SQE result buffers.

Rejected Proposal: Sleep(INF) Process Termination

Concern

Unix-style zombies are a poor fit for capOS. A terminated child should not keep its address space, cap table, endpoint state, or other authority alive merely because a parent has not waited yet. The remaining observable state should be a small, capability-scoped completion record, and only holders of the corresponding ProcessHandle should be able to observe it.

The current ProcessHandle.wait() -> exitCode :Int64 shape is also too weak for future lifecycle semantics. Raw numeric status cannot distinguish normal application exit from abandon, kill, fault, startup failure, runtime panic, or supervisor policy actions without inventing process-wide magic numbers.

Proposal

Introduce a system sleep operation and treat Sleep(INF) as a special terminal operation. The argument for this spelling is that a process that never wants to run again can enter an infinite sleep instead of becoming a zombie. The kernel would recognize the infinite case and handle it specially:

  • finite Sleep(duration) blocks the process and wakes it later;
  • Sleep(INF) never wakes, so the kernel tears down the process;
  • the process’s authority is released as if it had exited;
  • parent-visible process completion is either omitted or reported as a special status.

A variant also removes the dedicated sys_exit syscall and makes Sleep(INF) the only user-visible process termination primitive.

Candidate Semantics

Sleep(INF) as Exit(0)

The simplest version maps Sleep(INF) to normal successful exit.

This is rejected because it lies about intent. A program that completed successfully, a program that intentionally detached, and a program that chose to disappear without status are not the same lifecycle event. Supervisors would see the same status for all of them.

Sleep(INF) as Abandoned

A less lossy version gives Sleep(INF) a distinct terminal status:

struct ProcessStatus {
  union {
    exited @0 :ApplicationExit;
    abandoned @1 :Void;
    killed @2 :KillReason;
    faulted @3 :FaultInfo;
    startupFailed @4 :StartupFailure;
  }
}

struct ApplicationExit {
  code @0 :Int64;
}

ProcessHandle.wait() would return status :ProcessStatus instead of a bare exitCode :Int64. Normal application termination returns exited(code), while Sleep(INF) returns abandoned.

This fixes the type problem, but leaves the operation name wrong. Sleep normally means the process remains alive and keeps its authority until a wake condition. The infinite special case would instead release authority, reclaim memory, cancel endpoint state, complete process handles, and make the process impossible to wake. That is termination, not sleep.

Sleep(INF) as Detached No-Status Termination

Another version treats Sleep(INF) as detached termination and gives parents no status. That avoids inventing an exit code, but it weakens supervision. Init and future service supervisors need a definite terminal event to implement restart policy, diagnostics, dependency failure reporting, and “wait for all children” flows. A missing status is not a useful status.

Remove sys_exit Through a Typed Lifecycle Capability

Removing the dedicated sys_exit syscall is a separate, plausible future direction. The cleaner version is not Sleep(INF), but an explicit lifecycle operation:

interface ProcessSelf {
  terminate @0 (status :ProcessStatus) -> ();
  abandon @1 () -> ();
}

interface ProcessHandle {
  wait @0 () -> (status :ProcessStatus);
}

The process would receive ProcessSelf only for itself. Calling terminate would be non-returning in practice: the kernel would process the request, release process authority, complete any ProcessHandle waiter with the typed status, and not post an ordinary success completion back to the dying process.

The transport shape needs care. A generic Cap’n Proto call normally expects a completion CQE, but a self-termination operation cannot safely rely on the dying process to consume one. Viable implementations include:

  • a dedicated ring operation such as CAP_OP_EXIT targeting a self-lifecycle cap;
  • a ProcessSelf.terminate call whose method is explicitly non-returning and never posts a CQE to the caller;
  • keeping sys_exit temporarily until ring-level non-returning operations have explicit ABI and runtime support.

This path removes the ambient exit syscall without overloading sleep. It also forces terminal status to become typed before kill, abandon, restart policy, or fault reporting are added.

Rationale For Rejection

Sleep(INF) solves the wrong abstraction problem. The zombie problem is not that a process needs a forever-blocked state. The problem is retaining process resources after terminal execution. capOS should solve that by separating process lifetime from process-status observation:

  • process termination immediately releases authority and reclaims process resources;
  • a ProcessHandle is only observation authority, not ownership of the live process;
  • if a handle exists, a small completion record may remain until it is waited or released;
  • if no handle exists, terminal status can be discarded;
  • no ambient parent process table is needed.

Under that model, a sleeping process remains alive and authoritative, while a terminated process does not. Special-casing Sleep(INF) to perform teardown would make the name actively misleading and would create a hidden terminal operation with different semantics from finite sleep.

The accepted direction is therefore:

  • keep explicit process termination semantics;
  • replace raw exitCode :Int64 with typed ProcessStatus before adding more lifecycle states;
  • keep the minimal terminal self-exit ABI until a typed self-lifecycle capability or ring operation can replace it cleanly;
  • add future Timer.sleep(duration) only for real sleep, where the process remains alive and may wake.

Sleep(INF) remains rejected as a termination primitive. The concern it raises is valid, but the solution is typed terminal status plus status-record cleanup, not infinite sleep.

Papers

Long-form research write-ups produced from the capOS codebase. Each paper is typeset with Typst from sources under papers/<slug>/ in the repository and published as a PDF alongside this site.

Schema-as-ABI: Typed Capabilities and Ring-Transport Dispatch in capOS

Download PDF

A pre-evidence draft describing the schema-as-ABI thesis: Cap’n Proto schemas acting as kernel ABI, access-control mechanism, IPC wire format, and (planned) persistence and network-transparency substrate, layered over a shared-memory SQ/CQ ring with a two-syscall surface (exit and cap_enter).

The draft separates closed contributions (capability ring transport, exactly-once accounting rollback, capability lifecycle, the verification stack) from evidence-gated claims that depend on outstanding artifacts (C1 service-object migration, C2 measurement run, C3 persistence proof-of-concept, C4 network-transparency proof-of-concept). Sections that depend on missing artifacts are flagged with TODO admonitions naming the gap and the entry in docs/paper/evidence-gaps.md that closes them.

Source: papers/schema-as-abi/main.typ in the repository. Build locally with make paper; the same target runs in make cloudflare-pages-build and publishes the PDF at the link above.

Research Deep-Dive Index

The pages under docs/research/ are deep-dive reports informing capOS design decisions. Proposals and design notes cite them as grounding for capability model, IPC, scheduling, networking, error handling, runtime, agent, and prior-art choices. The Capability-Based and Microkernel Operating Systems survey records the cross-system design consequences pulled from this body of research; the entries below give the full alphabetical listing of individual reports for direct discovery.

Start here:

Individual reports:

Research: Capability-Based and Microkernel Operating Systems

Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.

This survey records the cross-system design consequences; the research index lists the individual deep-dive reports. Read the consequences below first; open individual reports only when their design context is relevant.

Design consequences for capOS

  • Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
  • Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
  • Continue the ring transport plus direct-handoff IPC path, with shared memory reserved for bulk data once SharedBuffer/MemoryObject exists.
  • Treat seL4-style endpoint badges as historical receiver metadata, not as the active service identity model; use move/copy transfer descriptors, object-epoch revocation, and session-bound invocation context to make authority delegation explicit and reviewable.
  • Model session lifetime as revocable liveness state plus grant leases, not as generic capability expiry. EROS/CapTP-style revocation-by-indirection and Genode-style session closure are better precedents than refreshing every old reference in place.
  • Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
  • Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
  • Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.
  • Use Pingora-style lifecycle frameworks only above the capability transport: userspace service libraries can provide phase hooks, per-request context, readiness, graceful shutdown, retry policy, and observability, while kernel interfaces remain narrow typed capabilities with explicit authority.

Individual deep-dive reports:

  • seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
  • Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
  • Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
  • EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
  • Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
  • LLVM target customization – target triples, TLS models, Go runtime requirements
  • Linux sandboxes and virtualization for workloads – Linux namespaces, cgroup v2, seccomp, Landlock, bubblewrap, nsjail, systemd-nspawn, OCI runtimes and images, User-Mode Linux, gVisor, QEMU/KVM, Firecracker, Kata Containers, and capOS auto full-nohz interaction grounding for generic Linux workload execution, familiar user environments, and agent-initiated jobs
  • Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
  • Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web – Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and production Cap’n Proto/KJ lessons for capOS remote-capability design
  • Spritely, OCapN, and CapTP – object capability network protocols, netlayers, locators, sturdyrefs, Syrup, promise pipelining, distributed GC, and third-party handoffs
  • Browser engines, document engines, and agent browsers – mainstream browser engine portability, cap-native document-engine substrate options, automation protocols, Donut Browser-style profile orchestration, and implications for visual and agent/shell browser capabilities
  • OS error handling – error patterns in capability systems and microkernels used by the capOS error model
  • IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
  • Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel
  • Completion rings and threaded runtimes – completion ownership, io_uring, futex, and IOCP precedents for capOS’s full-SMP ring/threading ABI
  • x2APIC and APIC virtualization – x2APIC backend direction, QEMU/KVM validation constraints, and why the current xAPIC MMIO LAPIC path should remain the Phase C foundation
  • IOMMU remapping – primary-source Intel VT-d, AMD-Vi, and QEMU grounding for future real DMA remapping work, while current capOS remains diagnostics-only with direct DMA blocked
  • Cloud DMA provider evidence inventory – official AWS/Azure/GCP device-surface facts, the evidence-matrix schema, the live guest-probe checklist, and the fail-closed classification rules the cloud DMA backend decision consumes
  • Future scheduler architecture – Linux CFS/EEVDF, SCHED_DEADLINE, sched_ext, FreeBSD ULE, seL4 MCS, ghOSt, scheduler activations, Shenango, Caladan, Shinjuku, and Arachne lessons for capOS per-CPU queues, CPU accounting, fair scheduling, scheduling contexts, CPU isolation leases, realtime islands, and user-space scheduler policy
  • NO_HZ, SQPOLL, and realtime scheduling – Linux NO_HZ, clocksource/clockevent, CPU isolation/housekeeping, io_uring SQPOLL, SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding for capOS tickless idle, SQPOLL nohz, scheduling contexts, and realtime islands
  • HPC parallel patterns – Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction grounding for future single-node and multi-node parallel benchmark coverage
  • Scientific agent-lab software stack – PARI/GP, SageMath, GAP, Singular, OSCAR, SymPy, SciPy, R, Octave, JupyterLab, Z3, cvc5, HiGHS, SCIP, OR-Tools, JuMP, CVXPY, Lean/mathlib, Rocq, Isabelle, Agda, Spack, Guix-HPC, Nix, and Apptainer grounding for a future capOS scientific standard package and LLM agent research lab
  • Pingora – phase-oriented service framework design, operational lifecycle, pooling/retry lessons, and why capOS should borrow the userspace library shape without importing Pingora’s HTTP or process model. The concrete capOS follow-up is capos-service, starting with terminal/networking lifecycle rather than HTTP.
  • Multimedia pipeline latency – PipeWire and JACK lessons for a capOS media graph optimized for the minimal possible guaranteed-stable stack latency, explicit latency ranges, admitted realtime islands, and xrun/deadline telemetry
  • Realtime multimodal agent APIs – OpenAI Realtime, Google AI Gemini Live API, and Vertex AI Live API implications for capOS voice agent-shell, realtime media sessions, tool-call gating, and provider adapters
  • Hosted agent harnesses – OpenClaw-like harness controls, hosted agent swarms, LLM-maintained wiki memory, schema-guided reasoning, MCP/A2A-style adapters, and implications for capability-scoped capOS agent services
  • Game mechanics prior art – Stardew Valley, EVE Online, and Evil Islands mechanics translated into capability-shaped Aurelian Frontier calendar, market, construction, and combat tasks
  • Robotics realtime control – ROS 2, micro-ROS, ros2_control, seL4 MCS, PREEMPT_RT, Xenomai, Orocos, Nav2, PX4, ArduPilot, Autoware, and OPC UA lessons for using capOS as a robot brain with explicit actuator authority and admitted realtime islands

Cross-Cutting Analysis

1. Capability Table Design

All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.

SystemStructureLookupDelegationRevocation
seL4Tree of CNodes (power-of-2 arrays with guard bits)O(depth)Subtree (grant CNode cap)CDT (derivation tree), transitive
ZirconFlat per-process handle tableO(1)Transfer through channels (move)Close handle; refcount; no propagation
EROS32-slot nodes forming treesO(depth)Node key passingForwarder keys (O(1) rescind)
GenodeKernel-enforced capability referencesO(1)Parent-mediated session routingSession close
capOSFlat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backingO(1)Manifest exports plus copy/move transfer descriptors through Endpoint IPCLocal release/process exit, object-epoch revocation for child-local grants, and target session liveness/grant-lease checks

Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:

  1. Hold-edge metadata – transfer scope, disclosure scope, object id/epoch, and any legacy receiver metadata needed for transport compatibility.
  2. Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
  3. Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
  4. Session/grant lease reference (from Genode/EROS/CapTP-style lifecycle lessons) – a future pointer to mutable liveness or grant state so logout, renewal, and revocation do not require scanning all cap tables or relabeling a running process.

Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks (READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped. capOS uses Cap’n Proto typed interfaces where the schema defines what methods exist. Method-level access control is the interface itself – to restrict what a caller can do, grant a narrower capability (a wrapper CapObject that exposes fewer methods). A parallel rights system would create an impedance mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods. Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be added when Stage 6 IPC needs them. See Capability Model for the full rationale.

2. IPC Design

IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.

SystemModelLatency (round-trip)Bulk dataAsync
seL4Synchronous endpoint, direct context switch~240 cycles (ARM), ~400 cycles (x86)Shared memory (explicit)Notification objects (bitmask signal/wait)
ZirconChannels (async message queue, 64KiB + 64 handles)~3000-5000 cyclesVMOs (shared memory)Ports (signal-based notification)
EROSSynchronous domain call~2x L4Through address space nodesNone (synchronous only)
Plan 99P over pipes (kernel-mediated)~5000+ cyclesLarge reads/writes (iounit)None (blocking per-fid)
GenodeRPC objects with session routingVaries by kernel (uses seL4/NOVA/Linux underneath)Shared-memory dataspacesSignal capabilities

Recommendation for capOS: Continue the dual-path IPC design:

Fast synchronous path (seL4-inspired, for RPC):

  • When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
  • Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.

Async submission/completion rings (io_uring-inspired, for batching):

  • SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
  • Support SQE chaining for Cap’n Proto promise pipelining.
  • Use Spritely/OCapN CapTP as the prior-art shape for remote capability sessions, third-party handoffs, answer namespaces, and distributed reference-release accounting, but do not treat current OCapN drafts as a frozen capOS ABI.
  • Signal/notification delivery through CQ entries (from Zircon ports).
  • User-queued CQ entries for userspace event loop integration.

Bulk data (Zircon/Genode-inspired):

  • SharedBuffer capability for zero-copy data transfer between processes.
  • Capnp messages for control plane; shared memory for data plane.
  • Critical for file I/O, networking, and GPU rendering.

3. Memory Management Capabilities

Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.

VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):

interface VirtualMemory {
    map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, size :UInt64) -> ();
    protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}

MemoryObject capability (needed for IPC bulk data, shared libraries). Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical interface definition.

interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
        -> (child :MemoryObject);
}

4. Scheduling

SystemModelPriority inversion solutionTemporal isolation
seL4 (MCS)Scheduling Contexts (budget/period/priority) + Reply ObjectsSC donation through IPC (caller’s SC transfers to callee)Yes (budget enforcement per SC)
ZirconFair scheduler with profiles (deadline, capacity, period)Kernel-managed priority inheritanceProfiles provide some isolation
GenodeDelegated to underlying kernel (seL4/NOVA/Linux)Depends on kernelDepends on kernel
Out-of-kernel policyKernel dispatch/enforcement + user-space policy serviceScheduling-context donation through IPCKernel-enforced budgets, user-chosen policy
User-space runtimesM:N work stealing, fibers, async tasks over kernel threadsRequires futexes, runtime cooperation, and OS-visible blocking eventsUsually runtime-local only

Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:

  1. Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
  2. Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
  3. Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.

For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and park authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized park-shaped operations before deciding the park hot-path encoding.

5. Persistence

SystemModelConsistencyApplication effort
EROS/CapROSTransparent global checkpoint (single-level store)Strong (global snapshot)None (automatic)
Plan 9User-mode file servers with explicit writesPer-file serverFull (explicit save/load)
GenodeApplication-level (services manage own persistence)Per-componentFull
capOS (planned)Content-addressed Store + Namespace capsPer-serviceFull (explicit capnp serialize)

Recommendation for capOS: Three phases, as informed by EROS:

  1. Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
  2. Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
  3. Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.

Persistent capability references (from EROS + Cap’n Proto):

struct PersistentCapRef {
    interfaceId @0 :UInt64;
    objectId @1 :UInt64;
    permissions @2 :UInt32;
    epoch @3 :UInt64;
}

Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.

6. Namespace and VFS

Plan 9’s per-process namespace is the closest analog to capOS’s per-process capability table. The key insight: Plan 9’s bind/mount with union semantics provides composability that capOS’s current Namespace design lacks.

Recommendation: Extend Namespace with union composition:

enum UnionMode { replace @0; before @1; after @2; }

interface Namespace {
    resolve @0 (name :Text) -> (hash :Data);
    bind @1 (name :Text, hash :Data) -> ();
    list @2 () -> (names :List(Text));
    sub @3 (prefix :Text) -> (ns :Namespace);
    union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}

VFS as a library (from Genode): libcapos-posix should be an in-process library that translates POSIX calls to capability invocations. Each POSIX process receives a declarative mount table (capnp struct) mapping paths to capabilities. No VFS server needed.

FileServer capability (from Plan 9): For resources that are naturally file-like (config trees, debug introspection, /proc-style interfaces), provide a FileServer interface. Not universal (as in Plan 9) but available where the file metaphor fits.

7. Resource Accounting

Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.

Recommendation: Session-creating capability methods should accept a resource donation parameter:

interface NetworkManager {
    createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}

The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.

8. Language Support Roadmap

From the LLVM research, the recommended order:

StepWhatBlocks
1Custom target JSON (x86_64-unknown-capos)Done for booted userspace crates
2VirtualMemory capabilityDone for baseline map/unmap/protect; Go allocator glue remains
3TLS support (PT_TLS parsing, FS base save/restore)Done for static ELF processes and current-process ThreadControl; per-thread TLS remains
4Park authority capability + measured ABIGo threads, pthreads
5Timer capability (monotonic clock)Done for monotonic now/sleep; wall-clock and event timers remain future work
6Go Phase 1: minimal GOOS=capos (single-threaded)Runtime capability checkpoint done; Go fork remains
7Kernel threadingGo GOMAXPROCS>1
8C toolchain + libcaposC programs, musl
9Go Phase 2: multi-threaded + concurrent GCGo network services
10Go Phase 3: network pollernet/http on capOS

Key decisions:

  • Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace.
  • Use local-exec TLS model (static linking, no dynamic linker).
  • Implement park as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
  • Go can start with cooperative-only preemption (no signals).

Recommendations by Roadmap Stage

Stage 5: Scheduling

SourceRecommendationPriority
ZirconGeneration counter in CapId (stale reference detection)Done
seL4Add notification objects (lightweight bitmask signal/wait)Medium
LLVMCustom target JSON for userspace (x86_64-unknown-capos)Done
LLVMPer-thread TLS state for Go/threadingMedium

Stage 6: IPC and Capability Transfer

SourceRecommendationPriority
seL4Direct-switch IPC for synchronous cross-process callsDone baseline
seL4Badge field on capability entries for server-visible caller identityHistorical / rejected as service identity; see Rejected: Endpoint Badges as Service Identity
ZirconMove semantics for capability transfer through IPCDone
ZirconMemoryObject capability (shared memory for bulk data)Done baseline
EROSEpoch-based revocation (O(1) revoke, O(1) check)High
ZirconSideband capability-transfer descriptors and result-cap recordsDone baseline
GenodeSharedBuffer capability for data-plane transfersHigh
Plan 9Promise pipelining (SQE chaining in async rings)Medium
GenodeSession quotas / resource donation on session creationMedium
seL4Scheduling context donation through IPCMedium
Plan 9Namespace union composition (before/after/replace)Low

Post-Stage 6 / Future

SourceRecommendationPriority
seL4MCS scheduling (passive servers, temporal isolation)When needed
EROSOpt-in Checkpoint capability for process persistenceWhen needed
GenodeDynamic manifest reconfiguration at runtimeWhen needed
Plan 9exportfs-pattern capability proxy for network transparencyWhen needed
EROSPersistentCapRef struct in capnp for storing capability graphsWhen needed
seL4Rust-native formal verification (track Verus/Prusti)Long-term

Design Decisions Validated

Several capOS design choices are validated by this research:

  1. Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See Zircon Section 5.

  2. Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See seL4 Section 4.

  3. No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.

  4. Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See EROS, CapROS, Coyotos Section 6.

  5. io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See Zircon Section 4.

  6. VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned libcapos-posix. See Genode Section 3.

  7. No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See Genode Section 4.

Design Gaps Identified

  1. Bulk data path is only a substrate. Copying capnp messages through the kernel works for control but not for file/network/GPU data. MemoryObject now provides the mapped-frame substrate; service-facing SharedBuffer APIs remain future Stage 6+ work.

  2. Resource accounting is partially unified. The authority-accounting design exists, and VirtualMemory plus FrameAllocator/MemoryObject frame grants now charge the process ResourceLedger::frame_grant_pages counter. Future shared-buffer, DMA, log-volume, and CPU-budget resources still need the same treatment.

  3. No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.

  4. No per-thread TLS object yet. Static ELF TLS, context-switch FS-base state, and current-process ThreadControl exist, but future user threads still need independently settable FS bases per thread.


References

See individual deep-dive reports for full reference lists. Key primary sources:

  • Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
  • Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
  • Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
  • Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
  • Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
  • Feske, “Genode Foundations” (genode.org/documentation)
  • Fuchsia Zircon kernel documentation (fuchsia.dev)

seL4 Deep Dive: Lessons for capOS

Research notes on seL4’s design, covering formal verification, capability model, IPC, scheduling, and applicability to capOS.

Primary sources: “seL4: Formal Verification of an OS Kernel” (Klein et al., SOSP 2009), seL4 Reference Manual (v12.x / v13.x), “The seL4 Microkernel – An Introduction” (whitepaper, 2020), “Towards a Verified, General-Purpose Operating System Kernel” (Klein et al., 2008), “Principled Approach to Kernel Design for MCS” (Lyons et al., 2018), seL4 source code and API documentation.


1. Formal Verification Approach

What seL4 Proves

seL4 is the first general-purpose OS kernel with a machine-checked proof of functional correctness. The verification chain establishes:

  1. Functional correctness: The C implementation of the kernel refines (faithfully implements) an abstract specification written in Isabelle/HOL. Every possible execution of the C code corresponds to an allowed behavior in the abstract spec. This is not “absence of some bug class” – it is a complete behavioral equivalence between spec and code.

  2. Integrity (access control): The kernel enforces capability-based access control. A process cannot access a kernel object unless it holds a capability to it. This is proven as a consequence of functional correctness: the spec defines access rules, and the implementation provably follows them.

  3. Confidentiality (information flow): In the verified configuration, information cannot flow between security domains except through explicitly authorized channels. This proves noninterference at the kernel level.

  4. Binary correctness: The proof chain extends from the abstract spec through a Haskell executable model, then to the C implementation, and finally to the compiled ARM binary (via the verified CAmkES/CompCert chain or translation validation against GCC output). On ARM, the compiled binary is proven to behave as the C source specifies.

The Verification Chain

Abstract Specification (Isabelle/HOL)
    |
    | refinement proof
    v
Executable Specification (Haskell)
    |
    | refinement proof
    v
C Implementation (10,000 lines of C)
    |
    | translation validation / CompCert
    v
ARM Binary

Each refinement step proves that the lower-level implementation is a correct realization of the higher-level spec. The Haskell model serves as an “executable spec” – it’s precise enough to run but abstract enough to reason about.

Properties Verified

  • No null pointer dereferences – a consequence of functional correctness.
  • No buffer overflows – all array accesses are proven in-bounds.
  • No arithmetic overflow – all integer operations are proven safe.
  • No use-after-free – memory management correctness is proven.
  • No memory leaks (in the kernel) – all allocated memory is accounted for.
  • No undefined behavior – the C code is proven to avoid all UB.
  • Capability enforcement – objects are only accessible through valid capabilities, and capabilities cannot be forged.
  • Authority confinement – proven that authority does not leak beyond what capabilities allow.

Practical Implications

What verification buys you:

  • Eliminates all implementation bugs in the verified code. Not “most bugs” or “common bug classes” – literally all of them, for the verified configuration.
  • The security properties (integrity, confidentiality) hold absolutely, not probabilistically.
  • Makes the kernel trustworthy as a separation kernel / isolation boundary.

What verification does NOT cover:

  • The specification itself could be wrong (it could specify the wrong behavior). Verification proves “code matches spec,” not “spec is correct.”
  • Hardware must behave as modeled. The proof assumes a correct CPU, correct memory, no physical attacks. DMA from malicious devices can break isolation unless an IOMMU is used (and IOMMU management is proven correct).
  • Only the verified configuration is covered. seL4 has unverified configurations (e.g., SMP, RISC-V, certain platform features). Using unverified features voids the proof.
  • Performance-critical code paths (like the IPC fastpath) were initially outside the verification boundary, though significant progress has been made on verifying them.
  • The bootloader and hardware initialization code are outside the proof boundary.
  • Compiler correctness: on x86, the proof trusts GCC. On ARM, binary verification closes this gap.

Design Constraints Imposed by Verification

The requirement of formal verification has profoundly shaped seL4’s design:

  1. Small kernel: ~10,000 lines of C. Every line must be verified, so the kernel is as small as possible. Drivers, file systems, networking – everything lives in user space.

  2. No dynamic memory allocation in the kernel: The kernel does not have a general-purpose heap. All kernel memory is pre-allocated and managed through typed capabilities (Untyped memory). This eliminates an entire class of verification complexity (heap reasoning).

  3. No concurrency in the kernel: seL4 runs the kernel as a single- threaded “big lock” model (interrupts disabled in kernel mode). SMP is handled by running independent kernel instances on each core with explicit message passing between them (the “clustered multikernel” approach), or by using a big kernel lock (the current SMP approach, which is NOT covered by the verification proof).

  4. C implementation: Written in a restricted subset of C that is amenable to Isabelle/HOL reasoning. No function pointers (mostly), no complex pointer arithmetic, no compiler-specific extensions. This makes the code more rigid than typical C but provable.

  5. Fixed system call set: The kernel API is small and fixed. Adding a new syscall requires extending the proofs – a major effort.

  6. Platform-specific verification: The proof is per-platform. ARM was verified first; x86 verification came later with additional effort. Each new platform requires new proofs.


2. Capability Transfer Model

Core Concepts

seL4’s capability model descends from the EROS/KeyKOS tradition but with significant innovations driven by formal verification requirements.

Kernel Objects: Everything the kernel manages is an object: TCBs (thread control blocks), endpoints (IPC channels), CNodes (capability storage), page tables, frames, address spaces (VSpaces), untyped memory, and more. The kernel tracks the exact type and state of every object.

Capabilities: A capability is a reference to a kernel object combined with access rights. Capabilities are stored in kernel memory, never directly accessible to user space. User space refers to capabilities by position in its capability space.

CSpaces, CNodes, and CSlots

CSlot (Capability Slot): A single storage location that can hold one capability. A CSlot is either empty or contains a capability (object pointer

  • access rights + badge).

CNode (Capability Node): A kernel object that is a power-of-two-sized array of CSlots. A CNode with 2^n slots has a “guard” and a “radix” of n. CNodes are the building blocks of the capability addressing tree.

CSpace (Capability Space): The complete capability namespace of a thread. A CSpace is a tree of CNodes, rooted at the thread’s CSpace root (a CNode pointed to by the TCB). Capability lookup traverses this tree.

Thread's TCB
  |
  +-- CSpace Root (CNode, 2^8 = 256 slots)
        |
        +-- slot 0: cap to Endpoint A
        +-- slot 1: cap to Frame X
        +-- slot 2: cap to another CNode (2^4 = 16 slots)
        |     |
        |     +-- slot 0: cap to Endpoint B
        |     +-- slot 1: empty
        |     +-- ...
        +-- slot 3: empty
        +-- ...

Capability Addressing (CPtr and Depth)

A CPtr (Capability Pointer) is a word-sized integer used to name a capability within a thread’s CSpace. It is NOT a memory pointer – it is an index that the kernel resolves by walking the CNode tree.

Resolution works bit-by-bit from the most significant end:

  1. Start at the CSpace root CNode.
  2. The CNode’s guard is compared against the corresponding bits of the CPtr. If they don’t match, the lookup fails. Guards allow sparse addressing without allocating huge CNode arrays.
  3. The next radix bits of the CPtr are used as an index into the CNode array.
  4. If the slot contains a CNode capability, recurse: consume the next bits of the CPtr to walk deeper.
  5. If the slot contains any other capability, the lookup is complete.
  6. The depth parameter in the syscall tells the kernel how many bits of the CPtr to consume. This disambiguates between “stop at this CNode cap” and “descend into this CNode.”

Example: A CPtr of 0x4B with a two-level CSpace:

  • Root CNode: guard = 0, radix = 4 (16 slots)
  • Bits [7:4] = 0x4 -> index into root CNode slot 4
  • Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
  • Bits [3:0] = 0xB -> index into second-level CNode slot 11
  • Slot 11 contains an Endpoint cap -> lookup complete

Flat Table vs. Hierarchical CSpace

seL4’s hierarchical CSpace has significant implications:

Advantages of hierarchical:

  • Sparse capability spaces without wasting memory. A process can have a huge CPtr range with only a few CNodes allocated.
  • Subtree delegation: a parent can give a child a CNode cap that grants access to a subset of capabilities. The child can manage its own subtree without affecting the parent’s.
  • Guards compress address bits, allowing efficient encoding of large capability identifiers.

Disadvantages of hierarchical:

  • Lookup is slower than a flat array index – multiple memory indirections per resolution.
  • More complex kernel code (and more complex verification).
  • User space must explicitly manage CNode allocation and CSpace layout.

capOS comparison: capOS uses a flat Vec<Option<Arc<dyn CapObject>>> indexed by CapId (u32). The shared Arc lets a single kernel capability back multiple per-process slots, which is what makes cross-process IPC work when another service resolves its CapRef via CapSource::Service. The flat layout is simpler and faster for lookup (single array index), but cannot support sparse addressing or subtree delegation. For capOS’s research goals, the flat approach is adequate initially. If capOS needs hierarchical delegation later (e.g., a supervisor delegating a subset of caps to a child without copying), it could add a level of indirection without adopting seL4’s full tree model.

Capability Operations

seL4 provides these operations on capabilities:

Copy: Duplicate a capability from one CSlot to another. Both the source and destination must be in the caller’s CSpace (or the caller must have CNode caps to the relevant CNodes). The new cap has the same authority as the original, minus any rights the caller chooses to strip.

Mint: Like Copy, but also sets a badge on the new capability. A badge is a word-sized integer embedded in the capability that is delivered to the receiver when the capability is used. Badges allow a server to distinguish which client is calling – each client gets a differently-badged cap to the same endpoint, and the server sees the badge on each incoming message.

Move: Transfer a capability from one CSlot to another. The source slot becomes empty. This is an atomic transfer of authority.

Mutate: Move + modify rights or badge in one operation.

Delete: Remove a capability from a CSlot, making it empty.

Revoke: Delete a capability AND all capabilities derived from it. This is the most powerful operation – it allows a parent to withdraw authority it granted to children, transitively.

Capability Derivation and the CDT

seL4 tracks a Capability Derivation Tree (CDT) – a tree recording which capability was derived from which. When capability A is copied or minted to produce capability B, B becomes a child of A in the CDT.

Revoke(A) deletes all descendants of A in the CDT but leaves A itself. This gives the holder of A the power to revoke all authority derived from their own authority.

The CDT is critical for clean revocation but adds significant kernel complexity. It requires maintaining a tree structure across all capability copies throughout the system.

Untyped Memory and Retype

One of seL4’s most distinctive features is that the kernel never allocates memory on its own. All physical memory is initially represented as Untyped Memory capabilities. To create any kernel object (endpoint, CNode, TCB, page frame, etc.), user space must invoke the Untyped_Retype operation on an untyped cap, which carves out a portion of the untyped memory and creates a new typed object.

This means:

  • User space (specifically, the root task or a memory manager) controls all memory allocation.
  • The kernel has zero internal allocation – all memory it uses comes from retyped untypeds.
  • Memory exhaustion is impossible in the kernel – if a syscall needs memory, user space must have provided it in advance via retype.
  • Revoke on an untyped cap destroys ALL objects created from it, reclaiming the memory. This is the mechanism for wholesale cleanup.

3. IPC Fastpath

Overview

seL4’s IPC is synchronous and endpoint-based. An endpoint is a rendezvous point: the sender blocks until a receiver is ready, or vice versa. There is no buffering in the kernel (unlike Mach ports or Linux pipes).

The IPC fastpath is a highly optimized code path for the common case of a short synchronous call/reply between two threads. It is one of seL4’s signature performance features.

How the Fastpath Works

When thread A calls seL4_Call(endpoint_cap, msg):

  1. Capability lookup: Resolve the CPtr to find the endpoint cap. In the fastpath, this is optimized to handle the common case of a direct CSlot lookup (single-level CSpace, no guard traversal needed).

  2. Receiver check: Is there a thread waiting on this endpoint? If yes, the fastpath applies. If no (receiver isn’t ready), fall to the slowpath which queues the sender.

  3. Direct context switch: Instead of the normal path (save sender registers -> return to scheduler -> pick receiver -> restore receiver registers), the fastpath performs a direct register transfer:

    • Save the sender’s register state into its TCB.
    • Copy the message registers (a small number, typically 4-8 words) from the sender’s physical registers directly into the receiver’s TCB (or leave them in registers if possible).
    • Load the receiver’s page table root (vspace) into CR3/TTBR.
    • Switch to the receiver’s kernel stack.
    • Restore the receiver’s register state.
    • Return to user mode as the receiver.

    This is a direct context switch – the kernel goes directly from the sender to the receiver without passing through the scheduler. The IPC operation IS the context switch.

  4. Reply cap: The sender’s reply cap is set up so the receiver can reply. In the classic (non-MCS) model, a one-shot reply capability is placed in the receiver’s TCB. The receiver calls seL4_Reply(msg) to send the response directly back.

Performance Characteristics

seL4 IPC is among the fastest measured:

  • ARM (Cortex-A9): ~240 cycles for a Call+Reply round-trip (including two privilege transitions, a full context switch, and message transfer).
  • x86-64: ~380-500 cycles for a Call+Reply round-trip depending on hardware generation.
  • Message size: The fastpath handles small messages (fits in registers). Longer messages require copying from IPC buffer pages and take the slowpath.

For comparison:

  • Linux pipe IPC: ~5,000-10,000 cycles for a round-trip.
  • Mach IPC (macOS XNU): ~3,000-5,000 cycles.
  • L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).

Fastpath Constraints

The fastpath is only taken when ALL of these conditions hold:

  1. The operation is seL4_Call or seL4_ReplyRecv (the two most common IPC operations).
  2. The message fits in message registers (no extra caps, no long messages that require the IPC buffer).
  3. The capability lookup is “simple” – single-level CSpace, direct slot lookup, no guard bits to check.
  4. There IS a thread waiting at the endpoint (no need to block the sender).
  5. The receiver is at sufficient priority (in the non-MCS configuration, higher priority than any other runnable thread – or in MCS, the scheduling context can be donated).
  6. No capability transfer is happening in this message.
  7. Certain bookkeeping conditions are met (no pending operations on either thread, no debug traps, etc.).

When any condition fails, the kernel falls through to the slowpath, which handles the general case correctly but with more overhead (~5-10x slower than the fastpath).

Direct Switch Mechanics

The key insight is: when thread A calls thread B synchronously, A is going to block until B replies. There is no scheduling decision to make – the only correct action is to run B immediately. So the kernel skips the scheduler entirely:

Thread A (running)          Kernel              Thread B (blocked on recv)
    |                         |                        |
    | seL4_Call(ep, msg) ---> |                        |
    |                         | [fastpath]             |
    |                         | Save A's regs          |
    |                         | Copy msg A -> B        |
    |                         | Switch page tables     |
    |                         | Restore B's regs       |
    |                         | ---------------------->|
    |                         |                        | [running, processes msg]
    |                         |                        |
    |                         | <--- seL4_Reply(reply)  |
    |                         | [fastpath again]       |
    |                         | Save B's regs          |
    |                         | Copy reply B -> A      |
    |                         | Switch page tables     |
    |                         | Restore A's regs       |
    | <-----------------------|                        |
    | [running, has reply]    |                        |

The entire round-trip involves exactly two kernel entries and two context switches, with no scheduler invocation.

Implications

  1. RPC is the natural IPC pattern: seL4’s IPC is optimized for the client-server call/reply pattern. Fire-and-forget or multicast patterns require different mechanisms (notifications, shared memory).

  2. Notifications: For async signaling (like interrupts or events), seL4 provides notification objects – a lightweight word-sized bitmask that can be signaled and waited on without message transfer. These are separate from endpoints.

  3. Shared memory for bulk transfer: IPC messages are small (register- sized). For large data transfers, the standard pattern is: set up shared memory, then use IPC to synchronize. This is explicit – the kernel doesn’t transparently copy large buffers.


4. CNode/CSpace Architecture in Detail

CNode Structure

A CNode object is a contiguous array of CSlots in kernel memory. The size is always a power of two. The kernel metadata for a CNode includes:

  • Radix bits: log2 of the number of slots (e.g., radix=8 means 256 slots).
  • Guard value: a bit pattern that must match the CPtr during resolution.
  • Guard bits: the number of bits in the guard.

The total bits consumed during resolution of one CNode level is: guard_bits + radix_bits.

Multi-Level Resolution Example

Consider a two-level CSpace:

Root CNode: guard=0 (0 bits), radix=8 (256 slots)
  Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
    Slot 42 -> Endpoint X

To reach Endpoint X with a 16-bit CPtr at depth 16:

  • CPtr = 0b 00000101 11 101010
  • Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
  • CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
  • CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
  • Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete

CSpace Layout Strategies

Flat: One large root CNode with radix=N, no sub-CNodes. Simple, fast lookup (one level). Wastes memory if the CPtr space is sparse.

Two-level: Small root CNode pointing to sub-CNodes. Common for processes that need moderate capability counts.

Deep: Many levels. Useful for delegation: a supervisor gives a child a cap to a sub-CNode, and the child manages its own CSpace subtree below that point.

Comparison with capOS’s Flat Table

AspectseL4 CSpacecapOS CapTable
StructureTree of CNodesFlat Vec<Option<Arc<dyn CapObject>>>
Lookup costO(depth) memory indirectionsO(1) array index
Sparse supportYes (guards + tree)No (dense array, holes via free list)
Subtree delegationYes (grant CNode cap)No
Memory overheadCNode objects are power-of-2Exact-sized Vec
ComplexityHigh (bit-level CPtr resolution)Low
Capability identityPosition in CSpaceCapId (u32 index)
Verification burdenVery highN/A (Rust safety)

5. MCS (Mixed-Criticality Systems) Scheduling

Background

The original seL4 scheduling model is a simple priority-preemptive scheduler with 256 priority levels and round-robin within each level. This model has a known flaw: priority inversion through IPC. When a high-priority thread calls a low-priority server, the reply might be delayed indefinitely by medium-priority threads preempting the server. The classic solution (priority inheritance) is complex to verify and doesn’t compose well.

The MCS extensions redesign scheduling to solve this and provide temporal isolation.

Key Concepts

Scheduling Context (SC): A new kernel object that represents the “right to execute on a CPU.” An SC holds:

  • A budget (microseconds of CPU time per period)
  • A period
  • A priority
  • Remaining budget in the current period

A thread must have a bound SC to be runnable. Without an SC, a thread cannot execute regardless of its priority.

Reply Object: In the MCS model, the one-shot reply capability from classic seL4 is replaced by an explicit Reply kernel object. When thread A calls thread B:

  1. A’s scheduling context is donated to B.
  2. A reply object is created to hold A’s return path.
  3. B now runs on A’s scheduling context (A’s priority and budget).
  4. When B replies, the SC returns to A.

This solves priority inversion: the server (B) inherits the caller’s priority and budget automatically.

Passive servers: A server thread can exist without its own SC. It only becomes runnable when a client donates an SC via the Call operation. When it replies, it becomes passive again. This is powerful:

  • No CPU time is “reserved” for idle servers.
  • The server executes on the client’s budget – the client pays for the work it requests.
  • Multiple clients can call the same passive server; each brings its own SC.

Temporal Isolation

MCS SCs provide temporal fault isolation:

  • Each SC has a fixed budget/period. A thread cannot exceed its budget in any period. When the budget expires, the thread is descheduled until the next period begins.
  • This is enforced by hardware timer interrupts – the kernel programs the timer to fire when the current SC’s budget expires.
  • A misbehaving (or compromised) component cannot starve other components because its SC bounds its CPU consumption.
  • This works even across IPC: if client A calls server B with A’s SC, the combined execution of A+B is bounded by A’s budget.

Comparison with capOS’s Scheduler

capOS currently has a round-robin scheduler (kernel/src/sched.rs) with no priority levels and no temporal isolation:

#![allow(unused)]
fn main() {
struct Scheduler {
    processes: BTreeMap<Pid, Process>,
    run_queue: VecDeque<Pid>,
    current: Option<Pid>,
}
}

Timer preemption, cap_enter blocking waits, Endpoint IPC, and a baseline direct IPC handoff are implemented. The MCS model is relevant for the next scheduling step because the same priority inversion problem arises when a high-priority client calls a low-priority server through a capability.


6. Relevance to capOS

6.1 Formal Verification

Applicability: Low in the near term. seL4’s verification is done in Isabelle/HOL over C code, which doesn’t transfer to Rust. However, the constraints that verification imposed are valuable design guidance:

  • Minimal kernel: seL4’s ~10K lines of C demonstrate how little code a microkernel actually needs. capOS should resist adding kernel features and instead move them to user space.
  • No kernel heap allocation on the critical path: seL4’s “untyped memory” approach where user space provides all memory is worth studying. capOS has removed the earlier allocation-heavy synchronous ring dispatch path, but it still uses owned kernel objects and preallocated scratch rather than a user-supplied untyped-memory model.
  • No kernel concurrency: seL4 avoids kernel-level concurrency entirely (SMP uses separate kernel instances or a big lock). capOS currently uses spin::Mutex around the scheduler and capability tables. The seL4 approach suggests this is acceptable until/unless per-CPU kernel instances are needed.

Rust alternative: Rust’s type system provides memory safety guarantees that overlap with some of seL4’s verified properties (no buffer overflows, no use-after-free, no null dereference in safe code). This is not a substitute for functional correctness proofs, but it significantly raises the bar compared to unverified C. Ongoing research in Rust formal verification (e.g., Prusti, Creusot, Verus) may eventually enable seL4-style proofs over Rust kernels.

6.2 Capability Model

CNode tree vs. flat table: capOS’s flat CapTable is the right choice for now. seL4’s CNode tree exists to support delegation (granting a subtree of your CSpace to a child) and sparse addressing. capOS’s current model gives each process its own independent flat table and now supports manifest-provided caps plus explicit copy/move transfer descriptors through Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting access to a subset of its caps without copying), it can add a level of indirection:

Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting

Badge/Mint pattern: seL4’s badge mechanism was initially applied to capOS as endpoint receiver metadata: multiple clients could share one endpoint while the server saw a word-sized caller tag. capOS implemented that substrate by adding badge metadata to capability references and hold edges; endpoint CALL delivery reported the invoked hold badge to the receiver, and copy/move transfer preserved badge metadata.

That model is now historical. Badge-as-service-identity was rejected after spawn and shell paths exposed delegated-client relabeling hazards. The active direction is session-bound invocation context: endpoint metadata may remain as internal receiver metadata or hostile-test fixture, but normal shared-service identity should come from process session context, broker-granted service facets, and privacy-bounded disclosure. See docs/proposals/rejected-endpoint-badges-proposal.md and docs/proposals/session-bound-invocation-context-proposal.md.

Current ring SQEs carry cap id and method id separately. The cap table stores badge and transfer-mode metadata alongside the object reference:

#![allow(unused)]
fn main() {
struct CapEntry {
    object: Arc<dyn CapObject>,
    badge: u64,
    transfer_mode: CapTransferMode,
}
}

Revocation (CDT): seL4’s Capability Derivation Tree is its most complex internal structure. For capOS, full CDT-style transitive revocation is probably overkill initially. The service-architecture proposal already identifies simpler alternatives:

  • Generation counters: Each capability has a generation number. Bumping the generation invalidates all references without traversing a tree.
  • Proxy caps: A proxy object that can be invalidated by its creator. Callers hold the proxy, not the real capability.
  • Process-lifetime revocation: When a process dies, all caps it held are automatically invalidated (seL4 does this too, but the CDT allows more fine-grained revocation within a living process).

Untyped memory: seL4’s “no kernel allocation” model via untyped memory is elegant but probably too heavyweight for capOS’s current stage. The key takeaway is the principle: user space should control resource allocation as much as possible. capOS’s FrameAllocator capability already moves frame allocation authority into the capability model.

6.3 IPC Design

This is the most directly actionable area for capOS’s Stage 6.

seL4’s model (synchronous rendezvous + direct switch) vs. capOS’s model (async rings + Cap’n Proto wire format):

AspectseL4capOS
IPC primitiveSynchronous endpointAsync submission/completion rings
Message formatUntyped words in registersCap’n Proto serialized messages
Bulk transferShared memory (explicit)TBD (copy in kernel or shared memory)
Message sizeSmall (register-sized, ~4-8 words)Variable (up to 64KB currently)
Scheduling integrationDirect switch (caller -> callee)Baseline direct IPC handoff implemented
BatchingNo (one message per syscall)Yes (io_uring-style ring)

Key lessons from seL4’s IPC for capOS:

  1. Direct switch for synchronous RPC: Even with async rings, capOS needs a synchronous fast path. The baseline single-CPU direct IPC handoff is implemented for the case where process A calls an Endpoint and process B is blocked waiting in RECV. Future work is register payload transfer and measured fastpath tuning.

  2. Register-based message transfer for small messages: seL4 avoids copying message bytes through kernel buffers for small messages by transferring them through registers during the context switch. capOS currently moves serialized payloads through ring buffers and bounded kernel scratch. For cross-process IPC, minimizing copies is critical. Options:

    • Small messages (<64 bytes) could be transferred in registers during direct switch.
    • Large messages could use shared memory regions (mapped into both address spaces) with IPC used only for synchronization.
    • The io_uring-style rings are already shared memory – the submission and completion ring buffers could potentially be mapped into both the caller’s and callee’s address spaces for zero-copy IPC.
  3. Separate mechanisms for sync and async: seL4 uses endpoints for synchronous IPC and notification objects for async signaling. capOS’s io_uring approach inherently supports batched async operations, but the common case of a simple RPC call-and-wait should have a fast synchronous path too. The two mechanisms complement each other.

  4. Notifications for interrupts and events: seL4’s notification objects (lightweight bitmask signal/wait) map well to capOS’s interrupt delivery model. When a hardware interrupt fires, the kernel signals a notification object, and the driver thread waiting on that notification wakes up. This is cleaner than delivering interrupts as full IPC messages.

The Cap’n Proto dimension: capOS’s use of Cap’n Proto wire format for capability messages is a significant divergence from seL4’s untyped word arrays. Tradeoffs:

  • Pro: Type safety, schema evolution, language-neutral interfaces, built-in serialization/deserialization, native support for capability references in messages (Cap’n Proto has a “capability table” concept in its RPC protocol).
  • Con: Serialization overhead. Even Cap’n Proto’s zero-copy format requires pointer validation and bounds checking that seL4’s raw register transfer does not. For very hot IPC paths, this overhead may be significant.
  • Mitigation: For the hot path, capOS could define a “small message” format that bypasses full capnp serialization – just a few raw words, similar to seL4’s register message. Fall back to full capnp for larger or more complex messages.

6.4 MCS Scheduling

Priority donation via IPC: Directly relevant when capOS implements cross-process capability calls. If process A (high priority) calls a capability in process B (low priority), B needs to run at A’s priority to avoid inversion. The seL4 MCS approach of “donating” the scheduling context with the IPC message is clean and composable.

For capOS, the io_uring model complicates this slightly: if submissions are batched, which submitter’s priority should the server inherit? Options:

  • Inherit the highest priority among pending submissions.
  • Each submission carries its own priority/scheduling context.
  • Use the synchronous fast-path (with donation) for priority-sensitive calls, and the async ring for bulk/background operations.

Passive servers: The MCS concept of servers that only consume CPU when called (by borrowing the caller’s scheduling context) maps well to capOS’s capability-based services. A network stack server that only runs when a client sends a request, consuming the client’s CPU budget, is a natural fit for capOS’s service architecture.

Temporal isolation: Budget/period enforcement prevents denial-of-service between capability holders. Even if process A holds a capability to process B, A cannot cause B to consume unbounded CPU time – B’s execution on behalf of A is bounded by A’s scheduling context budget. This is worth considering for capOS’s roadmap, especially for the cloud deployment scenario where isolation is critical.

6.5 Specific Recommendations for capOS

Near-term (Stages 5-6):

  1. Badge field on cap holds: Done. Manifest CapRef badge metadata is carried into cap-table hold edges, delivered to Endpoint receivers, and preserved across copy/move transfer.

  2. Implement direct-switch IPC for synchronous calls: Baseline done for Endpoint receivers blocked in RECV. Remaining work is the measured fastpath shape, especially small-message register transfer.

  3. Keep the flat CapTable: seL4’s CNode tree complexity is justified by formal verification constraints and subtree delegation. capOS’s flat table is simpler and sufficient. Add proxy/wrapper capabilities for delegation rather than restructuring the table.

  4. Add notification objects: A lightweight signaling primitive (word- sized bitmask, signal/wait operations) for interrupt delivery and event notification. Much cheaper than sending a full capnp message for “wake up, there’s work to do.”

Medium-term (post-Stage 6):

  1. Scheduling context donation: When implementing priority scheduling, attach a scheduling context to IPC calls so servers inherit caller priority. This prevents priority inversion through the capability graph.

  2. Capability rights attenuation: Add a rights mask to capability references so a parent can grant a cap with reduced permissions (e.g., read-only access to a read-write capability). seL4’s rights bits are: Read, Write, Grant (can pass the cap to others), GrantReply (can pass reply cap only).

  3. Revocation via generation/epoch counters: Generation-tagged CapIds catch stale slot reuse, and object-wide epoch revocation now invalidates current child-local grant copies without a seL4-style derivation tree.

Long-term (research directions):

  1. Zero-copy IPC via shared memory: For bulk data transfer between processes, map shared memory regions (Cap’n Proto segments) into both address spaces. Use IPC only for synchronization and capability transfer. This combines seL4’s “shared memory + IPC sync” pattern with capOS’s Cap’n Proto wire format.

  2. Rust-native verification: Track developments in Verus, Prusti, and other Rust verification tools. capOS’s Rust implementation is better positioned for future formal verification than a C implementation would be, given the type system guarantees already present.

  3. Untyped memory model: Consider moving kernel object allocation entirely into capability-gated operations (like seL4’s Retype). User space provides memory for all kernel objects, ensuring the kernel never runs out of memory on its own. This is a significant architectural change but aligns with the “everything is a capability” principle.


Summary Table

seL4 FeatureMaturitycapOS EquivalentRecommended Action
Functional correctness proofProductionNone (Rust type safety)Track Rust verification tools
CNode/CSpace treeProductionFlat CapTableKeep flat
Capability badge/mintProductionHold-edge badgeDone baseline
Revocation (CDT)ProductionGeneration-tagged CapId; object-epoch revocation for child-local grantsKeep epoch revocation instead of adding CDT
Untyped memory / RetypeProductionFrameAllocator capConsider for hardening phase
Synchronous IPC endpointsProductionEndpoint CALL/RECV/RETURNDone baseline
IPC fastpath (direct switch)ProductionDirect IPC handoffDone baseline; tune register payload later
Notification objectsProductionNoneImplement as lightweight signal primitive
MCS Scheduling ContextsProductionRound-robin schedulerImplement SC donation for IPC
Passive serversProductionNoneNatural fit with cap-based services
Temporal isolationProductionNoneConsider for cloud deployment

References

  1. Klein, G., et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
  2. seL4 Reference Manual, versions 12.1.0 and 13.0.0.
  3. “The seL4 Microkernel – An Introduction.” seL4 Foundation Whitepaper, 2020.
  4. Lyons, A., et al. “Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time.” EuroSys 2018.
  5. Heiser, G., & Elphinstone, K. “L4 Microkernels: The Lessons from 20 Years of Research and Deployment.” SOSP 2016.
  6. seL4 source code: https://github.com/seL4/seL4
  7. seL4 API documentation: https://docs.sel4.systems/

seL4 HAMR: Model-Based High-Assurance Engineering

HAMR (High Assurance Modeling and Rapid engineering) is an open-source model-driven development framework for safety-critical embedded systems, developed by the SAnToS Lab at Kansas State University (lead: Prof. John Hatcliff) in collaboration with Collins Aerospace, Dornerworks, and Aarhus University. It was applied on the DARPA CASE (Cyber Assured Systems Engineering) program to generate seL4/C-based applications for UAV mission computing on the Boeing CH-47 Chinook platform.

Primary sources: Sireum HAMR (hamr.sireum.org); Belt et al., “Model-Driven Development for the seL4 Microkernel Using the HAMR Framework” (J. Systems Architecture, 2022); Hatcliff et al., “HAMR: An AADL Multi-Platform Code Generation Toolset” (ICSA 2021); Hatcliff et al., seL4 Summit 2025 keynote (“Model-based Development for seL4 Microkit/Rust with Integrated Formal Methods using HAMR”); GUMBO contract language (Galois / SAnToS Lab); seL4 Foundation CAmkES documentation.


1. What HAMR Is

HAMR operates across three development layers:

  1. Architecture modeling – The system is specified in AADL (SAE AS5506) or SysMLv2. The model captures component topology, port-based communication, timing/scheduling properties (periodic, sporadic, aperiodic threads), and GUMBO behavioral contract annotations.

  2. Code generation – HAMR generates deployment infrastructure (inter-component communication glue, tasking, platform configuration) and typed component skeletons developers fill with application logic. Output languages: Slang, C, and (as of 2025) Rust.

  3. Verification infrastructure – GUMBO model contracts are translated to source-level contracts for Logika (Slang), Verus (Rust), and executable property-based test oracles.

Platform backends: JVM, Linux, seL4/CAmkES (C), and seL4 Microkit (Rust, 2025 work).


2. The AADL Model

AADL is an SAE international standard (AS5506) for architecture description of embedded, real-time, safety-critical systems. Key concepts relevant to the HAMR/seL4 mapping:

  • Components: hierarchically typed – system, process, thread, device, data, subprogram, bus. thread is the unit of concurrent execution; process is the protected address space containing one or more threads.

  • Ports: typed communication endpoints attached to components.

    • Data port: most-recent-value semantics; sender writes, receiver reads at next dispatch.
    • Event port: queued notification with no data payload.
    • Event data port: queued notification with a typed data payload.
  • Connections: directed edges between compatible ports that define the data-flow and event-flow topology of the system. Connections are typed and directional; the model enforces that only compatible port kinds connect.

  • Properties: attach timing, scheduling, size, and other non-functional attributes to components and connections (e.g. Dispatch_Protocol => Periodic, Period => 10 ms, Queue_Size => 8).

  • Behavior Annex (SAE AS5506/3): an optional state-machine sub-language for attaching internal behavioral specifications to components, formalizing the implicit execution semantics of threads.

  • GUMBO: a contract language (developed by Galois and KSU) that extends AADL with requires / guarantees / compute clauses attached to component implementations, serving as the model-level precondition/postcondition and data-invariant language. GUMBO integrates with the AADL Behavior Annex and is translated by HAMR into Slang/Logika contracts or Verus proof obligations.


3. The seL4/CAmkES Pipeline

HAMR starts from the AADL instance model (as produced by OSATE, the open-source AADL editor) and generates:

3.1 CAmkES Topology Specification

HAMR generates the complete CAmkES .camkes file describing the deployment topology. The mapping is:

AADL conceptCAmkES / seL4 concept
process componentCAmkES component (seL4 protection domain / “partition”)
thread componentCAmkES component with seL4 domain assignment (1-to-1)
thread scheduling domainseL4 domain scheduler domain ID
Data port (sender → receiver)CAmkES dataport (shared memory), write-only cap on sender, read-only cap on receiver
Event/event-data portCAmkES notification or queue construct
Connection (A.out → B.in)CAmkES connection with read/write permission split

The key isolation invariant: CAmkES read/write permission specifications are used to configure the seL4 kernel to enforce the directionality of AADL ports. The sender component holds a write-only capability to the shared dataport; the receiver holds a read-only capability. The kernel enforces this at the capability level – no bypass is possible without a new capability grant.

3.2 Scheduling

AADL’s timing model (periodic/sporadic threads with bounded periods and deadlines) maps to the seL4 domain scheduler. Each AADL thread gets a static domain assignment. On the DARPA CASE work, time partitioning was enforced via the domain scheduler: each thread’s time slice is determined at build time from the AADL timing properties and the domain schedule is generated as part of the HAMR output.

3.3 Generated Component Skeletons

For each AADL thread, HAMR generates:

  • A component skeleton with initialize, compute (or timeTriggered/ eventTriggered), and finalize entry points.
  • Port API stubs: get_<portName>(), put_<portName>() functions that hide CAmkES shared-memory / notification mechanics behind a typed, uniform interface. This API is identical in shape across JVM, Linux, and seL4 backends – developer code calls the same interface regardless of platform.

3.4 Slang Reference Implementation

HAMR’s C skeleton APIs are derived from the Slang reference implementation. Slang is Sireum’s safety-critical subset of Scala: immutable-by-default, bounded loops, no reflection, and a restricted type system suited to Logika verification. The Slang implementation serves as a verified reference that the C and Rust backends are expected to match semantically.

3.5 The 2025 seL4 Microkit / Rust Extension

As of the 2025 seL4 Summit work (DARPA PROVERS INSPECTA project), HAMR generates Rust component skeletons deployable in seL4 Microkit protection domains. HAMR auto-generates the Microkit system description file, developer- facing channel/notification APIs for Rust threads, and Verus contract stubs from GUMBO model annotations. This is an active development track; the C/CAmkES backend is the more mature path.


4. Verification Model

HAMR’s verification approach is layered:

  1. GUMBO model contracts: requires/guarantees clauses on AADL components capture the intended behavioral contract at the architecture level. These are part of the model, not the code.

  2. Translated code contracts: HAMR translates GUMBO into Slang/Logika proof obligations or Verus specifications. The translation preserves the model-level contract’s semantic intent in the target language’s contract system.

  3. Logika / Verus verification: Tools verify that the developer’s component implementation satisfies the translated contracts. Logika operates on Slang; Verus operates on Rust.

  4. Property-based test oracles: HAMR also generates executable test harnesses that check GUMBO contract conformance at runtime, complementing formal verification with systematic testing.

  5. seL4 kernel verification: The underlying seL4 kernel is formally verified (machine-checked proof of functional correctness in Isabelle/HOL covering integrity and confidentiality). HAMR sits above this: its generated CAmkES specification maps to a seL4 capability topology that the verified kernel enforces. The combination targets the argument that the system’s isolation structure (as modeled in AADL) is correctly realized by the verified kernel.

The assurance case HAMR targets is roughly: AADL model (structural) + GUMBO (behavioral) + Logika/Verus (code-level conformance) + seL4 (kernel-level isolation proof) → high-assurance system suitable for DO-178C / DO-331 objectives. This layered argument is the distinguishing feature versus a conventional RTOS-based development process.


5. Applicability to capOS

5.1 Where the Approaches Align

Both HAMR and capOS treat the formal interface definition as the authoritative contract layer. HAMR uses the AADL model + GUMBO contract annotations; capOS uses the Cap’n Proto schema. Both insist that the interface is the permission: in HAMR, an AADL connection determines which component can write to which port, and the generated CAmkES capability configuration enforces that topology; in capOS, holding a capability to a CapObject determines what methods a caller can invoke, and narrower capabilities enforce tighter access.

Both generate typed, platform-adapted communication glue from the interface definition. HAMR generates port API stubs and CAmkES/Microkit configuration; capOS generates (via capnpc + capos-rt) the typed method dispatch layer that clients call.

5.2 Static vs. Dynamic Capability Topology

The sharpest structural difference: HAMR produces a closed, static topology. All components, connections, and capability distributions are fixed at build time. CAmkES explicitly does not allow runtime changes – the set of components and their communication channels is defined at system configuration time and instantiated at boot. This is intentional: the full topology can be statically analyzed, and the seL4 capability distribution can be checked against a capDL (capability distribution language) model as part of the assurance case.

capOS is designed around dynamic capability routing. The kernel acts as a capnp-rpc router; new capabilities can be forged by authorized processes, transferred via Move/Copy grants, and held in per-process CapTables that grow and change at runtime. The ProcessSpawner, AuthorityBroker, and SessionManager capabilities enable runtime-created service graphs. This is not a weakness – it is the whole point of a capability-rpc OS – but it means the topology at any moment is not checkable against a static model.

For capOS’s current research target, the dynamic model is the right fit. For a flight-critical avionics partition, the static model is the right fit. These are different points on the assurance-vs-flexibility tradeoff.

5.3 Generated Glue vs. Manual CapObject Dispatch

In HAMR, the developer writes only application logic in initialize/ compute/finalize entry points; all communication infrastructure is generated. The developer-visible API is uniform across backends – the same get_altimeter() call works on JVM, Linux, and seL4.

In capOS, capability dispatch is currently manual: each CapObject implementation handles capnp message bytes directly via match-on-method-ID. The typed client wrappers in capos-rt abstract this for callers, but the server-side skeleton is hand-written per capability type. HAMR’s approach suggests an achievable improvement: if capOS had a capnpc plugin or a build tool that generated CapObject dispatch stubs and server-side skeletons from .capnp schemas, the authoring burden per capability type would shrink significantly. The schema already carries everything needed to generate the match arm, parameter decode, and return encode.

5.4 Model-Driven Partition Generation

HAMR demonstrates the utility of driving the entire partition topology – not just per-component skeletons – from the model. The CAmkES .camkes file, the domain schedule, the capability permission split, and the component binaries all originate from a single AADL instance model. This is “the model is the system” in the most literal sense.

capOS has no equivalent today. Service-graph topology is described in CUE/AADL manifests and executed dynamically by init. For future high-assurance work (e.g., flight-critical or safety-certified deployments), a model-driven generation step that produces both the system.cue manifest and the capability grant topology from a formal model would be directly applicable. The capnp schema would serve as the interface contract (as it already does), while a system-level architecture model would specify the instantiation and wiring.

5.5 Contract Verification Gap

HAMR demonstrates a full contract pipeline: model annotation (GUMBO) → generated code contracts (Logika / Verus) → formal verification. capOS has no equivalent for CapObject implementations. The .capnp schema defines the method signatures and types, but there is no Logika/Verus-style annotation layer for pre/postconditions on individual capability method handlers.

For a research OS this is acceptable – capOS’s assurance comes from seL4-style kernel isolation, not from verified component behavior. But the HAMR model shows what the path to component-level behavioral verification looks like when starting from a schema-as-contract baseline.

5.6 AADL vs. Cap’n Proto as the Schema Layer

AADL carries significantly more non-functional information than Cap’n Proto schemas: scheduling properties (period, deadline, dispatch protocol), port queue depths, memory footprint bounds, required hardware (device associations), and safety annex annotations. Cap’n Proto schemas carry method signatures, field types, and (via annotations) some semantic metadata, but scheduling and resource-budget properties are out of scope for the format.

For capOS’s current use – typed RPC dispatch, schema-stable ABI, and code-generation for typed clients and server stubs – Cap’n Proto is the right tool. AADL is not a replacement: it is a higher-level architecture modeling language that sits above the RPC schema layer and consumes it. A future model-driven capOS toolchain would use AADL or SysMLv2 at the system level and capnp schemas at the interface level, not choose one over the other.


6. Open Questions for Future Evaluation

  • capnpc → CapObject stub generation: Given that the capnp schema fully describes method signatures, types, and return shapes, how much of the server-side CapObject dispatch boilerplate could a code-gen plugin eliminate? HAMR’s generated skeletons suggest this is tractable.

  • System-manifest generation from a topology model: Could a lightweight AADL-or-SysMLv2 instance model (or a CUE-native equivalent) generate the system.cue manifest, the initial CapTable grants, and a capDL-style verification model for the static portion of the system graph?

  • GUMBO-inspired contract annotations in capnp schemas: Could capnp annotation syntax be used to attach precondition/postcondition stubs (analogous to GUMBO’s requires/guarantees) to interface methods, enabling future Verus or Creusot verification of CapObject implementations?

  • seL4 Microkit vs. CAmkES: The 2025 HAMR work migrates from CAmkES to the newer seL4 Microkit, which uses Rust and a simpler protection-domain model. If capOS ever targets seL4 as an optional verified kernel backend, Microkit + HAMR would be the current recommended entry point.


Sources

Fuchsia Zircon Kernel: Research Report for capOS

Research into Zircon’s design for informing capOS capability model, IPC, virtual memory, async I/O, and interface definition decisions.

1. Handle-Based Capability Model

Overview

Zircon implements capabilities as handles. A handle is a process-local integer (similar to a Unix file descriptor) that references a kernel object and carries a bitmask of rights. The kernel maintains a per-process handle table that maps handle values to (kernel_object_pointer, rights) pairs. Processes can only interact with kernel objects through handles they hold.

There is no ambient authority in Zircon. A process cannot address kernel objects by name, path, or global ID – it must possess a handle. The initial set of handles is passed to a process at creation time by its parent (or by the component framework).

Handle Representation

Internally, a handle is:

  • A process-local 32-bit integer (the “handle value”). The low two bits encode a generation counter to detect use-after-close.
  • A reference to a kernel object (refcounted Dispatcher in Zircon’s C++).
  • A rights bitmask (zx_rights_t, a uint32_t).

The handle table is per-process, so handle value 0x1234 in process A and 0x1234 in process B refer to completely different objects (or nothing).

Rights

Rights are a bitmask that constrain what operations a handle can perform. Key rights include:

RightMeaning
ZX_RIGHT_DUPLICATECan be duplicated via zx_handle_duplicate()
ZX_RIGHT_TRANSFERCan be sent through a channel
ZX_RIGHT_READCan read data (channel messages, VMO bytes)
ZX_RIGHT_WRITECan write data
ZX_RIGHT_EXECUTEVMO can be mapped as executable
ZX_RIGHT_MAPVMO can be mapped into a VMAR
ZX_RIGHT_GET_PROPERTYCan query object properties
ZX_RIGHT_SET_PROPERTYCan modify object properties
ZX_RIGHT_SIGNALCan set user signals on the object
ZX_RIGHT_WAITCan wait on the object’s signals
ZX_RIGHT_MANAGE_PROCESSCan perform management ops on a process
ZX_RIGHT_MANAGE_THREADCan manage threads

When a syscall is invoked on a handle, the kernel checks that the handle’s rights include the rights required by that syscall. For example, zx_channel_write() requires ZX_RIGHT_WRITE on the channel handle.

Rights can only be reduced, never amplified. zx_handle_duplicate() takes a rights mask and the new handle gets original_rights & requested_rights.

Handle Lifecycle

Creation: Syscalls that create kernel objects return handles. For example, zx_channel_create() returns two handles (one for each endpoint). zx_vmo_create() returns a VMO handle. The initial rights are defined per object type (e.g., a new channel endpoint gets READ|WRITE|TRANSFER|DUPLICATE|SIGNAL|WAIT).

Duplication: zx_handle_duplicate(handle, rights) -> new_handle. Creates a second handle to the same kernel object, possibly with reduced rights. The original is untouched. Requires ZX_RIGHT_DUPLICATE on the source handle.

Transfer: Handles are transferred through channels. When a message is written to a channel, handles listed in the message are moved from the sender’s handle table to a transient state inside the channel message. When the message is read, those handles are installed into the receiver’s handle table with new handle values. The original handle values in the sender become invalid. Transfer requires ZX_RIGHT_TRANSFER on each handle being sent.

Replacement: zx_handle_replace(handle, rights) -> new_handle. Atomically invalidates the old handle and creates a new one with the specified rights (must be a subset). This avoids a window where two handles exist simultaneously (unlike duplicate-then-close). Useful for reducing rights before transferring.

Closing: zx_handle_close(handle). Removes the handle from the process’s table and decrements the kernel object’s refcount. When the last handle to an object is closed, the object is destroyed (with some exceptions like the kernel itself keeping references).

Comparison to capOS

capOS’s current CapTable maps CapId (u32) to an Arc<dyn CapObject>. The shared Arc lets a single kernel capability (for example, a kernel:endpoint owned by one service and referenced by another through CapSource::Service) back multiple per-process CapTable slots for cross-process IPC. This is conceptually similar to Zircon’s handle table, but with key differences:

AspectZirconcapOS (current)
RightsBitmask per handleNone (all-or-nothing)
Object typesFixed kernel types (Channel, VMO, etc.)Extensible via CapObject trait
TransferMove semantics through channelsCopy/move descriptors through Endpoint IPC
DuplicationExplicit with rights reductionCopy transfer for transferable holds
RevocationClose handle; object dies with last refRemove from table; no propagation
InterfaceFixed syscall per object typeCap’n Proto method dispatch
Generation counterLow bits of handle valueUpper bits of CapId

Recommendations for capOS:

  1. Keep method authority in typed interfaces for now. Zircon’s rights bitmask is useful for an untyped syscall surface. capOS currently uses narrow Cap’n Proto interfaces plus hold-edge transfer metadata; generic READ/WRITE flags would duplicate schema-level authority unless a concrete cross-interface need appears.

  2. Handle generation counters. Implemented: capOS encodes a generation tag in the upper bits of CapId, with lower bits selecting the table slot. This catches stale CapId use after slot reuse.

  3. Move semantics for transfer. Implemented for Endpoint CALL/RETURN sideband descriptors. Copy transfer remains explicit and requires a transferable source hold.

  4. replace operation. An atomic replace (invalidate old, create new with reduced rights) is cleaner than duplicate-then-close for rights attenuation before transfer.

2. Channels

Overview

Zircon channels are the fundamental IPC primitive. A channel is a bidirectional, asynchronous message-passing pipe with two endpoints. Each endpoint is a separate kernel object referenced by a handle.

Creation and Structure

zx_channel_create(options, &handle0, &handle1) creates a channel and returns handles to both endpoints. Each endpoint can be independently transferred to different processes. When one endpoint is closed, the other becomes “peer-closed” (signaled with ZX_CHANNEL_PEER_CLOSED).

Message Format

A channel message consists of:

  • Data: Up to 65,536 bytes (64 KiB) of arbitrary byte payload.
  • Handles: Up to 64 handles transferred with the message.

Messages are discrete and ordered (FIFO). There is no streaming or partial reads – you read a complete message or nothing.

Write and Read Syscalls

Write: zx_channel_write(handle, options, bytes, num_bytes, handles, num_handles)

  • Copies bytes into the kernel message queue.
  • Moves each handle in the handles array from the caller’s handle table into the message. If any handle is invalid or lacks ZX_RIGHT_TRANSFER, the entire write fails and no handles are moved.
  • The write is non-blocking. If the peer has been closed, returns ZX_ERR_PEER_CLOSED.

Read: zx_channel_read(handle, options, bytes, handles, num_bytes, num_handles, actual_bytes, actual_handles)

  • Dequeues the next message. Copies data into bytes, installs handles into the caller’s handle table, writing new handle values into the handles array.
  • If the buffer is too small, returns ZX_ERR_BUFFER_TOO_SMALL and fills actual_bytes/actual_handles so the caller can retry with a larger buffer.
  • Non-blocking by default.

zx_channel_call: A synchronous call primitive. Writes a message to the channel, then blocks waiting for a reply with a matching transaction ID. This is the primary mechanism for client-server RPC. The kernel optimizes this path to avoid unnecessary scheduling: if the server thread is waiting to read, the kernel can directly switch to it (similar to L4 IPC optimizations).

Handle Transfer Mechanics

When handles are sent through a channel:

  1. The kernel validates all handles (exist, have TRANSFER right).
  2. Handles are atomically removed from the sender’s table.
  3. Handle objects are stored inside the kernel message structure.
  4. On read, handles are inserted into the receiver’s table with fresh handle values.
  5. If the channel is destroyed with unread messages containing handles, those handles are closed (objects’ refcounts decremented).

This is critical: handle transfer is move, not copy. The sender loses the handle. To keep a copy, the sender must duplicate before sending.

Signals

Each channel endpoint has associated signals:

  • ZX_CHANNEL_READABLE – at least one message is queued.
  • ZX_CHANNEL_PEER_CLOSED – the other endpoint was closed.

Processes can wait on these signals using zx_object_wait_one(), zx_object_wait_many(), or by binding to a port (see Section 4).

FIDL Relationship

Channels carry raw bytes + handles. FIDL (Section 5) provides the structured protocol layer on top: it defines how bytes are laid out (message header with transaction ID, ordinal, flags; then the payload) and how handles in the message correspond to protocol-level concepts (client endpoints, server endpoints, VMOs, etc.).

Every FIDL protocol communication happens over a channel. A FIDL “client end” is a channel endpoint handle where the client sends requests and reads responses. A “server end” is the other endpoint where the server reads requests and sends responses.

Comparison to capOS

capOS currently uses shared submission/completion rings with Endpoint objects for cross-process CALL/RECV/RETURN routing. Same-process capabilities dispatch directly through the holder’s table; cross-process Endpoint calls queue to the server ring and can trigger a direct IPC handoff when the receiver is blocked.

AspectZircon ChannelscapOS
TopologyPoint-to-point, 2 endpointsEndpoint-routed capability calls
AsyncNon-blocking read/write + signal waitsShared SQ/CQ rings
Handle/cap transferEmbedded in messagesSideband transfer descriptors
Message formatRaw bytes + handlesCap’n Proto serialized
Size limits64 KiB data, 64 handles64 KiB params (current limit)
BufferingKernel-side message queueEndpoint queues plus per-process rings

Recommendations for capOS:

  1. Capability transfer alongside capnp messages. Zircon embeds handles as out-of-band data alongside message bytes. capOS has adopted the same separation with ring sideband transfer descriptors and result-cap records. That keeps the kernel from parsing arbitrary Cap’n Proto payload graphs.

  2. Two-endpoint channels vs. Endpoint calls. Zircon’s channels are general-purpose pipes. capOS uses a lighter Endpoint CALL/RECV/RETURN model where a capability invocation is routed to the serving process rather than requiring a channel object per connection.

  3. Message size limits. Zircon’s 64 KiB limit has been a pain point (large data must go through VMOs). capOS’s capnp messages naturally handle this because large data can be a separate VMO-like capability referenced in the message. Keep the per-message limit reasonable (64 KiB is a good default) and use capability references for bulk data.

3. VMARs and VMOs

Virtual Memory Objects (VMOs)

A VMO is a kernel object representing a contiguous region of virtual memory that can be mapped into address spaces. VMOs are the fundamental unit of memory in Zircon.

Types:

  • Paged VMO: Backed by the page fault handler. Pages are allocated on demand. This is the default.
  • Physical VMO: Backed by a specific contiguous range of physical memory. Used for device MMIO.
  • Contiguous VMO: Like a paged VMO but guarantees physically contiguous pages. Used for DMA.

Key operations:

  • zx_vmo_create(size, options) -> handle: Create a paged VMO.
  • zx_vmo_read(handle, buffer, offset, length): Read bytes from a VMO.
  • zx_vmo_write(handle, buffer, offset, length): Write bytes to a VMO.
  • zx_vmo_get_size() / zx_vmo_set_size(): Query/resize.
  • zx_vmo_op_range(): Operations like commit (force-allocate pages), decommit (release pages back to system), cache ops.

VMOs can be read/written directly via syscalls without mapping them. This is useful for small transfers but less efficient than mapping for large data.

Copy-on-Write (CoW) Cloning

zx_vmo_create_child(handle, options, offset, size) -> child_handle

Creates a child VMO that is a CoW clone of a range within the parent. Several clone types exist:

  • Snapshot (ZX_VMO_CHILD_SNAPSHOT): Point-in-time snapshot. Both parent and child see CoW pages. Writes to either side trigger page copies. The child is fully independent after creation – closing the parent does not affect committed pages in the child.

  • Slice (ZX_VMO_CHILD_SLICE): A window into the parent. No CoW – writes to the slice are visible through the parent and vice versa. The child cannot outlive the parent.

  • Snapshot-at-least-on-write (ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE): Like snapshot but allows the implementation to share unchanged pages between parent and child more aggressively (pages only diverge when written).

CoW cloning is central to how Fuchsia implements fork()-like semantics for memory (though Fuchsia doesn’t have fork()) and how it shares immutable data (e.g., shared libraries are CoW-cloned VMOs).

Virtual Memory Address Regions (VMARs)

A VMAR represents a contiguous range of virtual address space within a process. VMARs form a tree rooted at the process’s root VMAR, which covers the entire user-accessible address space.

Hierarchy:

Root VMAR (entire user address space)
  +-- Sub-VMAR A (e.g., 0x1000..0x10000)
  |     +-- Mapping of VMO X at offset 0x1000
  |     +-- Sub-VMAR B (0x5000..0x8000)
  |           +-- Mapping of VMO Y at offset 0x5000
  +-- Sub-VMAR C (0x20000..0x30000)
        +-- Mapping of VMO Z at offset 0x20000

Key operations:

  • zx_vmar_map(vmar, options, offset, vmo, vmo_offset, len) -> addr: Map a VMO (or a range of it) into the VMAR at a specific offset or let the kernel choose (ASLR).
  • zx_vmar_unmap(vmar, addr, len): Remove a mapping.
  • zx_vmar_protect(vmar, options, addr, len): Change permissions (read/write/execute) on a mapped range.
  • zx_vmar_allocate(vmar, options, offset, size) -> child_vmar, addr: Create a sub-VMAR.
  • zx_vmar_destroy(vmar): Recursively unmap everything and destroy all sub-VMARs. Prevents new mappings.

ASLR: Zircon implements address space layout randomization through VMARs. When ZX_VM_OFFSET_IS_UPPER_LIMIT or no specific offset is given, the kernel randomizes placement within the VMAR.

Permissions: Mapping permissions (R/W/X) are constrained by the VMO handle’s rights. A VMO handle without ZX_RIGHT_EXECUTE cannot be mapped as executable, regardless of what the zx_vmar_map() call requests.

Why VMARs Matter

VMARs provide:

  1. Sandboxing within a process. A component can be given a sub-VMAR handle instead of the root VMAR, limiting where it can map memory.
  2. Hierarchical cleanup. Destroying a VMAR recursively unmaps everything beneath it.
  3. Controlled mapping. The parent decides the address space layout for child components by allocating sub-VMARs and passing only sub-VMAR handles.

Comparison to capOS

capOS currently has AddressSpace plus a VirtualMemory capability for anonymous map/unmap/protect operations. FrameAllocator returns typed MemoryObject ownership caps rather than raw physical frame grants, but MemoryObject does not yet provide mapping, cloning, or zero-copy sharing.

AspectZirconcapOS (current)
Memory objectsVMO (paged, physical, contiguous)Owned MemoryObject caps plus anonymous VirtualMemory mappings
CoWVMO child clones (snapshot, slice)Not implemented
Address spaceVMAR treeFlat AddressSpace plus VirtualMemory cap
SharingMap same VMO in multiple processesNot implemented
PermissionsPer-mapping + per-handle rightsPer-page flags at mapping time

Recommendations for capOS:

  1. VMO-equivalent capability. A “MemoryObject” capability that represents a range of memory (backed by demand-paging or physical pages). This becomes the unit of sharing: pass a MemoryObject cap through IPC, and the receiver maps it into their address space. Define it in schema/capos.capnp.

  2. Sub-VMAR capabilities for sandboxing. When spawning a process, instead of granting access to the full address space, grant a sub-region capability. This limits where the process can map memory.

  3. CoW cloning is valuable but not urgent. The primary use case (shared libraries, fork) may not apply to capOS’s early stages. Design the VMO interface to support cloning later.

  4. VMO read/write without mapping. Zircon allows reading/writing VMO contents via syscall without mapping. This is useful for small IPC data and avoids TLB pressure. Consider supporting this in capOS’s MemoryObject.

4. Async Model (Ports)

Overview

Zircon’s async I/O model is built around ports – kernel objects that receive event packets. A port is similar to Linux’s epoll but with important differences. It is the foundation for all async programming in Fuchsia.

Port Basics

A port is a kernel object with a queue of packets (zx_port_packet_t). Packets arrive either from signal-based waits or from direct user queuing.

Key operations:

  • zx_port_create(options) -> handle: Create a port.
  • zx_port_wait(port, deadline) -> packet: Dequeue the next packet, blocking until one is available or the deadline expires.
  • zx_port_queue(port, packet): Manually enqueue a user packet.
  • zx_port_cancel(port, source, key): Cancel pending waits.

Signal-Based Async (Object Wait Async)

zx_object_wait_async(object, port, key, signals, options):

This is the primary mechanism. It tells the kernel: “when object has any of these signals asserted, deliver a packet to port with this key.”

Two modes:

  • One-shot (ZX_WAIT_ASYNC_ONCE): The wait fires once and is automatically removed. The user must re-register after handling.
  • Edge-triggered (ZX_WAIT_ASYNC_EDGE): Fires each time a signal transitions from deasserted to asserted. Stays registered.

Packet Format

typedef struct zx_port_packet {
    uint64_t key;              // User-defined key (set during wait_async)
    uint32_t type;             // ZX_PKT_TYPE_SIGNAL_ONE, ZX_PKT_TYPE_USER, etc.
    zx_status_t status;        // Result status
    union {
        zx_packet_signal_t signal;   // Which signals triggered
        zx_packet_user_t user;       // User-queued packet payload (32 bytes)
        zx_packet_guest_bell_t guest_bell;
        // ... other packet types
    };
} zx_port_packet_t;

The signal variant includes trigger (which signals were waited on), observed (current signal state), and a count (for edge-triggered, how many transitions).

Async Dispatching (libasync)

Fuchsia’s userspace async library (libfidl, async-loop) provides a higher-level event loop:

  1. async::Loop: An event loop that owns a port and dispatches events to registered handlers.
  2. async::Wait: Wraps zx_object_wait_async() with a callback. When the signal fires, the loop calls the handler.
  3. async::Task: Runs a closure on the loop’s dispatcher.
  4. FIDL bindings: The async FIDL bindings register channel-readable waits on the loop’s port. When a message arrives, the FIDL dispatcher decodes it and calls the appropriate protocol method handler.

The typical pattern:

loop = async::Loop()
loop.port -> zx_port_create()

// Register interest in channel readability
zx_object_wait_async(channel, loop.port, key, ZX_CHANNEL_READABLE)

// Event loop
while True:
    packet = zx_port_wait(loop.port)
    handler = lookup(packet.key)
    handler(packet)
    // Re-register if one-shot

Comparison to Linux io_uring

AspectZircon PortsLinux io_uring
ModelEvent notification (signals)Operation submission/completion
SubmissionNo SQ; operations are separate syscallsSQ ring: batch operations
CompletionPort packet queueCQ ring in shared memory
Kernel transitionsOne per wait_async + one per port_waitOne per io_uring_enter (batched)
Memory sharingNo shared ring buffersSQ/CQ are mmap’d shared memory
Zero-copyNot for port packetsRegistered buffers, fixed files
BatchingNo inherent batchingCore design: submit N ops, one syscall
ChainingNot supportedSQE linking (sequential/parallel)
ScopeSignal notification onlyFull I/O operations (read, write, send, recv, fsync, …)

Key differences:

  1. Ports are notification-based; io_uring is operation-based. A port tells you “something happened” (a signal was asserted), then you do separate syscalls to act on it (read the channel, accept the socket, etc.). io_uring lets you submit the actual I/O operation and the kernel does it asynchronously, returning the result in the completion ring.

  2. io_uring avoids syscalls for submission. The submission queue is shared memory – userspace writes SQEs and the kernel reads them without a syscall (in polling mode) or with a single io_uring_enter() for a batch of operations. Ports require a syscall per wait_async registration.

  3. io_uring supports chaining. SQE linking allows dependent operations (e.g., “read from file, then write to socket”) without returning to userspace between steps.

  4. Ports are simpler. The signal model is straightforward and composes well with Zircon’s object model. io_uring’s complexity (dozens of opcodes, registered buffers, fixed files, kernel-side polling) is much higher.

Performance Tradeoffs

Ports:

  • Pro: Simple, well-integrated with kernel object model, easy to reason about.
  • Con: Extra syscalls per operation (wait_async to register, port_wait to receive, then the actual operation syscall). At least 3 syscalls per async operation.

io_uring:

  • Pro: Can batch many operations in a single syscall. Shared-memory rings avoid copies. Kernel-side polling can eliminate syscalls entirely.
  • Con: Complex API surface, security attack surface (many kernel bugs have been in io_uring), complex state management.

Comparison to capOS’s Planned Async Rings

capOS plans io_uring-inspired capability rings: an SQ where userspace submits capnp-serialized capability invocations and a CQ where the kernel posts completions.

AspectZircon PortscapOS Planned Rings
SubmissionSeparate syscallsSQ in shared memory
CompletionPort packet queue (kernel-owned)CQ in shared memory
Operation scopeSignal notification onlyFull capability invocations
BatchingNoneNatural (fill SQ, single syscall)
Wire formatFixed packet structCap’n Proto messages

Recommendations for capOS:

  1. The io_uring model is better than ports for capOS’s use case. Since every operation in capOS is a capability invocation (not just a signal notification), putting the full operation in the submission ring eliminates the extra round-trip that ports require. This is the right choice.

  2. Keep a signal/notification mechanism too. Even with async rings, capOS needs a way to wait for events (e.g., “data available on this channel”, “process exited”). Consider a simple signal/wait mechanism alongside the capability rings – perhaps signal delivery goes through the CQ as a special completion type.

  3. Study io_uring’s SQE linking. Chaining dependent capability calls (e.g., “read from FileStore, then write to Console”) without returning to userspace is powerful. This maps naturally to Cap’n Proto promise pipelining: “call method A on cap X, then call method B on the result’s capability” – the kernel can chain these internally.

  4. Registered/fixed capabilities. io_uring has “fixed files” (registered fd set for faster lookup). capOS could have a “hot set” of capabilities pinned in the SQ context for faster dispatch (avoid per-call table lookup).

  5. Completion ordering. io_uring completions can arrive out of order. capOS’s CQ should also support out-of-order completion (each SQE has a user_data tag echoed in the CQE) to enable true async pipelining.

5. FIDL (Fuchsia Interface Definition Language)

Overview

FIDL is Fuchsia’s IDL for defining protocols that communicate over channels. It serves a similar role to Cap’n Proto schemas in capOS: defining the contract between client and server.

FIDL vs. Cap’n Proto: Schema Language

FIDL example:

library fuchsia.example;

type Color = strict enum : uint32 {
    RED = 1;
    GREEN = 2;
    BLUE = 3;
};

protocol Painter {
    SetColor(struct { color Color; }) -> ();
    DrawLine(struct { x0 float32; y0 float32; x1 float32; y1 float32; }) -> ();
    -> OnPaintComplete(struct { num_pixels uint64; });
};

Equivalent Cap’n Proto:

enum Color { red @0; green @1; blue @2; }

interface Painter {
    setColor @0 (color :Color) -> ();
    drawLine @1 (x0 :Float32, y0 :Float32, x1 :Float32, y1 :Float32) -> ();
}

Key differences in the schema language:

FeatureFIDLCap’n Proto
Unionsflexible union, strict unionAnonymous unions in structs
Enumsstrict enum, flexible enumenum (always strict)
Optionalitybox<T>, nullable typesDefault values, union with Void
Evolutionflexible keyword for forward compatField numbering, @N ordinals
Tablestable (like protobuf, sparse)struct with default values
Events-> EventName(...) server-sentNo built-in events
Error syntax-> () error uint32Must encode in return struct
Capability typesclient_end:P, server_end:Pinterface P as field type

FIDL’s table type is analogous to Cap’n Proto structs in terms of evolvability (can add fields without breaking), but Cap’n Proto structs are more compact on the wire (fixed-size inline section + pointers) while FIDL tables use an envelope-based encoding.

Wire Format Comparison

FIDL wire format:

  • Little-endian, 8-byte aligned.
  • Messages have a 16-byte header: txid (4 bytes), flags (3 bytes), magic byte (0x01), ordinal (8 bytes).
  • Structs are laid out inline with natural alignment and explicit padding.
  • Out-of-line data (strings, vectors, tables) uses offset-based indirection via “envelopes” (inline 8-byte entry: 4 bytes num_bytes, 2 bytes num_handles, 2 bytes flags).
  • Handles are out-of-band. The wire format contains ZX_HANDLE_PRESENT (0xFFFFFFFF) or ZX_HANDLE_ABSENT (0x00000000) markers where handles appear. The actual handles are in the channel message’s handle array, consumed in order of appearance in the linearized message.
  • Encoding is done into a contiguous byte buffer + a separate handle array, matching the channel write API.
  • No pointer arithmetic. FIDL v2 uses a “depth-first traversal order” encoding where out-of-line objects are laid out sequentially. Offsets are not stored; the decoder walks the type schema to find boundaries.

Cap’n Proto wire format:

  • Little-endian, 8-byte aligned (word-based).
  • Messages have a segment table header listing segment sizes.
  • Structs have a fixed data section + pointer section. Pointers are relative offsets (self-relative, in words).
  • Uses pointer-based random access: can read any field without parsing the entire message.
  • Capabilities are indexed. Cap’n Proto’s RPC protocol assigns capability table indices to interface references in messages. The actual capability (file descriptor, handle, etc.) is transferred out-of-band.
  • Supports multi-segment messages (FIDL is always single-segment).
  • Zero-copy read: can read directly from the wire buffer without deserialization.

Key wire format differences:

PropertyFIDLCap’n Proto
Random accessNo (sequential decode)Yes (pointer-based)
Zero-copy readPartial (decode-on-access for some types)Full (read from buffer)
SegmentsSingle contiguous bufferMulti-segment
PointersImplicit (traversal order)Explicit (relative offsets)
Size overheadSmaller (no pointer words)Larger (pointer section)
Decode costMust validate sequentiallyCan validate lazily
Handle/cap encodingPresence markers + out-of-band arrayCap table indices + out-of-band

FIDL Capability Transfer

FIDL has first-class syntax for capability transfer in protocols:

protocol FileSystem {
    Open(resource struct {
        path string:256;
        flags uint32;
        object server_end:File;
    }) -> ();
};

protocol File {
    Read(struct { count uint64; }) -> (struct { data vector<uint8>:MAX; });
    GetBuffer(struct { flags uint32; }) -> (resource struct { buffer zx.Handle:VMO; });
};
  • server_end:File – a channel endpoint where the server will serve the File protocol. The client creates a channel, keeps the client end, and sends the server end through this call.
  • client_end:File – a channel endpoint for a client of the File protocol.
  • zx.Handle:VMO – a handle to a specific kernel object type (VMO).
  • The resource keyword marks types that contain handles (and thus cannot be copied, only moved).

The FIDL compiler tracks handle ownership: types containing handles are “resource types” with move semantics. This is enforced at the language binding level (e.g., in C++, resource types are move-only; in Rust, they implement Drop but not Clone).

Comparison to capOS’s Cap’n Proto Usage

Cap’n Proto natively supports capability transfer through its interface types:

interface FileSystem {
    open @0 (path :Text, flags :UInt32) -> (file :File);
}

interface File {
    read @0 (count :UInt64) -> (data :Data);
    getBuffer @1 (flags :UInt32) -> (buffer :MemoryObject);
}

In standard Cap’n Proto RPC, file :File in the return type means “a capability to a File interface.” The RPC system assigns a capability table index, transfers it out-of-band, and the receiver gets a live reference to invoke further methods.

Recommendations for capOS:

  1. Use out-of-band capability transfer beside Cap’n Proto payloads. Cap’n Proto RPC has capability descriptors indexed into a capability table, but capOS currently keeps kernel transfer semantics in ring sideband records so the kernel can treat Cap’n Proto payload bytes as opaque. Promise pipelining should build on that sideband result-cap namespace rather than requiring general payload traversal in the kernel.

  2. No need to switch to FIDL. Cap’n Proto’s wire format is superior for capOS’s use case:

    • Random access means runtimes and services can inspect specific fields without full deserialization. The kernel should keep using bounded sideband metadata for transport decisions.
    • Zero-copy read means less allocation in userspace protocol handling.
    • Multi-segment messages allow avoiding large contiguous allocations.
    • Promise pipelining is native to Cap’n Proto RPC, aligning with capOS’s planned async ring chaining.
  3. FIDL’s resource keyword is worth imitating. Mark capnp types that contain capabilities differently from pure-data types. This could be done at the schema level (Cap’n Proto already distinguishes interface fields) or as a convention. This enables the kernel to fast-path messages that contain no capabilities (no need to scan for capability descriptors).

  4. FIDL’s table type for evolution. Cap’n Proto structs already support adding fields, but capOS should be aware that FIDL tables are more explicitly designed for cross-version compatibility. For system interfaces that will evolve, consider using Cap’n Proto groups or designing structs with generous ordinal spacing.

6. Synthesis: Relevance to capOS

Handle Model vs. Typed Capability Dispatch

Zircon’s handle model is untyped at the handle level – a handle is just (object_ref, rights). The type comes from the object. All operations go through fixed syscalls (zx_channel_write, zx_vmo_read, etc.).

capOS’s model is typed at the capability level – each capability implements a Cap’n Proto interface with method dispatch. Operations go through ring SQEs such as CAP_OP_CALL, with Cap’n Proto params and results carried in userspace buffers.

Both are valid. Zircon’s approach is lower overhead (no serialization for simple operations like vmo_read), while capOS’s approach gives uniformity (every operation has the same wire format, enabling persistence and network transparency).

Hybrid recommendation: For performance-critical operations (memory mapping, signal waiting), consider adding “fast-path” syscalls that bypass capnp serialization, similar to how Zircon has dedicated syscalls per object type. The capnp path remains the general mechanism and the “canonical” interface.

Async Rings vs. Ports: The Right Call

capOS’s io_uring-inspired async rings are a better fit than Zircon’s port model for a capability OS:

  1. Ports require separate syscalls for registration, waiting, and the actual operation. Async rings batch everything.
  2. Cap’n Proto’s promise pipelining maps naturally to SQE chaining.
  3. The shared-memory ring design avoids kernel-side queuing overhead.

However, learn from ports:

  • The signal model (each object has a signal set, watchers are notified) is clean and composable. Consider making “wait for signal” a CQ event type.
  • zx_port_queue() (user-initiated packets) is useful for waking up event loops from user code. Support user-initiated CQ entries.

VMO/VMAR vs. capOS Memory Model

capOS should implement VMO-equivalent capabilities after the current Endpoint and transfer baseline:

  • IPC already has shared rings, but bulk data still needs explicit shared memory objects.
  • Capability transfer of memory regions (passing a MemoryObject cap through IPC) is the standard pattern for bulk data transfer.
  • CoW cloning enables efficient process creation.

Proposed capability interfaces:

interface MemoryObject {
    read @0 (offset :UInt64, count :UInt64) -> (data :Data);
    write @1 (offset :UInt64, data :Data) -> ();
    getSize @2 () -> (size :UInt64);
    setSize @3 (size :UInt64) -> ();
    createChild @4 (offset :UInt64, size :UInt64, options :UInt32) -> (child :MemoryObject);
}

interface AddressRegion {
    map @0 (offset :UInt64, vmo :MemoryObject, vmoOffset :UInt64, len :UInt64, flags :UInt32) -> (addr :UInt64);
    unmap @1 (addr :UInt64, len :UInt64) -> ();
    protect @2 (addr :UInt64, len :UInt64, flags :UInt32) -> ();
    allocateSubRegion @3 (offset :UInt64, size :UInt64) -> (region :AddressRegion, addr :UInt64);
}

FIDL vs. Cap’n Proto: Stay with Cap’n Proto

Cap’n Proto is the right choice for capOS. The advantages over FIDL:

  1. Language-independent standard. FIDL is Fuchsia-only. Cap’n Proto has implementations in C++, Rust, Go, Python, Java, etc.
  2. Zero-copy random access. The kernel can inspect message fields without full deserialization.
  3. Promise pipelining. Native to capnp-rpc, enabling the async ring chaining that capOS plans.
  4. Persistence. Cap’n Proto messages are self-describing (with schema) and suitable for on-disk storage – important for capOS’s planned capability persistence.

The one thing FIDL does better: tight integration of handle/capability metadata in the type system (the resource keyword, client_end/server_end syntax, handle type constraints). capOS should ensure its capnp schemas clearly distinguish capability-carrying types and that the kernel enforces capability transfer semantics.

Concrete Action Items for capOS

Ordered by priority and dependency:

  1. Keep typed-interface authority model. Do not add a Zircon-style generic rights bitmask until a concrete method-attenuation need beats narrow wrapper capabilities and transfer-mode metadata.

  2. Handle generation counters. Done: upper bits of CapId detect stale references.

  3. Design MemoryObject/SharedBuffer capability. Define and implement the shared-memory object that replaces raw-frame transfer for bulk IPC.

  4. Design AddressRegion capability (Stage 5). Sub-VMAR-like sandboxing. The root VMAR handle is part of the initial capability set.

  5. Capability transfer sideband. Baseline CALL/RETURN copy and move transfer is implemented; promise-pipelined result-cap mapping still needs a precise rule before pipeline dispatch lands.

  6. Async rings with signal delivery. SQ/CQ capability rings are implemented for transport; notification objects and promise pipelining remain future work.

  7. User-queued CQ entries (with async rings). Allow userspace to post wake-up events to its own CQ, enabling pure-userspace event loop integration.

Appendix: Key Zircon Syscall Reference

For reference, the most architecturally significant Zircon syscalls:

SyscallPurpose
zx_handle_closeClose a handle
zx_handle_duplicateDuplicate with rights reduction
zx_handle_replaceAtomic replace with new rights
zx_channel_createCreate channel pair
zx_channel_readRead message + handles from channel
zx_channel_writeWrite message + handles to channel
zx_channel_callSynchronous write-then-read (RPC)
zx_port_createCreate async port
zx_port_waitWait for next packet
zx_port_queueEnqueue user packet
zx_object_wait_asyncRegister signal wait on port
zx_object_wait_oneSynchronous wait on one object
zx_vmo_createCreate virtual memory object
zx_vmo_read / writeDirect VMO access
zx_vmo_create_childCoW clone
zx_vmar_mapMap VMO into address region
zx_vmar_unmapUnmap
zx_vmar_allocateCreate sub-VMAR
zx_process_createCreate process (with root VMAR)
zx_process_startStart process execution

Used By

Genode OS Framework: Research Report for capOS

Research on Genode’s capability-based component framework, session routing, VFS architecture, and POSIX compatibility – with lessons for capOS.

1. Capability-Based Component Framework

Core Abstraction: RPC Objects

Genode’s fundamental abstraction is the RPC object. Every service in the system is implemented as an RPC object that can be invoked by clients holding a capability to it. The capability is an unforgeable reference – a kernel- protected token that names a specific RPC object and grants the holder the right to invoke its methods.

Genode supports multiple microkernels (NOVA, seL4, Fiasco.OC, a custom base-hw kernel). The capability model is consistent across all of them, though the kernel-level implementation details differ. The framework abstracts kernel capabilities into its own uniform model.

Key properties of Genode capabilities:

  • Unforgeable. A capability can only be obtained by delegation from a holder or creation by the kernel. There is no mechanism to synthesize a capability from an integer or address.
  • Typed. Each capability refers to an RPC object with a specific interface. The C++ type system enforces interface contracts at compile time.
  • Delegatable. A capability holder can pass it to another component via RPC arguments, allowing authority to flow through the system graph.
  • Revocable. Capabilities can be revoked (invalidated). When an RPC object is destroyed, all capabilities pointing to it become invalid.

Capability Types in Genode

Genode distinguishes several kinds of capabilities based on what they refer to:

  1. Session capabilities. The most common type. A session capability refers to a service session – an ongoing relationship between a client and a server. Example: a Log_session capability lets a client write log messages to a specific log session on a LOG server.

  2. Parent capability. Every component holds an implicit capability to its parent. This is the channel through which it requests resources and sessions. The parent capability is never explicitly passed – it’s built into the component framework.

  3. Dataspace capabilities. Represent shared-memory regions. A Ram_dataspace capability grants access to a specific region of physical memory. Dataspaces are the mechanism for bulk data transfer between components (the RPC path is for small messages and control).

  4. Signal capabilities. Used for asynchronous notifications. A signal source produces signals; holders of the signal capability can register handlers. Signals are Genode’s primary async notification mechanism – they don’t carry data, just wake up the receiver.

Sessions: The Service Contract

A session is the central concept of Genode’s inter-component communication. It represents an established relationship between a client component and a server component, with negotiated resource commitments.

Session lifecycle:

  1. Request. A client asks its parent to create a session of a specific type (e.g., Gui::Session, File_system::Session, Nic::Session). The request includes a label string and optional session arguments.

  2. Routing. The parent routes the session request according to its policy (see Section 2). The request may traverse multiple levels of the component tree.

  3. Creation. The server creates a session object, allocates resources for it (e.g., a shared-memory buffer), and returns a session capability to the client.

  4. Use. The client invokes RPC methods on the session capability. The server handles the calls. Both sides can use shared dataspaces for bulk data.

  5. Close. Either side can close the session. Resources committed to the session are released back.

This model is fundamentally different from Unix IPC (anonymous pipes/sockets). Every session is:

  • Typed – the interface is known at compile time.
  • Named – sessions carry a label used for routing and policy.
  • Resource-accounted – the client explicitly donates RAM to the server via a “session quota” to fund the server-side state for this session. This prevents denial-of-service through resource exhaustion.

Resource Trading

Genode’s resource model is unique and worth studying closely. Resources (primarily RAM) flow through the component tree:

  • The kernel grants a fixed RAM budget to core (the root component).
  • Core grants budgets to its children (typically just init).
  • Init grants budgets to its children according to the deployment config.
  • Each component can donate RAM to servers when opening sessions.

The session_quota mechanism works as follows: when a client opens a session, it specifies how much RAM it donates. This RAM transfer goes from the client’s budget to the server’s budget. The server uses this donated RAM to allocate server-side state for the session. When the session closes, the RAM flows back.

This creates a closed accounting system:

  • No component can use more RAM than it was granted.
  • Servers don’t need their own large budgets – clients fund their sessions.
  • Resource exhaustion is contained: a misbehaving client can only exhaust its own budget, not the server’s.

Capability Invocation vs. Delegation

Genode distinguishes two fundamental operations on capabilities:

Invocation: calling an RPC method on the capability. The caller sends a message to the RPC object named by the capability, the server processes it and returns a result. This is synchronous in Genode – the caller blocks until the server replies. (Asynchronous interaction uses signals and shared memory.)

Delegation: passing a capability as an argument in an RPC call. When a capability appears as a parameter or return value, the kernel transfers the capability reference to the receiving component. The receiver now holds an independent reference to the same RPC object. This is how authority propagates through the system.

Example: when a client opens a File_system::Session, the session creation returns a session capability. If the file system server needs to allocate memory, it calls back to the client’s RAM service using a RAM capability that was delegated during session setup.

Capabilities in Genode RPC are transferred by the kernel during the IPC operation – the framework marshals them into a special “capability argument” slot in the IPC message, and the kernel copies the capability reference into the receiver’s capability space. This is transparent to application code: capabilities appear as typed C++ objects in the RPC interface.

2. Session Routing

The Problem Session Routing Solves

In a traditional OS, services are found via well-known names in a global namespace (D-Bus addresses, socket paths, service names). This creates ambient authority – any process can connect to any service if it knows the name.

Genode has no global service namespace. A component can only obtain sessions through its parent. The parent decides which server to route each session request to. This means:

  • Service visibility is controlled structurally.
  • A component can only reach services its parent explicitly allows.
  • Different children of the same parent can be routed to different servers for the same service type.

Parent-Child Relationship

Every Genode component (except core) has exactly one parent. The parent:

  1. Created the child (spawned it with an initial set of resources).
  2. Intercepts all session requests from the child.
  3. Routes requests according to its routing policy.
  4. Can deny requests entirely (the child gets an error).

This creates a tree structure where authority flows downward. A child cannot bypass its parent to reach a service the parent didn’t approve.

Init’s Routing Configuration

The init process (Genode’s init) reads an XML configuration that specifies which services to start and how to route their session requests. This is the core of system policy.

A minimal init config:

<config>
  <parent-provides>
    <service name="LOG"/>
    <service name="ROM"/>
    <service name="CPU"/>
    <service name="RAM"/>
    <service name="PD"/>
  </parent-provides>

  <start name="timer">
    <resource name="RAM" quantum="1M"/>
    <provides> <service name="Timer"/> </provides>
    <route>
      <service name="ROM"> <parent/> </service>
      <service name="LOG"> <parent/> </service>
      <service name="CPU"> <parent/> </service>
      <service name="RAM"> <parent/> </service>
      <service name="PD">  <parent/> </service>
    </route>
  </start>

  <start name="test-log">
    <resource name="RAM" quantum="1M"/>
    <route>
      <service name="Timer"> <child name="timer"/> </service>
      <service name="LOG">   <parent/> </service>
      <!-- remaining services routed to parent by default -->
      <any-service> <parent/> </any-service>
    </route>
  </start>
</config>

Key routing directives:

  • <parent/> – route to the parent (upward delegation).
  • <child name="x"/> – route to a specific child (sibling routing).
  • <any-child/> – route to any child that provides the service.
  • <any-service> – catch-all for unspecified service types.

Label-Based Routing

Labels are strings attached to session requests. They carry context about who is requesting and what they want, enabling fine-grained routing decisions.

When a client requests a session, it attaches a label. As the request traverses the routing tree, each intermediate component (typically init) can prepend its own label. By the time the request reaches the server, the label encodes the full path through the component tree.

Example: a component named my-app inside an init subsystem named apps requests a File_system session with label "data". The composed label arriving at the file system server is: "apps -> my-app -> data".

The server can use this label for:

  • Access control. Grant different permissions based on who is asking.
  • Isolation. Store data in different directories per client.
  • Logging. Identify which component generated a message.

Label-based routing in init config:

<start name="fs">
  <provides> <service name="File_system"/> </provides>
  <route> ... </route>
</start>

<start name="app-a">
  <route>
    <service name="File_system" label="data">
      <child name="fs"/>
    </service>
    <service name="File_system" label="config">
      <child name="config-fs"/>
    </service>
  </route>
</start>

Here, app-a’s file system requests are split: requests labeled "data" go to one server, requests labeled "config" go to another. The application code is unchanged – the routing is entirely a deployment decision.

Routing as Policy

The critical insight is that routing IS access control. There is no separate permission system. If a component’s route config doesn’t include a path to a network service, that component has no network access – period. It cannot discover the network service because it has no way to name it.

This replaces:

  • Firewall rules (routing controls which network services are reachable)
  • File permissions (routing controls which file system sessions are available)
  • Process isolation policies (routing controls everything)

The routing configuration is equivalent to a whitelist of allowed service connections for each component. Adding or removing access means editing the init config, not modifying the component’s code or the server’s access control lists.

Dynamic Routing and Sculpt

In the static case (Genode’s test scenarios), routing is defined once in init’s config. In Sculpt OS (Section 6), the routing configuration can be modified at runtime, allowing users to install applications and connect them to services dynamically.

3. VFS on Top of Capabilities

The VFS Layer

Genode’s VFS (Virtual File System) is a library-level abstraction, not a kernel feature. It provides a path-based file-like interface implemented as a plugin architecture within a component’s address space.

The VFS exists because many existing applications (and libc) expect file-like access patterns. Rather than forcing all code to use Genode’s native session/capability model, the VFS provides a translation layer.

Architecture:

Application code
  |
  |  POSIX: open(), read(), write()
  v
libc (Genode's port of FreeBSD libc)
  |
  |  VFS API: vfs_open(), vfs_read(), vfs_write()
  v
VFS library (in-process)
  |
  |  Plugin dispatch based on mount point
  v
VFS plugins (in-process)
  |
  +--> ram_fs plugin (in-memory file system)
  +--> <fs> plugin (delegates to File_system session)
  +--> <terminal> plugin (delegates to Terminal session)
  +--> <log> plugin (delegates to LOG session)
  +--> <nic> plugin (delegates to Nic session, for socket layer)
  +--> <block> plugin (delegates to Block session)
  +--> <dir> plugin (combines subtrees)
  +--> <tar> plugin (read-only tar archive)
  +--> <import> plugin (populate from ROM)
  +--> <pipe> plugin (in-process pipe pair)
  +--> <rtc> plugin (system clock)
  +--> <zero> plugin (/dev/zero equivalent)
  +--> <null> plugin (/dev/null equivalent)
  ...

VFS Plugin Architecture

Each VFS plugin is a dynamically loadable library (or statically linked module) that implements a file-system-like interface. Plugins handle:

  • open/close – create/destroy file handles
  • read/write – data transfer
  • stat – metadata queries
  • readdir – directory enumeration
  • ioctl – device-specific control (limited)

Plugins are composed by the VFS configuration, which is XML embedded in the component’s config:

<config>
  <vfs>
    <dir name="dev">
      <log/>
      <null/>
      <zero/>
      <terminal name="stdin" label="input"/>
      <inline name="rtc">2024-01-01 00:00</inline>
    </dir>
    <dir name="tmp"> <ram/> </dir>
    <dir name="data"> <fs label="persistent"/> </dir>
    <dir name="socket"> <lxip dhcp="yes"/> </dir>
  </vfs>
  <libc stdout="/dev/log" stderr="/dev/log" stdin="/dev/stdin"
        rtc="/dev/rtc" socket="/socket"/>
</config>

This config creates a virtual filesystem tree:

  • /dev/log – writes go to the LOG session
  • /dev/null, /dev/zero – standard synthetic files
  • /dev/stdin – reads from a Terminal session
  • /tmp/ – in-memory filesystem (RAM-backed)
  • /data/ – delegates to a File_system session labeled “persistent”
  • /socket/ – network sockets via lwIP stack (in-process)

The <fs> plugin is the bridge from VFS to Genode’s capability world. When the application does open("/data/foo.txt"), the <fs> plugin translates this into a File_system::Session RPC call to the external file system server that the component’s routing connects to.

File System Components

Genode has several file system server components:

  • ram_fs – in-memory file system server. Multiple components can share files through it by routing their File_system sessions to it.
  • vfs_server (previously vfs) – a file system server backed by the VFS plugin architecture itself. This enables recursive composition: a VFS server can mount another VFS server.
  • fatfs – FAT file system driver over a Block session.
  • ext2_fs – ext2/3/4 via a ported Linux implementation (rump kernel).
  • store_fs / recall_fs – content-hash-based storage (experimental in some Genode releases).

The file system server is a regular Genode component. It receives a Block session (from a block device driver), provides File_system sessions, and the routing determines who can access what:

block_driver -> provides Block session
       |
       v
fatfs -> consumes Block session, provides File_system session
       |
       v
application -> consumes File_system session via VFS <fs> plugin

Libc Integration

Genode ports a substantial subset of FreeBSD’s libc. The integration point is the VFS: libc’s file operations are implemented by calling the VFS layer, which dispatches to plugins, which invoke Genode sessions as needed.

The libc port modifies FreeBSD libc minimally. Most changes are in the “backend” layer that replaces kernel syscalls with VFS calls:

  • open() -> vfs_open() -> VFS plugin dispatch
  • read() -> vfs_read() -> VFS plugin
  • socket() -> via VFS socket plugin (<lxip> or <lwip>)
  • mmap() -> supported for anonymous mappings and file-backed read-only
  • fork() -> NOT supported (no fork() in Genode)
  • exec() -> NOT supported (no in-place process replacement)
  • pthreads -> supported via Genode’s Thread API
  • select()/poll() -> supported via VFS notification mechanism
  • signal() -> partial support (SIGCHLD, basic signal delivery)

The key architectural decision: libc talks to the VFS library (in-process), the VFS talks to Genode sessions (cross-process RPC). Application code never directly touches Genode capabilities – the VFS mediates everything.

4. POSIX Compatibility

The Noux Approach (Historical)

Genode’s early POSIX approach was Noux, a process runtime that emulated Unix-like process semantics (fork, exec, pipe) on top of Genode. Noux ran as a single Genode component containing multiple “Noux processes” that shared an address space but had separate VFS views.

Noux supported:

  • fork() via copy-on-write within the Noux address space
  • exec() via in-place program replacement
  • pipe() for inter-process communication
  • A shared file system namespace

Noux was eventually deprecated because:

  1. It conflated multiple processes in one address space, undermining Genode’s isolation model.
  2. Fork emulation was fragile and slow.
  3. The libc-based VFS approach (Section 3) achieved better compatibility with less complexity.

Current Approach: libc + VFS

The current POSIX compatibility strategy:

  1. FreeBSD libc port. Provides standard C library functions. Modified to use Genode’s VFS instead of kernel syscalls.

  2. VFS plugins as POSIX backends. Each POSIX I/O pattern maps to a VFS plugin:

    • File I/O -> <fs> plugin -> File_system session
    • Sockets -> <lxip> or <lwip> plugin -> Nic session (in-process TCP/IP stack)
    • Terminal I/O -> <terminal> plugin -> Terminal session
    • Device access -> custom VFS plugins
  3. No fork(). The most significant POSIX omission. Programs that require fork() must be modified to use posix_spawn() or Genode’s native child-spawning mechanism. In practice, many programs use fork() only for daemon patterns or subprocess creation, and can be adapted.

  4. No exec(). Related to no fork(): there’s no in-place process replacement. New processes are created as new Genode components.

  5. Signals. Basic support – enough for SIGCHLD notification and simple signal handling. Complex signal semantics (real-time signals, signal-driven I/O) are not supported.

  6. pthreads. Fully supported via Genode’s native threading.

  7. mmap. Anonymous mappings and read-only file-backed mappings work. MAP_SHARED with write semantics is limited.

What Works in Practice

Genode has successfully ported:

  • Qt5/Qt6 – the full widget toolkit, including QtWebEngine (Chromium). This is the basis of Sculpt’s GUI.
  • VirtualBox – full x86 virtualization (runs Windows, Linux guests).
  • Mesa/Gallium – GPU-accelerated 3D graphics.
  • curl, wget, fetchmail – network utilities.
  • GCC toolchain – compiler, assembler, linker running on Genode.
  • bash – with limitations (no job control via signals, no fork-heavy patterns). Works for simple scripting.
  • vim, nano – terminal editors.
  • OpenSSL/LibreSSL – cryptographic libraries.
  • Various system utilities – ls, cp, rm, etc. via Coreutils port.

Applications that don’t port well:

  • Anything deeply dependent on fork+exec patterns (e.g., traditional Unix shells for complex scripting).
  • Programs relying on procfs, sysfs, or Linux-specific interfaces.
  • Daemons using inotify or Linux-specific async I/O.
  • Programs that assume global file system namespace visibility.

Practical Porting Effort

For most POSIX applications, porting involves:

  1. Build the application using Genode’s ports system (downloads upstream source, applies patches, builds with Genode’s toolchain).
  2. Write a VFS configuration that provides the file-like resources the application expects.
  3. Write a routing configuration that connects the application to required services.
  4. Patch fork() calls if present (usually replacing with posix_spawn() or restructuring to avoid subprocess creation).

The VFS configuration is where the “impedance mismatch” between POSIX expectations and Genode capabilities is resolved. The application thinks it’s accessing /etc/resolv.conf – the VFS plugin infrastructure translates this to capability-mediated access.

5. Component Architecture

Core, Init, and User Components

Core (or base-hw/base-nova/etc.): the lowest-level component, running directly on the microkernel. Core provides the fundamental services: RAM allocation, CPU time (PD sessions), ROM access (boot modules), IRQ delivery, and I/O memory access. Core is the only component with direct hardware access. Everything else goes through core.

Init: the first user-level component, child of core. Init reads its XML configuration and manages the component tree. Init’s responsibilities:

  • Parse <start> entries and spawn components.
  • Route session requests between components according to <route> rules.
  • Manage component lifecycle (restart policies, resource reclamation).
  • Propagate configuration changes (dynamic reconfiguration in Sculpt).

User components: all other components. They can be:

  • Servers that provide sessions (drivers, file systems, network stacks).
  • Clients that consume sessions (applications).
  • Both simultaneously (a network stack consumes NIC sessions and provides socket-level sessions).
  • Sub-inits – components that run their own init-like management for a subtree of components.

Resource Trading in Practice

Resources in Genode flow through the tree. A concrete example:

  1. Core has 256 MB RAM total.
  2. Core grants 250 MB to init, keeps 6 MB for kernel structures.
  3. Init grants 10 MB to the timer driver, 50 MB to the GUI subsystem, 20 MB to the network subsystem, 5 MB to a log server.
  4. When the GUI subsystem starts a framebuffer driver, it donates 8 MB from its 50 MB budget to the driver as a session quota.
  5. The framebuffer driver uses this donated RAM for the frame buffer allocation.

If the GUI subsystem wants more RAM for a new application, it can reclaim RAM by closing sessions (getting donated RAM back) or requesting more from its parent (init).

The accounting is strict: at any point, the sum of all RAM budgets across all components equals the total system RAM. There is no over-commit. This prevents the “OOM killer” problem – each component knows exactly how much RAM it can use.

Practical Component Patterns

Driver components follow a common pattern:

  • Receive: Platform session (for I/O port/memory access), IRQ session
  • Provide: A device-specific session (NIC, Block, GPU, Audio, etc.)
  • Stateless: all per-client state funded by session quota

Multiplexer components:

  • Receive: one instance of a service
  • Provide: multiple instances to clients
  • Example: NIC router receives one NIC session, provides multiple sessions with packet routing between clients

Proxy components:

  • Forward one session type, possibly filtering or transforming
  • Example: nic_bridge, nitpicker (GUI multiplexer), VFS server

Subsystem inits:

  • A component running its own init for a group of related components
  • Isolates the subtree: crash of the subsystem doesn’t affect siblings
  • Example: Sculpt’s drivers subsystem, network subsystem

6. Sculpt OS

What Sculpt Demonstrates

Sculpt OS is Genode’s demonstration desktop operating system. It turns the component framework into a usable system where:

  • Users install and run applications at runtime.
  • Each application runs in its own isolated component with explicitly configured capabilities.
  • A GUI lets users connect applications to services (routing).
  • The entire system is reconfigurable without reboot.

Architecture

Sculpt’s component tree:

core
  |
  init
    |
    +--> drivers subsystem (sub-init)
    |      +--> platform_drv (PCI, IOMMU)
    |      +--> fb_drv (framebuffer)
    |      +--> usb_drv (USB host controller)
    |      +--> wifi_drv (wireless)
    |      +--> ahci_drv (SATA)
    |      +--> nvme_drv (NVMe)
    |      +--> ...
    |
    +--> runtime subsystem (sub-init, user-managed)
    |      +--> (user-installed applications)
    |
    +--> leitzentrale (management GUI)
    |      +--> system shell
    |      +--> config editor
    |
    +--> nitpicker (GUI multiplexer)
    +--> nic_router (network multiplexer)
    +--> ram_fs (shared file system)
    +--> ...

User Experience of Capabilities

In Sculpt, installing an application means:

  1. Download the package (a Genode component archive).
  2. Edit a “deploy” configuration that specifies which services the application can access (routing rules).
  3. The runtime subsystem spawns the component with the specified routing.

A text editor gets: File_system session (to read/write files), GUI session (for display), Terminal session (optionally). It does NOT get: network access, block device access, or access to other applications’ file systems.

A web browser gets: GUI session, Nic session (for network), GPU session (for rendering), File_system session (for downloads). Each service connection is an explicit choice.

The deploy config is the security policy. A user can see exactly what authority each application has, and can change it by editing the config.

Lessons from Sculpt

  1. Capabilities need a management UI. Raw capability graphs are incomprehensible to users. Sculpt provides a GUI that presents service connections in an understandable way (though it’s still oriented toward power users).

  2. Routing is the killer feature. Being able to route the same session type to different servers for different clients is extremely powerful. One application’s “file system” is local storage; another’s is a network share – same code, different routing.

  3. Sub-inits provide failure isolation. The drivers subsystem can crash and restart without affecting applications. Sculpt’s robustness comes from this hierarchical isolation.

  4. Dynamic reconfiguration is essential. A static boot config (like capOS’s current manifest) is fine for servers and embedded systems, but a general-purpose OS needs to add/remove/reconfigure components at runtime.

  5. Package management is a routing problem. Installing an application in Sculpt is not “copy binary to disk” – it’s “add a component to the runtime subsystem with specific routing rules.” The binary is almost secondary to the routing.

  6. POSIX compat through VFS works. Sculpt runs real desktop applications (Qt-based apps, VirtualBox, web browser) using the VFS-mediated POSIX layer. The capability model doesn’t prevent running complex existing software – it just requires explicit service configuration.

7. Relevance to capOS

VFS Capability Design

Genode’s approach: The VFS is an in-process library with a plugin architecture. It mediates between libc/POSIX and Genode sessions. The VFS configuration is per-component XML.

Lessons for capOS:

  1. Don’t put the VFS in the kernel. Genode’s VFS is entirely userspace, which is correct for a capability OS. capOS should do the same – the VFS is a library linked into processes that need POSIX compatibility, not a kernel subsystem.

  2. Plugin model maps well to Cap’n Proto. Each Genode VFS plugin bridges to a specific session type. In capOS, each VFS “backend” would bridge to a specific capability interface:

    Genode VFS plugincapOS VFS backend
    <fs> -> File_system sessionFsBackend -> Namespace + Store caps
    <terminal> -> Terminal sessionTerminalBackend -> Console cap
    <lxip> -> Nic sessionNetBackend -> TcpSocket/UdpSocket caps
    <log> -> LOG sessionLogBackend -> Console cap
    <ram> -> in-process RAMRamBackend -> in-process (no cap needed)
  3. VFS config should be declarative. Rather than hardcoding mount points, capOS processes using libcapos-posix should receive a VFS mount table as part of their initial capability set. This could be a Cap’n Proto struct:

    struct VfsMountTable {
        mounts @0 :List(VfsMount);
    }
    
    struct VfsMount {
        path @0 :Text;           # mount point, e.g. "/data"
        union {
            namespace @1 :Void;  # use the Namespace cap named in capName
            console @2 :Void;    # use a Console cap
            ram @3 :Void;        # in-memory filesystem
            socket @4 :Void;     # socket interface
        }
        capName @5 :Text;        # name of the cap in CapSet backing this mount
    }
    

    This separates the VFS topology (a deployment decision) from the application code (which just calls open()).

  4. Genode’s <fs> plugin is the key analog. capOS’s Namespace capability is equivalent to Genode’s File_system session. The libcapos-posix path resolution layer (open() -> namespace.resolve()) is exactly Genode’s <fs> VFS plugin. The existing capOS design in docs/proposals/userspace-binaries-proposal.md is already on the right track.

  5. Consider streaming for large files. Genode uses shared-memory dataspaces for bulk data transfer in file system sessions. capOS’s current Store interface returns Data (a capnp blob), which means the entire object is copied per get() call. For large files, a streaming interface (with a shared-memory buffer and cursor) would be more efficient. This is capOS’s Open Question #4.

Session Routing Patterns

Genode’s approach: XML-configured routing in init, label-based dispatch, parent mediates all session requests.

Lessons for capOS:

  1. The manifest IS the routing config. capOS’s SystemManifest with structured CapRef source entries such as { service = { service = "net-stack", export = "nic" } } is functionally equivalent to Genode’s init routing config. The capOS design already handles the static case well.

  2. Label-based routing is valuable. Genode’s ability to route different requests from the same client to different servers (based on labels) maps directly to capOS’s capability naming. capOS already does this implicitly – a process can receive separate Namespace caps for “config” and “data”. The key insight is that this should be a deployment-time decision, not an application-time decision.

  3. Consider dynamic routing. capOS’s current manifest is static (baked into the ISO). For a more flexible system, init should support runtime reconfiguration:

    • Reload the manifest from a Store cap.
    • Add/remove services without reboot.
    • Re-route sessions when services restart.

    Genode achieves this via init’s config ROM, which can be updated at runtime. capOS could achieve it by having init watch a Namespace cap for manifest updates.

  4. Parent-mediated routing has costs. In Genode, every session request traverses the component tree. This adds latency and complexity. capOS’s direct capability passing (a process holds a cap directly, not through its parent) avoids this overhead. The tradeoff: capOS has less runtime control over routing (once a cap is passed, the parent can’t intercept invocations on it).

    This is a deliberate design choice. capOS favors direct caps (lower overhead, simpler) over proxied caps (more control). Genode’s session routing is powerful but adds a layer of indirection that may not be worth it for capOS’s use case.

  5. Service export needs a protocol. Genode’s session model has server components explicitly announce what services they provide. capOS’s ProcessHandle.exported() mechanism serves the same purpose. The manifest’s exports field pre-declares what a service will export, which helps init plan the dependency graph before spawning anything.

POSIX Compatibility Without Compromising Capabilities

Genode’s approach: libc port + VFS + per-component VFS config. No global namespace. No fork(). Applications see a curated file tree, not the real system.

Lessons for capOS:

  1. The VFS is a capability adapter, not a capability. The VFS library runs inside the process that needs POSIX compatibility. It doesn’t weaken the capability model because it can only access capabilities the process was granted. This matches capOS’s libcapos-posix design exactly.

  2. musl over FreeBSD libc. Genode uses FreeBSD libc because of its clean backend interface. capOS plans to use musl, which has an even cleaner __syscall() interface. This is a good choice. Genode’s experience shows that the libc implementation matters less than the VFS/backend layer quality.

  3. No fork() is fine. Genode has operated without fork() for over 15 years and runs complex software (Qt, VirtualBox, Chromium). The applications that truly need fork() are rare and usually need only posix_spawn() semantics. capOS should not attempt to implement fork() – focus on posix_spawn() backed by ProcessSpawner cap.

  4. Sockets via in-process TCP/IP stack. Genode’s <lxip> VFS plugin runs an lwIP TCP/IP stack inside the application process, using the NIC session for raw packet I/O. This avoids the overhead of routing every socket call through a separate network stack component.

    capOS could offer a similar choice:

    • Out-of-process: socket calls go to the network stack component via TcpSocket/UdpSocket caps (safer, more isolated, more overhead).
    • In-process: an lwIP/smoltcp library runs inside the application, consuming a raw Nic cap (less isolation, less overhead, more authority).

    For most applications, out-of-process sockets via caps are fine. For high-performance networking (database, web server), an in-process stack over a raw NIC cap may be needed.

  5. select/poll/epoll need async caps. Genode implements select/poll via VFS notifications (signals on file readiness). capOS needs the async capability rings (io_uring-inspired) from Stage 4 before select/poll can work. This is a natural fit: each polled fd maps to a pending capability invocation in the completion ring.

Component Patterns for Cap’n Proto Interfaces

Genode’s patterns and their capOS/Cap’n Proto equivalents:

  1. Session creation = factory method on a capability.

    Genode: client requests a Nic::Session from its parent, which routes to a NIC driver server.

    capOS: client holds a NetworkManager cap and calls create_tcp_socket() to get a TcpSocket cap. The factory pattern is the same, but capOS does it via direct cap invocation instead of parent-mediated session requests.

    Cap’n Proto naturally supports this via interfaces that return interfaces:

    interface NetworkManager {
        createTcpSocket @0 () -> (socket :TcpSocket);
        createUdpSocket @1 () -> (socket :UdpSocket);
        createTcpListener @2 (addr :IpAddress, port :UInt16)
            -> (listener :TcpListener);
    }
    
  2. Resource quotas in session creation.

    Genode: session requests include a RAM quota donated from client to server.

    capOS should consider this pattern. Currently, capOS processes receive a FrameAllocator cap for memory. If a server needs to allocate memory per-client, the client should fund it. Cap’n Proto schema could encode this:

    interface FileSystem {
        open @0 (path :Text, bufferPages :UInt32)
            -> (file :File);
        # bufferPages: number of pages the client donates for
        # server-side buffering. Server allocates from a shared
        # FrameAllocator or the client passes frames explicitly.
    }
    

    This prevents the denial-of-service problem where a client opens many sessions, exhausting the server’s memory.

  3. Multiplexer components.

    Genode: nic_router takes one NIC session, provides many. nitpicker takes one framebuffer, provides many GUI sessions.

    capOS equivalent: a process that consumes a Nic cap and provides multiple TcpSocket/UdpSocket caps. This is already what the network stack component does in capOS’s service architecture proposal. Cap’n Proto’s interface model makes this natural – the multiplexer implements one interface (NetworkManager) using another (Nic).

  4. Attenuation = capability narrowing.

    Genode: servers can return restricted capabilities (e.g., a read-only file handle from a read-write file system session).

    capOS: already planned via Fetch -> HttpEndpoint narrowing, Store -> read-only Store, Namespace -> scoped Namespace. The pattern is sound. Cap’n Proto interfaces make the attenuation explicit in the schema.

  5. Dataspace pattern for bulk data.

    Genode uses shared-memory dataspaces for efficient bulk transfer (file contents, network packets, framebuffers). The RPC path carries only small control messages and capability references.

    capOS currently moves Cap’n Proto control messages through capability rings and bounded kernel scratch, with no zero-copy bulk-data object yet. For bulk data, capOS should add a SharedBuffer capability:

    interface SharedBuffer {
        # Map a shared memory region into caller's address space
        map @0 () -> (addr :UInt64, size :UInt64);
        # Notify that data has been written to the buffer
        signal @1 (offset :UInt64, length :UInt64) -> ();
    }
    

    File system and network operations would use SharedBuffer for data transfer and capability invocations for control, matching Genode’s split between RPC and dataspaces.

  6. Sub-init pattern for failure domains.

    Genode: a sub-init manages a subtree of components. If the subtree crashes, only the sub-init restarts it.

    capOS: a supervisor process (not necessarily init) holds a ProcessSpawner cap and manages a group of services. This is already described in the service architecture proposal’s supervision tree. The key addition from Genode: make sub- supervisors a first-class pattern with their own manifest fragments, not just ad-hoc supervision loops.

Summary of Key Takeaways for capOS

AreaGenode approachcapOS adaptation
Capability modelKernel-enforced caps to RPC objectsKernel-enforced caps to Cap’n Proto objects (aligned)
Service discoveryParent-mediated session routingManifest-driven cap passing at spawn (simpler, less dynamic)
VFSIn-process library with plugin architecturelibcapos-posix with mount table from CapSet (same pattern)
POSIXFreeBSD libc + VFS backendsmusl + libcapos-posix backends (same architecture)
fork()Not supportedNot supported (use posix_spawn -> ProcessSpawner)
Bulk dataShared-memory dataspacesSharedBuffer design exists; implementation pending
Resource accountingSession quotas (RAM donated per session)Authority-accounting design exists; unified ledgers pending
Routing labelsString labels on session requests, routed by initCap naming in manifest serves same purpose
Dynamic reconfigInit config ROM updated at runtimeManifest reload via Store cap (future)
Failure isolationSub-inits as failure domainsSupervisor processes (same concept, different mechanism)
Async notificationSignal capabilitiesAsync cap rings / io_uring model (more general)

Top Recommendations

  1. Add session quotas / resource trading. This is the most important Genode pattern capOS hasn’t adopted yet. Without it, a malicious client can exhaust a server’s memory by opening many capability sessions. Design resource donation into the Cap’n Proto schema for session-creating interfaces.

  2. Design a SharedBuffer capability. Copying capnp messages through the kernel works for control messages but not for bulk data. A shared-memory mechanism (like Genode’s dataspaces) is essential for file I/O, networking, and GPU rendering.

  3. Keep VFS as a library, not a service. Genode’s in-process VFS is the right pattern. capOS’s libcapos-posix should work the same way – a library that translates POSIX calls to capability invocations within the process. No VFS server component needed (though a file system server implementing the Namespace/Store interface is separate).

  4. Add a declarative VFS mount table to process init. Each POSIX-compat process should receive a mount table (as a capnp struct) that maps paths to capabilities. This separates deployment policy from application code, matching Genode’s per-component VFS config.

  5. Plan for dynamic reconfiguration. The static manifest is fine for now, but Sculpt shows that a usable capability OS needs runtime service management. Design init so it can accept manifest updates through a cap, not just from the boot image.

  6. Don’t over-engineer routing. Genode’s parent-mediated session routing is powerful but complex. capOS’s direct capability passing is simpler and sufficient for most use cases. Add proxy/mediator patterns only when specific needs arise (e.g., capability revocation, load balancing).

References

  • Genode Foundations book (genode.org/documentation/genode-foundations/) – the authoritative source for architecture, session model, routing, VFS, and component composition.
  • Norman Feske, “Genode Operating System Framework” (2008-2025) – release notes and design documentation at genode.org.
  • Sculpt OS documentation at genode.org/download/sculpt – practical deployment of the capability model.
  • Genode source repository: github.com/genodelabs/genode – reference implementations of VFS plugins, file system servers, libc port.

Research: Plan 9 from Bell Labs and Inferno OS

Lessons for a capability-based OS using Cap’n Proto wire format.

Table of Contents

  1. Per-Process Namespaces
  2. The 9P Protocol
  3. File-Based vs Capability-Based Interfaces
  4. 9P as IPC
  5. Inferno OS
  6. Relevance to capOS

1. Per-Process Namespaces

Overview

Plan 9’s most significant architectural contribution is per-process namespaces. Every process has its own view of the file hierarchy – not a shared global filesystem tree as in Unix. A process’s namespace is a mapping from path names to file servers (channels to 9P-speaking services). Two processes running on the same machine can see completely different contents at /dev, /net, /proc, or any other path.

Namespaces are inherited by child processes (fork copies the namespace) but can be modified independently afterward. This provides a form of resource isolation that is orthogonal to traditional access control: a process simply cannot name resources that aren’t in its namespace.

The Three Namespace Operations

Plan 9 provides three system calls for namespace manipulation:

bind(name, old, flags) – Takes an existing file or directory name already visible in the namespace and makes it also accessible at path old. This is purely a namespace-level alias – no new file server is involved. The name argument must resolve to something already in the namespace.

Example: bind("#c", "/dev", MREPL) makes the console device (#c is a kernel device designator) appear at /dev. The # prefix addresses kernel devices directly before they have been bound into the namespace.

mount(fd, old, flags, aname) – Like bind, but the source is a file descriptor connected to a 9P server rather than an existing namespace path. The kernel speaks 9P over fd to serve requests for paths under old. The aname parameter selects which file tree the server should export (a single server can serve multiple trees).

Example: mount(fd, "/net", MREPL, "") where fd is a connection to the network stack’s file server, makes the TCP/IP interface appear at /net.

unmount(name, old) – Removes a previous bind or mount from the namespace.

Flags and Union Directories

The flags argument to bind and mount controls how the new binding interacts with existing content at the mount point:

  • MREPL (replace) – The new binding completely replaces whatever was at the mount point. Only the new server’s files are visible.
  • MBEFORE (before) – The new binding is placed before the existing content. When looking up a name, the new binding is searched first. If not found there, the old content is searched.
  • MAFTER (after) – The new binding is placed after the existing content. The old content is searched first.
  • MCREATE – Combined with MBEFORE or MAFTER, controls which component of the union receives create operations.

Union directories are the result of stacking multiple bindings at one mount point. When a directory has multiple bindings, a directory listing returns the union of all names from all components. A lookup walks the bindings in order and returns the first match.

This is how Plan 9 constructs /bin: multiple directories (for different architectures, local overrides, etc.) are union-mounted at /bin. The shell finds commands by simple path lookup – no $PATH variable needed.

bind /rc/bin /bin          # shell built-ins (MAFTER)
bind /386/bin /bin         # architecture binaries (MAFTER)
bind $home/bin/386 /bin    # personal overrides (MBEFORE)

A lookup for /bin/ls searches the personal directory first, then the architecture directory, then the shell builtins – all via a single path.

Namespace Inheritance and Isolation

The rfork system call controls what the child inherits:

  • RFNAMEG – Child gets a copy of the parent’s namespace. Subsequent modifications by either side are independent.
  • RFCNAMEG – Child starts with a clean (empty) namespace.
  • Without either flag, parent and child share the namespace (modifications by one affect the other).

This gives fine-grained control: a shell can construct a restricted namespace for a sandboxed command, or a server can create an isolated namespace for each client connection.

Namespace Construction at Boot

Plan 9’s boot process constructs the initial namespace step by step:

  1. The kernel provides “kernel devices” accessed via # designators: #c (console), #e (environment), #p (proc), #I (IP stack), etc.
  2. The boot script binds these into conventional paths: bind "#c" /dev, bind "#p" /proc, etc.
  3. Network connections mount remote file servers: the CPU server’s file system, the user’s home directory, etc.
  4. Per-user profile scripts further customize the namespace.

The result is that the “standard” file hierarchy is a convention, not a kernel requirement. Any process can rearrange it.

Namespace as Security Boundary

Plan 9 namespaces provide a form of capability-like access control:

  • A process cannot access resources outside its namespace
  • A parent can restrict a child’s namespace before exec
  • There is no way to “escape” a namespace – there is no .. that crosses a mount boundary unexpectedly, and # designators can be restricted

However, this is not a formal capability system:

  • The namespace contains string paths, which are ambient authority within the namespace
  • Any process can open("/dev/cons") if /dev/cons is in its namespace – there is no per-open-call authorization
  • The isolation depends on correct namespace construction, not structural properties

2. The 9P Protocol

Overview

9P (and its updated version 9P2000) is the protocol spoken between clients and file servers. Every resource in Plan 9 is accessed through 9P – local kernel devices, remote file systems, user-space services, and network resources all speak the same protocol.

9P is a request-response protocol with fixed message types. It is connection-oriented: a client establishes a session, authenticates, walks paths to obtain file handles (fids), and then reads/writes through those handles.

Message Types (9P2000)

9P2000 defines the following message pairs (T = request from client, R = response from server):

Session management:

  • Tversion / Rversion – Negotiate protocol version and maximum message size. Must be the first message. The client proposes a version string (e.g., "9P2000") and a msize (maximum message size in bytes). The server responds with the agreed version and msize.
  • Tauth / Rauth – Establish an authentication fid. The client provides a user name and an aname (the file tree to access). The server returns an afid that the client reads/writes to complete an authentication exchange.
  • Tattach / Rattach – Attach to a file tree. The client provides the afid from authentication, a user name, and the aname. The server returns a qid (unique file identifier) for the root of the tree. This fid becomes the client’s handle for the root directory.

Navigation:

  • Twalk / Rwalk – Walk a path from an existing fid. The client provides a starting fid and a sequence of name components (up to 16 per walk). The server returns a new fid pointing to the result and the qids of each intermediate step. Walk is how you traverse directories – there is no open-by-path operation.

File operations:

  • Topen / Ropen – Open an existing file (by fid, obtained via walk). The client specifies a mode (read, write, read-write, exec, truncate). The server returns the qid and an iounit (maximum I/O size for atomic operations).
  • Tcreate / Rcreate – Create a new file in a directory fid. The client specifies name, permissions, and mode.
  • Tread / Rread – Read count bytes at offset from an open fid. The server returns the data.
  • Twrite / Rwrite – Write count bytes at offset to an open fid. The server returns the number of bytes actually written.
  • Tclunk / Rclunk – Release a fid. The server frees associated state. Equivalent to close().
  • Tremove / Rremove – Remove the file referenced by a fid and clunk the fid.
  • Tstat / Rstat – Get file metadata (name, size, permissions, access times, qid, etc.).
  • Twstat / Rwstat – Modify file metadata.

Error handling:

  • Rerror – Any T-message can receive an Rerror instead of its normal response. Contains a text error string (9P2000) or an error number (9P2000.u).

Message Format

Every 9P message starts with a 4-byte length (little-endian, including the length field itself), a 1-byte type, and a 2-byte tag. The tag is chosen by the client and echoed in the response, enabling multiplexed operations over a single connection.

[4 bytes: size][1 byte: type][2 bytes: tag][... type-specific fields ...]

Field types are simple: 1/2/4/8-byte integers (little-endian), counted strings (2-byte length prefix + UTF-8), and counted data blobs (4-byte length prefix + raw bytes).

Qids and File Identity

A qid is a server-assigned 13-byte file identifier:

[1 byte: type][4 bytes: version][8 bytes: path]
  • type – Bits indicating directory, append-only, exclusive-use, authentication file, etc.
  • version – Incremented when the file is modified. The client can detect changes by comparing versions.
  • path – A unique identifier for the file within the server. Typically a hash or inode number.

Qids allow clients to detect file identity (same path through different walks = same qid) and staleness (version changed = re-read needed).

Authentication

9P2000 authentication is pluggable. The protocol provides the Tauth/Rauth mechanism to establish an authentication fid, but the actual authentication exchange happens by reading and writing this fid – the protocol itself is agnostic to the authentication method.

Plan 9’s standard mechanism is p9sk1, a shared-secret protocol using an authentication server. The flow:

  1. Client sends Tauth to get an afid
  2. Client and server exchange challenge-response messages by reading/writing the afid, mediated by the authentication server
  3. Once authentication succeeds, the client uses the afid in Tattach

The key insight: authentication is just another read/write conversation over a special fid. New authentication methods can be implemented without changing the protocol.

Concurrency

9P supports concurrent operations through tags. A client can send multiple T-messages without waiting for responses. Each has a unique tag, and the server can respond out of order. The client matches responses to requests by tag.

A special tag value NOTAG (0xFFFF) is used for Tversion, which must complete before any other messages.

The OEXCL open mode provides exclusive access to a file – only one client can open it at a time. This is used for locking (e.g., the #l lock device in some Plan 9 variants).

Fids are per-connection, not global. Different clients on different connections have independent fid spaces. A server maintains per-connection state.

Maximum Message Size

The msize negotiated in Tversion bounds all subsequent messages. A typical default is 8192 or 65536 bytes. The iounit returned by Topen tells the client the maximum useful count for read/write on that fid, which may be less than msize minus the message header overhead.

This bounding is important for resource management – a server can limit memory consumption per connection.


3. File-Based vs Capability-Based Interfaces

Plan 9: Everything is a File

Plan 9 takes Unix’s “everything is a file” philosophy further than Unix itself ever did:

  • Network stack – TCP connections are managed by reading/writing files in /net/tcp: clone (allocate a connection), ctl (write commands like connect 10.0.0.1!80), data (read/write payload), status (read connection state).
  • Window system – The rio window manager exports a file system: each window has a cons, mouse, winname, etc. A program draws by writing to /dev/draw/*.
  • Process control/proc/<pid>/ contains ctl (write kill to signal), status (read state), mem (read/write process memory), text (read executable), note (signals).
  • Hardware devices – Kernel devices export file interfaces directly. The audio device is files, the graphics framebuffer is files, etc.

The interface contract is: open a file, read/write bytes, stat for metadata. The semantics of those bytes are defined by the file server – there is no ioctl().

Strengths of the file model:

  • Universal tools work everywhere: cat /net/tcp/0/status, echo kill > /proc/1234/ctl
  • Shell scripts can compose services trivially
  • Network transparency is automatic: mount a remote file server, same tools work
  • The interface is self-documenting: ls shows available operations
  • Simple tools like cat, echo, grep become universal adapters

Weaknesses of the file model:

  • Type erasure. Everything is bytes. The protocol cannot express structured data without conventions layered on top (text formats, fixed layouts, etc.). A read() returns raw bytes – the client must know the expected format.
  • Limited operation set. The only verbs are open, read, write, stat, create, remove. Complex operations must be encoded as write-command / read-response sequences (e.g., echo "connect 10.0.0.1!80" > /net/tcp/0/ctl). Error handling is ad-hoc.
  • No schema or type checking. Nothing prevents writing garbage to a ctl file. Errors are detected at runtime, often with cryptic messages.
  • No structured errors. 9P errors are text strings. No error codes, no machine-parseable error metadata.
  • Byte-stream orientation. 9P read/write are offset-based byte operations. This fits files naturally but is awkward for RPC-style request/response interactions. File servers work around this with conventions (write a command, read the response from offset 0).
  • No pipelining of operations. You cannot say “open this file, then read it, and if that succeeds, write to this other file” atomically. Each step is a separate round-trip (though 9P’s tag multiplexing helps amortize latency).

Capability Systems: Everything is a Typed Interface

In a capability system like capOS, resources are accessed through typed interface references:

interface Console {
    write @0 (data :Data) -> ();
    writeLine @1 (text :Text) -> ();
}

interface NetworkManager {
    createTcpSocket @0 (addr :Text, port :UInt16) -> (socket :TcpSocket);
}

interface TcpSocket {
    read @0 (count :UInt32) -> (data :Data);
    write @1 (data :Data) -> (written :UInt32);
    close @2 () -> ();
}

Strengths of the capability model:

  • Type safety. The interface contract is machine-checked. You cannot call write on a NetworkManager – the type system prevents it.
  • Rich operations. Interfaces can define arbitrary methods with typed parameters and return values. No need to encode everything as byte read/writes.
  • Structured errors. Return types can include error variants. Capabilities can define error enums in the schema.
  • Schema evolution. Cap’n Proto supports backwards-compatible schema changes (adding fields, adding methods). Both old and new clients/servers interoperate.
  • No ambient authority. A process has precisely the capabilities it was granted. No path-based discovery, no /proc to enumerate.
  • Attenuation. A broad capability can be narrowed to a restricted version (e.g., Fetch -> HttpEndpoint). The restriction is structural, not a permission check.

Weaknesses of the capability model:

  • No universal tools. cat and echo do not work on capabilities. Each interface needs its own client tool or library. Debugging requires interface-aware tools.
  • Harder composition. Shell pipes compose byte streams trivially. Capability composition requires typed adapters or a capability-aware shell.
  • Discovery problem. ls shows files. What shows capabilities? A management-only CapabilityManager.list() call, but that requires holding the manager cap and a tool that can render the result.
  • Steeper learning curve. A new developer can ls /net to understand the network stack. Understanding a capability interface requires reading the schema definition.
  • Verbosity. Opening a TCP connection in Plan 9 is four file operations (clone, ctl, data, status). In a capability system, it is one typed method call. But defining the interface in the schema is more upfront work than just exporting files.

Synthesis

The file model and the capability model are not opposed – they are different points on a trade-off curve between universality and type safety. Plan 9 chose maximal universality (everything reduces to bytes + paths). Capability systems choose maximal type safety (everything has a schema).

The interesting question is whether a capability system can recover the ergonomic benefits of the file model while maintaining type safety. This is addressed in section 6.


4. 9P as IPC

File Servers as Services

In Plan 9, a “service” is simply a process that speaks 9P. When a client mounts a file server’s connection at some path, all file operations on that path become 9P messages to the server. This is the universal IPC mechanism – there are no Unix-domain sockets, no D-Bus, no shared memory primitives for service communication. Everything goes through 9P.

Examples of services-as-file-servers:

  • exportfs – Re-exports a subtree of the current namespace over a network connection, letting remote clients mount it.
  • ramfs – A RAM-backed file server. Mount it and you have a tmpfs.
  • ftpfs – Mounts a remote FTP server as a local directory. Programs read/write files; the file server translates to FTP protocol.
  • mailfs – Presents a mail spool as a directory of messages. Each message is a directory with header, body, rawbody, etc.
  • plumber – The inter-application message router exports a file interface: write a message to /mnt/plumb/send, and it arrives in the target application’s plumb port.
  • acme – The Acme editor exports its entire UI as a file system: windows, buffers, tags, event streams. External programs can control Acme by reading/writing these files.

The srv Device and Connection Passing

The kernel #s (srv) device provides a namespace for posting file descriptors. A server process creates a pipe, starts serving 9P on one end, and posts the other end as /srv/myservice. Other processes open /srv/myservice to get a connection to the server, then mount it into their namespace.

# Server side:
pipe = pipe()
post(pipe[0], "/srv/myfs")
serve_9p(pipe[1])

# Client side:
fd = open("/srv/myfs", ORDWR)
mount(fd, "/mnt/myfs", MREPL, "")
# Now /mnt/myfs/* are served by the server process

This decouples service registration from namespace mounting. Multiple clients can mount the same service at different paths in their own namespaces.

Performance and Overhead

9P’s overhead compared to direct function calls or shared memory:

  1. Serialization – Every operation is a 9P message: header parsing, field encoding/decoding. Messages are simple binary (not XML/JSON), so this is fast but nonzero.
  2. Copying – Data passes through the kernel (pipe or network): user buffer -> kernel pipe buffer -> server process buffer (and back for responses). This is at least two copies per direction.
  3. Context switches – Each request/response is a write (client) + read (server) + write (server) + read (client) = four context switches for a round-trip.
  4. No zero-copy – 9P does not support shared memory or page remapping. Large data transfers pay the full copy cost.

For metadata-heavy operations (stat, walk, open/close), the overhead is dominated by context switches, not data copying. Plan 9 is designed for networks where latency matters – the protocol’s simplicity and multiplexability help here.

For bulk data, the overhead is significant. Plan 9 compensates somewhat with the iounit mechanism (encouraging large reads/writes to amortize per-call costs) and the fact that most I/O is streaming (sequential reads/writes, not random access).

In practice, Plan 9 systems are not optimized for raw throughput on local IPC. The design prioritizes simplicity and network transparency over local performance. The assumption is that the network is the bottleneck, so local protocol overhead is acceptable.

Network Transparency

9P’s power lies in its network transparency. The same protocol runs over:

  • Pipes – Local IPC between processes on the same machine.
  • TCP connections – Remote file access across the network.
  • Serial lines – Early Plan 9 terminals connected to CPU servers.
  • TLS/SSL – Encrypted connections (added later).

A CPU server is accessed by mounting its file system over the network. The Plan 9 cpu command:

  1. Connects to a remote CPU server over TCP
  2. Authenticates
  3. Exports the local namespace (via exportfs) to the remote side
  4. The remote side mounts the local namespace, overlaying it with its own kernel devices
  5. A shell runs on the remote CPU, but with access to local files

The result: you work on the remote machine but your files, windows, and devices are local. This is more powerful than SSH because the integration is at the namespace level, not the terminal level.

Factoid: In the Plan 9 computing model, terminals were intentionally underpowered. The expensive hardware was the CPU server. Users mounted the CPU server’s filesystem and ran programs there, with the terminal providing I/O devices (keyboard, mouse, display) exported as files back to the CPU server.


5. Inferno OS

What Inferno Adds Beyond Plan 9

Inferno (also from Bell Labs, originally by the same team) took the Plan 9 architecture and adapted it for portable, networked computing. It can run as a native OS on bare hardware, as a hosted application on other OSes (Linux, Windows, macOS), or as a virtual machine.

Key additions and differences:

  1. Dis virtual machine – All user-space code runs on a register-based VM, not native machine code.
  2. Limbo language – A type-safe, garbage-collected, concurrent language (influenced Plan 9 C, CSP, Newsqueak, and Alef). All applications are written in Limbo.
  3. Styx protocol – Inferno’s name for its 9P variant (functionally identical to 9P2000 with minor encoding differences in early versions, later fully aligned with 9P2000).
  4. Portable execution – The same Limbo bytecode runs on any platform where the Dis VM is available. No recompilation needed.
  5. Built-in cryptography – TLS, certificate-based authentication, and signed modules are integrated into the system, not bolted on.

The Dis Virtual Machine

Dis is a register-based virtual machine (unlike the JVM, which is stack-based). Key characteristics:

  • Memory model – Dis uses a module-based memory model. Each loaded module has its own data segment (frame). Instructions reference memory operands by offset within the current module’s frame, the current function’s frame, or a literal (mp, fp, or immediate addressing).
  • Instruction set – CISC-inspired, with three-address instructions: add src1, src2, dst. Opcodes cover arithmetic, comparison, branching, string operations, channel operations, and system calls. Around 80-90 opcodes.
  • Type descriptors – Each allocated block has a type descriptor that identifies which words are pointers. This enables exact garbage collection (no conservative scanning).
  • Garbage collection – Reference counting with cycle detection. Deterministic deallocation for acyclic structures (important for resource management), with periodic cycle collection.
  • Module loading – Dis modules are loaded on demand. A module declares its type signature (exported functions and their types), and the loader verifies type compatibility at link time.
  • JIT compilation – On supported architectures (x86, ARM, MIPS, SPARC, PowerPC), Dis bytecode is compiled to native code at load time. This removes the interpretation overhead for hot code.
  • Concurrency – Dis natively supports concurrent threads of execution within a module. Threads communicate via typed channels (from CSP/Limbo).

The Limbo Language

Limbo is Inferno’s application language. Its design reflects the system’s values:

  • Type-safe – No pointer arithmetic, no unchecked casts, no buffer overflows. The type system is enforced at compile time and verified at module load time.
  • Garbage collected – Programmers do not manage memory. Reference counting provides deterministic resource cleanup.
  • Concurrent – First-class chan types (typed channels) and spawn for creating threads. This is CSP-style concurrency, predating (and influencing) Go’s goroutines and channels.
  • Module system – Modules declare interfaces (like header files with type signatures). A module imports another module’s interface, and the runtime verifies type compatibility at load time.
  • ADTs – Algebraic data types with pick (tagged unions). Pattern matching over variants.
  • Tuples – First-class tuple types for returning multiple values.
  • No inheritance – Limbo has ADTs and modules, not objects and classes.

Example – a simple file server in Limbo:

implement Echo;

include "sys.m";
include "draw.m";
include "styx.m";

sys: Sys;

Echo: module {
    init: fn(nil: ref Draw->Context, argv: list of string);
};

init(nil: ref Draw->Context, argv: list of string)
{
    sys = load Sys Sys->PATH;
    # ... set up Styx server, handle read/write on echo file
}

Limbo and the Namespace Model

Limbo programs interact with the namespace through the Sys module’s file operations (open, read, write, mount, bind, etc.) – the same operations as in Plan 9. The namespace model is identical:

  • Each process group has its own namespace
  • bind and mount manipulate the namespace
  • File servers (Styx servers) provide services
  • Union directories compose multiple servers

The difference is that Limbo’s type safety extends to the file descriptors and channels used to communicate. A Sys->FD is a reference type, not a raw integer. You cannot fabricate a file descriptor from nothing.

Limbo’s channel type (chan of T) provides typed communication between concurrent threads within a process. Channels are a local IPC mechanism complementary to Styx, which handles inter-process and inter-machine communication.

Styx (Inferno’s 9P)

Styx is Inferno’s name for the 9P2000 protocol. In the current version of Inferno, Styx and 9P2000 are wire-compatible – the same byte format, the same message types, the same semantics. The renaming reflects Inferno’s origin as a commercial product from Vita Nuova (and before that, Lucent Technologies) with its own branding.

The Inferno kernel includes a Styx library (Styx and Styxservers modules) that makes implementing file servers straightforward in Limbo. The Styxservers module provides a framework: you implement a navigator (for walk/stat) and a file handler (for read/write), and the framework handles the protocol boilerplate.

include "styx.m";
include "styxservers.m";

styx: Styx;
styxservers: Styxservers;

Srv: adt {
    # ... file tree definition
};

# The framework calls navigator.walk(), navigator.stat() for metadata
# and file.read(), file.write() for data operations.

Inferno also provides the 9srvfs utility for mounting external 9P servers and the mount command for attaching Styx servers to the namespace – the same patterns as Plan 9.

Security Model

Inferno’s security model builds on namespaces with additional mechanisms:

  • Signed modules – Dis modules can be cryptographically signed. The loader can verify signatures before executing code.
  • Certificate-based authentication – Inferno uses a certificate infrastructure (not Kerberos like Plan 9) for authenticating connections.
  • Namespace restriction – The wm/sh shell and other supervisory programs can construct restricted namespaces for untrusted code.
  • Type safety as security – Since Limbo prevents pointer forgery and buffer overflows, type safety is a security boundary. A Limbo program cannot escape its type system to forge file descriptors or access arbitrary memory.

6. Relevance to capOS

6.1 Namespace Composition via Capabilities

Plan 9 lesson: Per-process namespaces are a powerful isolation and composition mechanism. A process’s “view of the world” is constructed by its parent through bind/mount operations. The child cannot escape this view.

capOS parallel: Per-process capability tables serve an analogous role. A process’s “view of the world” is its set of granted capabilities. The child cannot discover or access capabilities outside its table.

What capOS could adopt:

The existing Namespace interface in the storage proposal (docs/proposals/storage-and-naming-proposal.md) already captures some of this – resolve, bind, list, and sub provide name-to-capability mappings. But Plan 9’s namespace model suggests a more dynamic composition pattern:

interface Namespace {
    # Resolve a name to a capability reference
    resolve @0 (name :Text) -> (capId :UInt32, interfaceId :UInt64);

    # Bind a capability at a name in this namespace
    bind @1 (name :Text, capId :UInt32) -> ();

    # Create a union: multiple capabilities behind one name
    union @2 (name :Text, capId :UInt32, position :UnionPosition) -> ();

    # List available names
    list @3 () -> (entries :List(NamespaceEntry));

    # Get a restricted sub-namespace
    sub @4 (prefix :Text) -> (ns :Namespace);
}

enum UnionPosition {
    before @0;   # searched first (like Plan 9 MBEFORE)
    after @1;    # searched last (like Plan 9 MAFTER)
    replace @2;  # replaces existing (like Plan 9 MREPL)
}

struct NamespaceEntry {
    name @0 :Text;
    interfaceId @1 :UInt64;
    label @2 :Text;
}

The key insight from Plan 9 is union composition – multiple capabilities can be bound at the same name, searched in order. This is useful for overlay patterns: a local cache capability layered before a remote store capability, or a per-user config namespace layered before a system-wide default.

Differences from Plan 9:

Plan 9 namespaces map names to file servers. capOS namespaces map names to typed capabilities. The advantage: capOS can verify at bind time that the capability matches the expected interface. Plan 9 cannot – you mount a file server and discover at runtime whether it exports the files you expect.

6.2 Cap’n Proto RPC vs 9P

Protocol comparison:

Aspect9P2000Cap’n Proto RPC
Message formatFixed binary fields, counted strings/dataCapnp wire format (pointer-based, zero-copy decode)
OperationsFixed set (walk, open, read, write, stat, …)Arbitrary per-interface (schema-defined methods)
TypingUntyped bytesStrongly typed (schema-checked)
MultiplexingTag-based (16-bit tags)Question ID-based (32-bit)
PipeliningNot supported (each op is independent)Promise pipelining (call method on not-yet-returned result)
AuthenticationPluggable via auth fidApplication-level (not protocol-specified)
CapabilitiesNo (file fids are unforgeable handles, but no transfer/attenuation)Native capability passing and attenuation
Maximum messageNegotiated msizeNo inherent limit (segmented messages)
Schema evolutionN/A (fixed protocol)Forward/backward compatible schema changes
Network transparencyNative design goalNative design goal

Key differences for capOS:

  1. Promise pipelining – This is capnp RPC’s strongest advantage over 9P. In 9P, opening a TCP connection requires: walk to /net/tcp -> walk to clone -> open clone -> read (get connection number) -> walk to ctl -> open ctl -> write “connect …” -> walk to data -> open data. Eight round-trips minimum. With capnp pipelining: net.createTcpSocket("10.0.0.1", 80) returns a promise, and you can immediately call .write(data) on the promise – the runtime chains the calls without waiting for the first to complete. One logical round-trip.

  2. Typed interfaces – 9P’s strength is that cat works on any file. Capnp’s strength is that the compiler catches console.allocFrame() at compile time. capOS should not try to make everything a “file” – typed interfaces are the right abstraction for a capability system. But a FileServer capability interface could provide Plan 9-like flexibility where needed (see below).

  3. Capability passing – 9P has no way to pass a fid through a file server to a third party. (The srv device is a workaround, not a protocol feature.) Capnp RPC natively supports passing capability references in messages. This is fundamental to capOS’s model.

6.3 File Server Pattern as a Capability

Plan 9’s file server pattern is useful and should not be discarded just because capOS is capability-based. Instead, define a generic FileServer capability interface:

interface FileServer {
    walk @0 (names :List(Text)) -> (fid :FileFid);
    list @1 (fid :FileFid) -> (entries :List(DirEntry));
}

interface FileFid {
    open @0 (mode :OpenMode) -> (iounit :UInt32);
    read @1 (offset :UInt64, count :UInt32) -> (data :Data);
    write @2 (offset :UInt64, data :Data) -> (written :UInt32);
    stat @3 () -> (info :FileInfo);
    close @4 () -> ();
}

A FileServer capability enables:

  • /proc-like introspection – A debugging service exports process state as a file tree. Tools read files to inspect state.
  • Config storage – A configuration namespace can be exposed as files for tools that work with text.
  • POSIX compatibility – The POSIX shim layer maps open()/read()/ write() to FileServer capability calls.
  • Shell scripting – A capability-aware shell could mount FileServer caps and use cat/echo-style tools on them.

The point: FileServer is one capability interface among many. It is not the universal abstraction (as in Plan 9), but it is available where the file metaphor is natural.

6.4 IPC Lessons

Plan 9 lesson: 9P works as universal IPC because the protocol is simple and the kernel handles the plumbing (mount, pipe, network). The cost is per-message overhead (copies, context switches).

capOS implications:

  1. Minimize copies. 9P’s two-copies-per-direction (user -> kernel pipe buffer -> server) is acceptable for networks but expensive for local IPC. capOS should investigate shared-memory regions for bulk data transfer between co-located processes, with capnp messages as the control plane. The roadmap’s io_uring-inspired submission/completion rings already point in this direction.

  2. Direct context switch. The L4/seL4 IPC fast-path (direct switch from caller to callee without choosing an unrelated runnable process) now exists as a baseline for blocked Endpoint receivers. Plan 9 does not do this – every 9P round-trip goes through the kernel’s pipe/network layer. capOS can tune this further because capability calls have a known target process.

  3. Batching. Plan 9 mitigates round-trip costs through large reads/ writes (the iounit mechanism). Capnp’s promise pipelining is the typed equivalent – batch multiple logical operations into a dependency chain that executes without intermediate round-trips.

6.5 Inferno Lessons

Dis VM / type safety: Inferno’s bet on a managed runtime (Dis + Limbo) gives it type safety as a security boundary. capOS, being written in Rust for kernel code and targeting native binaries, does not have this luxury for arbitrary user-space code. However:

  • WASI support (on the roadmap) provides a sandboxed execution environment with type-checked interfaces, similar in spirit to Dis.
  • Cap’n Proto schemas provide interface-level type safety even for native code. The schema is the contract, enforced at message boundaries.

Channel-based concurrency: Limbo’s chan of T type is a local IPC mechanism within a process. capOS does not currently have this (it relies on kernel-mediated capability calls for all IPC). For in-process threading (on the roadmap), typed channels between threads could be useful – implemented as a library on top of shared memory + futex, without kernel involvement.

Portable execution: Inferno’s ability to run the same bytecode everywhere is appealing but orthogonal to capOS’s goals. The WASI runtime item on the roadmap serves this purpose for capOS.

6.6 Concrete Recommendations

Based on this research, the following items are most relevant to capOS development:

  1. Add a Namespace capability with union semantics. Extend the existing Namespace design (from the storage proposal) with Plan 9-style union composition (before/after/replace). This enables overlay patterns for configuration, caching, and modularity.

  2. Implement a FileServer capability interface. Not as the universal abstraction, but as one interface for resources that are naturally file-like (config trees, debug introspection, POSIX compatibility). A FileServer cap is just another capability – no special kernel support needed.

  3. Prioritize promise pipelining. This is capnp’s killer feature over 9P and the biggest performance advantage for IPC-heavy workloads. Multiple logical operations collapse into one network/IPC round-trip. Async rings are in place; the remaining work is the Stage 6 pipeline dependency/result-cap mapping rule.

  4. Plan 9-style namespace construction in init. The boot manifest already describes which capabilities each service receives. Consider adding namespace-level composition to the manifest: “this service sees capability X as data/primary and capability Y as data/cache, with cache searched first” – union directory semantics expressed in capability terms.

  5. Study 9P’s exportfs pattern for network transparency. Plan 9’s exportfs re-exports a namespace subtree over the network. The capOS equivalent would be a proxy service that takes a set of local capabilities and makes them available as capnp RPC endpoints on the network. This is the “network transparency” roadmap item – 9P’s design proves it is achievable, and capnp’s richer type system makes it more robust.

  6. Do not replicate 9P’s weaknesses. The untyped byte-stream interface, the lack of structured errors, and the fixed operation set are 9P’s costs for universality. capOS pays none of these costs with Cap’n Proto. The temptation to “make everything a file for simplicity” should be resisted – typed capabilities are strictly more powerful, and the FileServer interface provides the file metaphor where needed without compromising the rest of the system.


Summary

Plan 9 / Inferno ConceptcapOS EquivalentGap / Action
Per-process namespace (bind/mount)Per-process capability tableAdd Namespace cap with union semantics
9P protocol (file operations)Cap’n Proto RPC (typed method calls)capnp is strictly superior for typed IPC; FileServer cap provides file semantics where needed
Union directoriesNo current equivalentAdd union composition to Namespace interface
File servers as servicesCapability-implementing processesAlready the model; manifest-driven service graph is close to Plan 9’s boot namespace construction
Network transparency via 9PNetwork transparency via capnp RPCSame goal, capnp adds promise pipelining and typed interfaces
exportfs (namespace re-export)Capability proxy serviceNot yet designed; high-value future work
Styx/9P as universal IPCCapnp messages as universal IPCAlready the model; prioritize fast-path and pipelining
Dis VM (portable, type-safe execution)WASI runtime (roadmap)Same goal, different mechanism
Limbo channels (typed local IPC)Not yet presentConsider for in-process threading
Authentication via auth fidNot yet designedCap’n Proto RPC has no built-in auth; needs design

References

  • Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, Phil Winterbottom. “Plan 9 from Bell Labs.” Computing Systems, Vol. 8, No. 3, Summer 1995, pp. 221-254.
  • Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, Phil Winterbottom. “The Use of Name Spaces in Plan 9.” Operating Systems Review, Vol. 27, No. 2, April 1993, pp. 72-76.
  • Plan 9 Manual: intro(1), bind(1), mount(1), intro(5) (the 9P manual section).
  • Russ Cox, Eric Grosse, Rob Pike, Dave Presotto, Sean Quinlan. “Security in Plan 9.” USENIX Security 2002.
  • Sean Dorward, Rob Pike, Dave Presotto, Dennis Ritchie, Howard Trickey, Phil Winterbottom. “The Inferno Operating System.” Bell Labs Technical Journal, Vol. 2, No. 1, Winter 1997.
  • Phil Winterbottom, Rob Pike. “The Design of the Inferno Virtual Machine.” Bell Labs, 1997.
  • Vita Nuova. “The Dis Virtual Machine Specification.” 2003.
  • Vita Nuova. “The Limbo Programming Language.” 2003.
  • Sape Mullender (editor). “The 9P2000 Protocol.” Plan 9 manual, section 5 (intro(5)).
  • Kenichi Okada. “9P Resource Sharing Protocol.” IETF Internet-Draft, 2010.

Research: EROS, CapROS, and Coyotos

Deep analysis of persistent capability operating systems and their relevance to capOS.

1. EROS (Extremely Reliable Operating System)

1.1 Overview

EROS was designed and implemented by Jonathan Shapiro and collaborators at the University of Pennsylvania, starting in the late 1990s. It is a pure capability system descended from KeyKOS (developed at Key Logic in the 1980s). EROS’s defining feature is orthogonal persistence: the entire system state – processes, memory, capabilities – is transparently persistent. There is no distinction between “in memory” and “on disk.”

Key papers:

  • Shapiro, J. S., Smith, J. M., & Farber, D. J. “EROS: A Fast Capability System” (SOSP 1999)
  • Shapiro, J. S. “EROS: A Capability System” (PhD dissertation, 1999)
  • Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism” (IEEE S&P 2000)

1.2 The Single-Level Store

In a conventional OS, memory and storage are separate address spaces with different APIs (read/write vs mmap/file I/O). The programmer is responsible for explicitly loading data from disk into memory, modifying it, and writing it back. This creates an impedance mismatch that is the source of enormous complexity (serialization, caching, crash consistency, etc.).

EROS eliminates this distinction with a single-level store:

  • All objects (processes, memory pages, capability nodes) exist in a unified persistent object space.
  • There is no “file system” and no “load/save.” Objects simply exist.
  • The system periodically checkpoints the entire state to disk. Between checkpoints, modified pages are held in memory. After a crash, the system restores to the last consistent checkpoint.
  • From the application’s perspective, memory IS storage. There is no API for persistence – it happens automatically.

The single-level store in EROS operates on two primitive object types:

  1. Pages – 4KB data pages (the equivalent of both memory pages and file blocks).
  2. Nodes – 32-slot capability containers (the equivalent of both process state and directory entries).

Every page and node has a persistent identity (an Object ID, or OID). The kernel maintains an in-memory object cache and demand-pages objects from disk as needed. Modified objects are written back during checkpoints.

1.3 Checkpoint/Restart

EROS uses a consistent checkpoint mechanism inspired by KeyKOS:

How it works:

  1. The kernel periodically initiates a checkpoint (KeyKOS used a 5-minute interval; EROS used a configurable interval, typically seconds to minutes).
  2. All processes are momentarily frozen.
  3. The kernel snapshots the current state:
    • All dirty pages are marked for write-back.
    • All node state (capability tables, process descriptors) is serialized.
    • A consistent snapshot of the entire system is captured.
  4. Processes resume immediately – they continue modifying their own copies of pages (copy-on-write semantics ensure the checkpoint image is stable while new modifications accumulate).
  5. The snapshot is written to disk asynchronously while processes continue running.
  6. Once the write completes, the checkpoint is atomically committed (a checkpoint header on disk is updated).

What state is captured:

  • All memory pages (dirty pages since last checkpoint).
  • All nodes (capability slots, process registers, scheduling state).
  • The kernel’s object table (mapping OIDs to disk locations).
  • The capability graph (which process holds which capabilities).

Recovery after crash:

  • On boot, the kernel reads the last committed checkpoint header.
  • The system resumes from that exact state. All processes continue as if nothing happened (they may have lost a few seconds of work since the last checkpoint).
  • No fsck, no journal replay, no application-level recovery logic.

Performance characteristics:

  • Checkpoint cost is proportional to the number of dirty pages since the last checkpoint, not total system size.
  • Copy-on-write minimizes pause time – processes are frozen only long enough to mark pages, not to write them.
  • EROS achieved checkpoint times of a few milliseconds for the freeze phase, with asynchronous write-back taking longer depending on dirty set size.
  • The 1999 SOSP paper reported IPC performance within 2x of L4 (the fastest microkernel at the time) despite the persistence overhead.

1.4 Capabilities: Keys, Nodes, and Domains

EROS (following KeyKOS) uses a specific capability model with three fundamental concepts:

Keys (capabilities):

A key is an unforgeable reference to an object. Keys are the ONLY way to access anything in the system. There are several types:

  • Page keys – reference a persistent page. Can be read-only or read-write.
  • Node keys – reference a node (a 32-slot capability container). Can be read-only.
  • Process keys (called “domain keys” in KeyKOS) – reference a process, allowing control operations (start, stop, set registers).
  • Number keys – encode a 96-bit value directly in the key (no indirection). Used for passing constants through the capability mechanism.
  • Device keys – reference hardware device registers.
  • Forwarder keys – indirection keys used for revocation (see below).
  • Void keys – null/invalid keys, used as placeholders.

Nodes:

A node is a persistent container of exactly 32 key slots (in KeyKOS; EROS varied this slightly). Nodes serve multiple purposes:

  • Address space description: A tree of nodes with page keys at the leaves defines a process’s virtual address space. The kernel walks this tree to resolve virtual addresses to physical pages (analogous to page tables, but persistent and capability-based).
  • Capability storage: A process’s “capability table” is a node tree.
  • General-purpose data structure: Any capability-based data structure (directories, lists, etc.) is built from nodes.

Domains (processes):

A domain is EROS’s equivalent of a process. It consists of:

  • A domain root node with specific slots for:
    • Slot 0-15: general-purpose key registers (the process’s capability table)
    • Address space key (points to the root of the address space node tree)
    • Schedule key (determines CPU time allocation)
    • Brand key (identity for authentication)
    • Other control keys
  • The domain’s register state (general-purpose registers, IP, SP, flags)
  • A state (running, waiting, available)

The entire domain state is captured during checkpoint because it’s all stored in persistent nodes and pages.

1.5 The Keeper Mechanism

Each domain has a keeper key – a capability to another domain that acts as its fault handler. When a domain faults (page fault, capability fault, exception), the kernel invokes the keeper:

  1. The faulting domain is suspended.
  2. The kernel sends a message to the keeper describing the fault.
  3. The keeper can inspect and modify the faulting domain’s state (via the domain key), fix the fault (e.g., map a page, supply a capability), and restart it.

This is EROS’s equivalent of signal handlers or exception ports, but capability-mediated and fully general. Keepers enable:

  • Demand paging (the space bank keeper maps pages on fault)
  • Capability interposition (a keeper can wrap/restrict capabilities)
  • Process supervision (restart on crash)

1.6 Capability Revocation

Capability revocation – the ability to invalidate all copies of a capability – is one of the hardest problems in capability systems. EROS solves it with forwarder keys (called “sensory keys” in some descriptions):

How forwarders work:

  1. Instead of giving a client a direct key to a resource, the server creates a forwarder node.
  2. The forwarder contains a key to the real resource in one of its slots.
  3. The client receives a key to the forwarder, not the resource.
  4. When the client invokes the forwarder key, the kernel transparently redirects to the real resource.
  5. To revoke: the server rescinds the forwarder (sets a bit on the forwarder node). All outstanding forwarder keys become void keys. Invocations fail immediately.

Properties:

  • Revocation is O(1) – flip a bit on the forwarder node. No need to scan all processes for copies.
  • Revocation is transitive – if the revoked key was used to derive other keys (via further forwarders), those are also invalidated.
  • The client cannot distinguish a forwarder key from a direct key (the kernel handles the indirection transparently).
  • Revocation is immediate and irrevocable.

Space banks and revocation:

EROS uses space banks (inspired by KeyKOS) to manage resource allocation. A space bank is a capability that allocates pages and nodes. When a space bank is destroyed, ALL objects allocated from it are reclaimed. This provides bulk revocation of an entire subsystem.

1.7 Confinement

EROS provides a formally verified confinement mechanism. A confined subsystem cannot leak information to the outside world except through channels explicitly provided to it. Shapiro and Weber (IEEE S&P 2000) proved that EROS can construct a confined subsystem using:

  1. A constructor creates the confined process.
  2. The confined process receives ONLY the capabilities explicitly granted to it. It has no ambient authority, no access to timers (to prevent timing channels), and no access to storage (to prevent storage channels).
  3. The constructor verifies that no covert channels exist in the granted capability set.

This is relevant to capOS’s capability model: the same structural properties that make EROS confinement possible (no ambient authority, capabilities as the only access mechanism) are present in capOS’s design.


2. CapROS

2.1 Relationship to EROS

CapROS (Capability-based Reliable Operating System) is the direct successor to EROS. It was started by Charles Landau (who also worked on KeyKOS) and continues development based on the EROS codebase. CapROS is essentially “EROS in production” – the same architecture with engineering improvements.

2.2 Improvements Over EROS

Practical engineering focus:

  • EROS was a research system; CapROS aims to be deployable.
  • CapROS added support for modern hardware (PCI, USB, networking).
  • Improved build system and development toolchain.

Persistence improvements:

  • CapROS refined the checkpoint mechanism for better performance with modern disk characteristics (SSDs change the cost model significantly – random writes are cheap, so the checkpoint layout can be optimized differently than for spinning disks).
  • Added support for larger persistent object spaces.
  • Improved crash recovery speed.

Device driver model:

  • CapROS runs device drivers as user-space processes (like EROS), each receiving only the device capabilities they need.
  • A device driver receives: device register keys (MMIO access), interrupt keys (to receive interrupts), and DMA buffer keys.
  • The driver CANNOT access other devices, other processes’ memory, or arbitrary I/O ports. It is confined to its specific device.
  • This is directly analogous to capOS’s planned device capability model (see the networking and cloud deployment proposals).

Linux compatibility layer:

  • CapROS includes a partial Linux kernel compatibility layer that allows some Linux device drivers to be compiled and run as CapROS user-space drivers. This pragmatically addresses the “driver availability” problem without compromising the capability model.

2.3 Current Status

CapROS development continued into the 2010s but has been relatively quiet. The codebase exists and runs on real x86 hardware. It is not widely deployed and remains primarily a research/demonstration system. The key contribution is demonstrating that the EROS/KeyKOS persistent capability model is viable on modern hardware and can support real device drivers and applications.

2.4 Device Drivers and Hardware Access

CapROS’s device driver isolation is worth examining in detail because capOS faces the same design decisions:

Device capability model:

Kernel
  │
  ├── DeviceManager capability
  │     │
  │     ├── grants DeviceMMIO(base, size) to driver
  │     ├── grants InterruptCap(irq_number) to driver
  │     └── grants DMAPool(phys_range) to driver
  │
  └── Driver process
        │
        ├── uses DeviceMMIO to read/write registers
        ├── uses InterruptCap to wait for interrupts
        ├── uses DMAPool to allocate DMA-safe buffers
        └── exports higher-level capability (e.g., NIC, Block)

The driver has no way to access memory outside its granted ranges. A buggy NIC driver cannot corrupt disk I/O or access other processes’ pages.


3. Coyotos

3.1 Design Philosophy

Coyotos was Jonathan Shapiro’s next-generation project after EROS, started around 2004. Where EROS was an implementation of the KeyKOS model in C, Coyotos aimed to be a formally verifiable capability OS from the ground up.

Key differences from EROS:

  • Verification-oriented design: Every kernel mechanism was designed to be amenable to formal verification. If a feature couldn’t be verified, it was redesigned or removed.
  • BitC language: A new programming language (BitC) was designed specifically for writing verified systems software.
  • Simplified object model: Coyotos reduced the number of primitive object types compared to EROS, making the verification target smaller.
  • No inline assembly in the verified core: The verified kernel core was to be written entirely in BitC, with a thin hardware abstraction layer underneath.

3.2 BitC Language

BitC was an ambitious attempt to create a language suitable for both systems programming and formal verification:

Design goals:

  • Type safety: Sound type system that prevents memory errors at compile time.
  • Low-level control: Direct memory layout control, no garbage collector, suitable for kernel code.
  • Formal reasoning: Type system designed so that proofs about programs could be mechanically checked.
  • Mutability control: Explicit distinction between mutable and immutable references (predating Rust’s borrow checker by several years).

Relationship to capability verification:

The key insight was that if the kernel is written in a language with a sound type system, and capabilities are represented as typed references in that language, then many capability safety properties (no forgery, no amplification) follow from type safety rather than requiring separate proofs.

Specifically:

  • Capabilities are opaque typed references – the type system prevents construction of capabilities from raw integers.
  • The lack of arbitrary pointer arithmetic prevents capability forgery.
  • Type-based access control means a read-only capability reference cannot be cast to a read-write one.

Outcome:

BitC was never completed. The language design proved extremely difficult – combining low-level systems programming with formal verification requirements created unsolvable tensions in the type system. Shapiro eventually acknowledged that the BitC approach was overambitious and shelved the project. (Rust, which appeared later, solved many of the same problems with a different approach – borrowing and lifetimes rather than full dependent types.)

3.3 Formal Verification Approach

Coyotos aimed to verify several key properties:

  1. Capability safety: No process can forge, modify, or amplify a capability. This was to be proved as a consequence of BitC’s type safety.
  2. Confinement: A confined subsystem cannot leak information except through authorized channels. EROS proved this informally; Coyotos aimed for machine-checked proofs.
  3. Authority propagation: Formal model of how authority flows through the capability graph, allowing static analysis of security policies.
  4. Memory safety: The kernel never accesses memory it shouldn’t, never double-frees, never uses after free. Type safety + linear types in BitC were intended to guarantee this.

The verification approach influenced later work on seL4, which successfully achieved formal verification of a capability microkernel (though in C with Isabelle/HOL proofs, not in a verification-oriented language).

3.4 Coyotos Memory Model

Coyotos simplified the EROS memory model while retaining persistence:

Objects:

  • Pages: 4KB data pages (same as EROS).
  • CapPages: Pages that hold capabilities instead of data. This replaced EROS’s fixed-size nodes with variable-size capability containers.
  • GPTs (Guarded Page Tables): A unified abstraction for address space construction. Instead of EROS’s separate node trees for address spaces, Coyotos uses GPTs that combine guard bits (for sparse address space construction, similar to Patricia trees) with page table semantics.
  • Processes: Similar to EROS domains but with a cleaner structure.
  • Endpoints: IPC communication endpoints (similar to L4 endpoints, replacing EROS’s direct domain-to-domain calls).

GPTs (Guarded Page Tables):

This was Coyotos’s most innovative memory model contribution. A GPT node has:

  • A guard value and guard length (for address space compression).
  • Multiple capability slots pointing to sub-GPTs or pages.
  • Hardware-independent address space description that the kernel translates to actual page tables on TLB miss.

The guard mechanism allows sparse address spaces without allocating intermediate page table levels. For example, a process that uses only two memory regions at addresses 0x1000 and 0x7FFF_F000 needs only a few GPT nodes, not a full 4-level page table tree.

Persistence:

Coyotos retained EROS’s checkpoint-based persistence but with a cleaner separation between the persistent object store and the in-memory cache. The simpler object model (fewer object types) made the checkpoint logic easier to verify.

3.5 Current Status

Coyotos was never completed. The BitC language proved too difficult, and Shapiro moved on to other work. However, Coyotos’s design documents and specifications remain valuable as a carefully reasoned evolution of the EROS model. The key ideas (GPTs, endpoint-based IPC, verification-oriented design) influenced other systems work.


4. Single-Level Store: Deep Dive

4.1 The Core Concept

The single-level store unifies two traditionally separate abstractions:

Traditional OSSingle-Level Store
Virtual memory (RAM, volatile)Unified persistent object space
File system (disk, persistent)Same unified space
mmap (bridge between the two)No bridge needed
Serialization (convert objects to bytes for storage)Objects are always in storable form
Crash recovery (fsck, journal replay)Checkpoint restore

In a single-level store, the programmer never thinks about persistence. Objects are created, modified, and eventually garbage collected. The system ensures they survive power failure without any explicit save operation.

4.2 Implementation in EROS

EROS’s single-level store works as follows:

Object storage on disk:

  • The disk is divided into two regions: the object store and the checkpoint log.
  • The object store holds the canonical copy of all objects (pages and nodes), indexed by OID.
  • The checkpoint log holds the most recently checkpointed versions of modified objects.

Object lifecycle:

  1. An object is created (allocated from a space bank). It receives a unique OID.
  2. The object exists in the in-memory object cache. It may be modified arbitrarily.
  3. During checkpoint, if the object is dirty, its current state is written to the checkpoint log.
  4. After the checkpoint commits, the logged version may be migrated to the object store (or left in the log until the next checkpoint).
  5. If the object is evicted from memory (memory pressure), it can be demand-paged back from disk.

Demand paging:

When a process accesses a virtual address that isn’t currently in physical memory:

  1. Page fault occurs.
  2. The kernel looks up the OID for that virtual page (by walking the address space capability tree).
  3. If the object is on disk, the kernel reads it into the object cache.
  4. The page is mapped into the process’s address space.
  5. The process continues, unaware that I/O occurred.

This is similar to demand paging in a conventional OS, but with a critical difference: the “backing store” is the persistent object store, not a swap partition. There is no separate swap space.

4.3 Performance Implications

Advantages:

  • No serialization overhead for persistence. Objects are stored in their in-memory format.
  • No double-buffering. A conventional OS may have a page in both the page cache and a file buffer; EROS has one copy.
  • Checkpoint cost is proportional to mutation rate, not data size.
  • Recovery is instantaneous – resume from last checkpoint, no log replay.

Disadvantages:

  • Checkpoint pause: Even with copy-on-write, there is a brief pause to snapshot the system state. KeyKOS/EROS measured this at milliseconds, but it can grow with the number of dirty pages.
  • Write amplification: Every modified page must be written to the checkpoint log, even if only one byte changed. This is worse than a log-structured filesystem that can coalesce small writes.
  • Memory pressure: The object cache competes with application working sets. Under heavy memory pressure, the system may thrash between paging objects in and checkpointing them out.
  • Large object stores: The OID-to-disk-location mapping must be kept in memory (or itself paged, adding complexity). For very large stores, this overhead grows.
  • No partial persistence: You can’t choose to make some objects transient and others persistent. Everything is persistent. This wastes disk bandwidth on objects that don’t need persistence (temporary buffers, caches, etc.).

4.4 Relationship to Persistent Memory (PMEM/Optane)

Intel Optane (3D XPoint, now discontinued but conceptually important) and other persistent memory technologies provide byte-addressable storage that survives power loss. This is remarkably close to what EROS simulates in software:

EROS Single-Level StorePMEM Hardware
Software checkpoint to diskHardware persistence on every write
Object cache in DRAMData in persistent memory
Demand paging from diskDirect load/store to persistent media
Crash = lose since last checkpointCrash = lose in-flight stores (cache lines)

PMEM makes the single-level store cheaper:

  • No checkpoint writes needed for objects stored in PMEM – they’re already persistent.
  • No demand paging from disk – PMEM is directly addressable.
  • Consistency requires cache line flush + fence (much cheaper than disk I/O).

But PMEM doesn’t eliminate the need for the store abstraction:

  • PMEM capacity is limited (compared to SSDs/HDDs). The object store may still need to tier between PMEM and block storage.
  • PMEM has higher latency than DRAM. The object cache still has value as a fast-path.
  • Crash consistency with PMEM requires careful ordering of writes (cache line flushes). The checkpoint model actually simplifies this – you don’t need per-object crash consistency, just per-checkpoint consistency.

Relevance to capOS:

Even without PMEM hardware, understanding the single-level store model informs how capOS can design its persistence layer. The key insight is that separating “in-memory format” from “on-disk format” creates unnecessary complexity. Cap’n Proto’s zero-copy serialization already blurs this line – a capnp message in memory has the same byte layout as on disk.


5. Persistent Capabilities

5.1 How Persistent Capabilities Survive Restarts

In EROS/KeyKOS, capabilities survive restarts because they are part of the checkpointed state:

  1. A capability is stored as a key in a node slot.
  2. The key contains: (object type, OID, permissions, other metadata).
  3. During checkpoint, all nodes (including their key slots) are written to disk.
  4. On restart, nodes are restored. Keys reference objects by OID. Since objects are also restored, the key resolves to the same object.

The critical property: capabilities are named by the persistent identity of their target, not by a volatile address. A key says “page #47293” not “memory address 0x12345.” Since page #47293 is persistent, the key is meaningful across restarts.

5.2 Consistency Model

EROS guarantees checkpoint consistency: the entire system is restored to the state at the last committed checkpoint. This means:

  • If process A sent a message to process B, and both the send and receive completed before the checkpoint, both see the message after restart.
  • If the send completed but the receive didn’t (checkpoint happened between them), both are rolled back to before the send. The message is lost, but the system is consistent.
  • There is no scenario where A thinks it sent a message but B never received it (or vice versa). The checkpoint captures a consistent global snapshot.

This is analogous to database transaction atomicity but applied to the entire system state.

5.3 Volatile State and Capabilities

Some capabilities reference inherently volatile state. EROS handles this through the object re-creation pattern:

Hardware devices:

  • Device keys reference hardware registers that don’t survive reboot.
  • On restart, the kernel re-initializes device state and re-creates device keys.
  • Processes that held device keys get valid keys again (pointing to the re-initialized device), but the device state itself is reset.
  • The process’s device driver is responsible for re-initializing the device to the desired state (this is application logic, not kernel logic).

Network connections:

  • EROS doesn’t have a native networking stack in the kernel, so this is handled at the application level.
  • A network service process re-establishes connections on restart.
  • Clients that held capabilities to network endpoints would invoke them, and the network service would transparently reconnect.
  • The capability abstraction hides the reconnection – the client’s code doesn’t change.

General pattern:

When a capability references state that can’t survive restart:

  1. The capability itself persists (it’s in a node slot, checkpointed).
  2. On restart, invoking the capability may trigger re-initialization.
  3. The keeper mechanism handles this: the target object’s keeper detects the stale state and re-initializes before completing the call.
  4. The client is unaware of the restart (or sees a transient error if re-initialization fails).

5.4 The Space Bank Model

Persistent capabilities create a garbage collection problem: when is it safe to reclaim a persistent object? EROS solves this with space banks:

  • A space bank is a capability that allocates objects (pages and nodes).
  • Every object is allocated from exactly one space bank.
  • Space banks can be hierarchical (a bank allocates from a parent bank).
  • Destroying a space bank reclaims ALL objects allocated from it.

This provides:

  • Bulk deallocation: Terminate a subsystem by destroying its bank.
  • Resource accounting: Each bank tracks how much space it has consumed.
  • Revocation: Destroying a bank revokes all capabilities to objects allocated from it (the objects cease to exist).

The space bank model avoids the need for a global garbage collector scanning the capability graph. Instead, resource lifetimes are explicitly managed through the bank hierarchy.


6. Relevance to capOS

6.1 Cap’n Proto as Persistent Capability Format

EROS stores capabilities as (type, OID, permissions) tuples in fixed-size node slots. capOS can do something analogous but more naturally, because Cap’n Proto already provides a serialization format for structured data:

A persistent capability in capOS could be a capnp struct:

struct PersistentCapRef {
  interfaceId @0 :UInt64;   # which capability interface
  objectId @1 :UInt64;      # persistent object identity
  permissions @2 :UInt32;   # bitmask of allowed methods
  epoch @3 :UInt64;         # revocation epoch (see below)
}

Why this works well with Cap’n Proto:

  • Zero-copy persistence: A capnp message in memory has the same byte layout as on disk. No serialization/deserialization step for persistence. This is the closest a modern system can get to EROS’s single-level store without hardware support.
  • Schema evolution: Cap’n Proto’s backwards-compatible schema evolution means persistent capability formats can evolve without breaking existing stored capabilities.
  • Cross-machine references: The same PersistentCapRef can reference a local or remote object. The objectId can include a machine/node identifier for distributed capabilities.
  • Type safety: The interfaceId field provides runtime type checking that EROS’s keys lacked (EROS keys are untyped references; the type is determined by the target object, not the key).

Difference from EROS:

EROS capabilities are kernel objects – the kernel knows about every key and mediates every invocation. In capOS, PersistentCapRef could be a user-space construct – a serialized reference that is resolved by the kernel (or a userspace capability manager) when invoked. This is a deliberate trade-off: less kernel complexity, more flexibility, but the kernel must validate references on use rather than at creation time.

6.2 Checkpoint/Restart Patterns for capOS

EROS’s checkpoint model provides several patterns capOS could adopt:

This is what capOS’s storage proposal already describes: services serialize their own state to the Store capability. This is simpler than EROS’s transparent persistence but requires application cooperation.

Service state → capnp serialize → Store.put(data) → persistent hash
On restart: Store.get(hash) → capnp deserialize → restore state

Advantages over EROS transparent persistence:

  • No kernel complexity for checkpointing.
  • Services control what is persistent and what is transient.
  • No “checkpoint pause” – services choose when to persist.
  • Natural fit with Cap’n Proto (state is already capnp).

Disadvantages:

  • Every service must implement save/restore logic.
  • No automatic consistency across services (each saves independently).
  • Programmer error can lead to inconsistent state after restart.

Pattern 2: Kernel-Assisted Checkpointing (Phase 2)

Add a Checkpoint capability that captures process state:

interface Checkpoint {
  # Save the calling process's state (registers, memory, cap table)
  save @0 () -> (handle :Data);
  # Restore a previously saved state
  restore @1 (handle :Data) -> ();
}

This is analogous to CRIU (Checkpoint/Restore in Userspace) on Linux but capability-mediated:

  • The kernel captures the process’s address space, register state, and capability table.
  • State is serialized as capnp messages and stored via the Store capability.
  • Restore creates a new process from the saved state.

Advantages:

  • Transparent to the application (no save/restore logic needed).
  • Can capture the full capability graph of a process.
  • Enables process migration between machines.

Disadvantages:

  • Kernel complexity for state capture.
  • Must handle capabilities that reference volatile state (open network connections, device handles).
  • Memory overhead for copy-on-write snapshots.

Pattern 3: Consistent Multi-Process Checkpointing (Phase 3)

EROS’s global checkpoint extended to capOS:

  • A CheckpointCoordinator service initiates a distributed snapshot.
  • All participating services freeze, checkpoint their state, then resume.
  • The coordinator records a consistent cut across all services.
  • Recovery restores all services to the same consistent point.

This requires:

  • A coordination protocol (similar to distributed database commit).
  • Services must participate in the protocol (register with the coordinator, respond to freeze/checkpoint/resume signals).
  • The coordinator must handle failures during the checkpoint itself.

This is the most complex option but provides the strongest consistency guarantees. It’s appropriate for capOS’s later stages when multi-service reliability matters.

6.3 Capability-Native Filesystem Design

EROS’s model and capOS’s Store proposal can be synthesized into a capability-native filesystem design:

Hybrid approach: Content-Addressed Store + Capability Metadata

capOS’s current Store proposal uses content-addressed storage (hash-based). This is good for immutable data but awkward for capability references (a capability’s target may change without the capability itself changing).

A better model, informed by EROS:

Persistent Object = (ObjectId, Version, CapnpData, CapSlots[])

Where:

  • ObjectId is a persistent identity (like EROS’s OID).
  • Version is a monotonic counter (for optimistic concurrency).
  • CapnpData is the object’s data payload as a capnp message.
  • CapSlots[] is a list of capability references embedded in the object (like EROS’s node slots).

This separates data from capability references, which is important because:

  • Data can be content-addressed (deduplicated by hash).
  • Capability references must be identity-addressed (two identical-looking references to different objects are different).
  • Revocation operates on capability references, not data.

The Namespace as Directory

capOS’s Namespace capability is the capability-native equivalent of a directory:

UnixEROScapOS
Directory (inode + dentries)Node with keys in slotsNamespace capability
Path traversalNode tree walkNamespace.resolve() chain
Permission bitsKey type + slot permissionsCapability attenuation
Hard linksMultiple keys to same objectMultiple refs to same hash
Symbolic linksForwarder keysRedirect capabilities

Journaling and Crash Consistency

EROS avoids journaling by using checkpoint-based consistency. capOS’s Store service needs its own consistency story:

Option A: Checkpoint-based (EROS-style)

  • Store service maintains an in-memory cache of recent modifications.
  • Periodically flushes a consistent snapshot to disk.
  • On crash, recovers to last flush point.
  • Simple but may lose recent writes.

Option B: Log-structured (modern)

  • All writes go to an append-only log.
  • A background compaction process builds indexed snapshots from the log.
  • On crash, replay the log from the last snapshot.
  • More complex but no data loss window.

Option C: Hybrid

  • Capability metadata (the namespace bindings) uses a write-ahead log for crash consistency.
  • Object data (capnp blobs in the content-addressed store) uses checkpoint-based consistency (losing a few blobs is tolerable; losing a namespace binding is not).

Option C is recommended for capOS: it provides strong consistency for the critical metadata while keeping the data path simple.

6.4 Transparent vs Explicit Persistence: Tradeoffs

AspectEROS TransparentcapOS ExplicitHybrid
Application complexityNone (automatic)High (must implement save/restore)Medium (opt-in transparency)
Kernel complexityVery high (checkpoint, COW, object store)Low (just IPC and memory)Medium (checkpoint capability)
ConsistencyStrong (global checkpoint)Weak (per-service)Medium (coordinator)
ControlNone (everything persists)Full (choose what to save)Selective
PerformanceCheckpoint pausesNo pauses, explicit I/O costConfigurable
Volatile stateKeeper mechanism handlesService handles reconnectionAnnotated capabilities
DebuggabilityHard (system is a black box)Easy (state is explicit capnp)Medium
Cap’n Proto fitNeutralExcellent (state = capnp)Good

Recommendation for capOS:

Start with explicit persistence (Phase 1 in the storage proposal) because:

  1. It’s dramatically simpler to implement.
  2. Cap’n Proto makes serialization nearly free anyway.
  3. It gives services control over what is persistent.
  4. It aligns with capOS’s existing Store/Namespace design.
  5. The kernel stays simple.

Then add opt-in kernel-assisted checkpointing (like the Checkpoint capability described above) for services that want transparent persistence. This gives the benefits of EROS’s model without forcing it on everything.

Never implement EROS’s fully transparent global persistence – the kernel complexity is enormous, the debugging experience is poor, and modern systems (with fast SSDs and capnp zero-copy serialization) don’t need it. The explicit model with good tooling is strictly better for a research OS.

6.5 Capability Revocation in capOS

EROS’s forwarder key model translates directly to capOS:

Epoch-based revocation:

Each capability includes a revocation epoch. The kernel (or capability manager) maintains a per-object epoch counter. When a capability is invoked:

  1. Check that the capability’s epoch matches the object’s current epoch.
  2. If it doesn’t match, the capability has been revoked – return an error.
  3. To revoke all capabilities to an object, increment the object’s epoch.

This is O(1) revocation (increment a counter) with O(1) check per invocation (compare two integers). It’s simpler than EROS’s forwarder mechanism and fits naturally into a capnp-serialized capability reference:

struct CapRef {
  objectId @0 :UInt64;
  epoch @1 :UInt64;        # revocation epoch
  permissions @2 :UInt32;  # method bitmask
  interfaceId @3 :UInt64;  # type of the capability
}

Space bank analog:

capOS can implement EROS’s space bank pattern using the Store:

  • Each “bank” is a Namespace prefix in the Store.
  • Objects allocated by a service are stored under its namespace.
  • Destroying the service’s namespace revokes access to all its objects.
  • Resource accounting is done by the Store service (track bytes per namespace).

6.6 Summary of Recommendations

EROS/CapROS/Coyotos ConceptcapOS Recommendation
Single-level storeDon’t implement (too complex for research OS). Use Cap’n Proto zero-copy as a lightweight equivalent.
Checkpoint/restartPhase 1: application-level (explicit capnp save/restore). Phase 2: Checkpoint capability for opt-in transparent persistence.
Persistent capabilitiesUse capnp PersistentCapRef struct with objectId + epoch. Store capability graph in the Store service.
Capability revocationEpoch-based revocation (increment counter, check on invocation). Simpler than EROS forwarders, same O(1) cost.
Space banksMap to Store namespaces. Destroying a namespace reclaims all objects.
Keeper/fault handlerMap to capOS’s supervisor mechanism (service-architecture proposal). Supervisor receives fault notifications and can restart/repair.
GPTs (Coyotos)Not needed – capOS uses hardware page tables directly. The sparse address-space idea remains relevant for future SharedBuffer/AddressRegion work beyond the current VirtualMemory cap.
ConfinementcapOS already has the structural prerequisites (no ambient authority). Formal confinement proofs are a future research direction.
Device isolationAlready planned in capOS (device capabilities with MMIO/interrupt/DMA grants). CapROS validates this approach works in practice.

Key References

  • Shapiro, J. S., Smith, J. M., Farber, D. J. “EROS: A Fast Capability System.” Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999.
  • Shapiro, J. S. “EROS: A Capability System.” PhD dissertation, University of Pennsylvania, 1999.
  • Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism.” IEEE Symposium on Security and Privacy, 2000.
  • Hardy, N. “The Confused Deputy.” ACM SIGOPS Operating Systems Review, 1988. (Motivates capability-based access control.)
  • Hardy, N. “KeyKOS Architecture.” Operating Systems Review, 1985.
  • Landau, C. R. “The Checkpoint Mechanism in KeyKOS.” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, 1992.
  • Shapiro, J. S. et al. “Coyotos Microkernel Specification.” Technical report, Johns Hopkins University, 2004-2008.
  • Shapiro, J. S. et al. “BitC Language Specification.” Technical report, Johns Hopkins University, 2004-2008.
  • Dennis, J. B. & Van Horn, E. C. “Programming Semantics for Multiprogrammed Computations.” Communications of the ACM, 1966. (Original capability concept.)
  • Levy, H. M. “Capability-Based Computer Systems.” Digital Press, 1984. (Comprehensive survey of capability systems including CAP, Hydra, iAPX 432, StarOS.)

LLVM Target Customization for capOS

Deep research report on creating custom LLVM/Rust/Go targets for a capability-based OS.

Status 2026-04-30 00:41 UTC: capOS keeps the kernel on x86_64-unknown-none, while userspace builds through the checked-in x86_64-unknown-capos target plus the runtime linker-script path. Since this report was first written, PT_TLS parsing, userspace TLS block setup, FS-base save/restore, the VirtualMemory capability, a #[thread_local] QEMU smoke, Timer now/sleep, current-execution-context ThreadControl FS-base updates, the single-thread runtime checkpoint, process-local thread lifecycle, and private ParkSpace wait/wake have landed. Anonymous VirtualMemory unmap/decommit and explicit MemoryObject.unmap now drain private park waiters before address reuse. Runtime park clients, Go futexsleep/futexwake glue, per-thread TLS ownership for full multi-thread runtime use, shared park words, address-space generation cleanup, and a Go port remain future work.

Table of Contents

  1. Custom OS Target Triple
  2. Calling Conventions
  3. Relocations
  4. TLS (Thread-Local Storage) Models
  5. Rust Target Specification
  6. Go Runtime Requirements
  7. Relevance to capOS

1. Custom OS Target Triple

Target Triple Format

LLVM target triples follow the format <arch>-<vendor>-<os> or <arch>-<vendor>-<os>-<env>:

  • arch: x86_64, aarch64, riscv64gc, etc.
  • vendor: unknown, apple, pc, etc. (often unknown for custom OSes)
  • os: linux, none, redox, hermit, fuchsia, etc.
  • env (optional): gnu, musl, eabi, etc.

For capOS, the eventual userspace target triple should be x86_64-unknown-capos. The kernel should keep using a freestanding target (x86_64-unknown-none) unless a kernel-specific target file becomes useful for build hygiene.

What LLVM Needs

LLVM’s target description consists of:

  1. Target machine: Architecture (instruction set, register file, calling conventions). x86_64 already exists in LLVM.
  2. Object format: ELF, COFF, Mach-O. capOS uses ELF.
  3. Relocation model: static, PIC, PIE, dynamic-no-pic.
  4. Code model: small, kernel, medium, large.
  5. OS-specific ABI details: Stack alignment, calling convention defaults, TLS model, exception handling mechanism.

LLVM does NOT need kernel-level knowledge of your OS. It needs to know how to generate correct object code for the target environment. The OS name in the triple primarily affects:

  • Default calling convention selection
  • Default relocation model
  • TLS model selection
  • Object file format and flags
  • C library assumptions (relevant for C compilation, less for Rust no_std)

Creating a New OS in LLVM (Upstream Path)

To add capos as a recognized OS in LLVM itself:

  1. Add the OS to llvm/include/llvm/TargetParser/Triple.h (the OSType enum)
  2. Add string parsing in llvm/lib/TargetParser/Triple.cpp
  3. Define ABI defaults in the relevant target (llvm/lib/Target/X86/)
  4. Update Clang’s driver for the new OS (clang/lib/Driver/ToolChains/, clang/lib/Basic/Targets/)

This is significant upstream work and not necessary initially. The pragmatic path is using Rust’s custom target JSON mechanism (see Section 5).

What Other OSes Do

OSLLVM statusApproach
RedoxUpstream in Rust; no dedicated LLVM OS enum in current LLVMFull triple x86_64-unknown-redox, Tier 2 in Rust
HermitUpstream in LLVM and Rustx86_64-unknown-hermit, Tier 3, unikernel
FuchsiaUpstream in LLVM and Rustx86_64-unknown-fuchsia, Tier 2
TheseusCustom target JSONUses x86_64-unknown-theseus JSON spec, not upstream
Blog OS (phil-opp)Custom target JSONUses JSON target spec, targets x86_64-unknown-none base
seL4/RobigaliaCustom target JSONModified from x86_64-unknown-none

Recommendation for capOS: keep the kernel on x86_64-unknown-none. Introduce a userspace-only custom target JSON when cfg(target_os = "capos") or toolchain packaging becomes valuable. Do not upstream a capos OS triple until the userspace ABI is stable.

Treat the userspace target as build hygiene and runtime scaffolding for now. It does not promise a stable language ABI, Rust std, Go, C runtime, or upstream target contract beyond the current static no_std userspace model.


2. Calling Conventions

LLVM Calling Conventions

LLVM supports numerous calling conventions. The ones relevant to capOS:

CCLLVM IDDescriptionRelevance
C0Default C calling convention (System V AMD64 ABI on x86_64)Primary for interop
Fast8Optimized for internal use, passes in registersRust internal use
Cold9Rarely-called functions, callee-save heavyError paths
GHC10Glasgow Haskell Compiler, everything in registersNot relevant
HiPE11Erlang HiPE, similar to GHCNot relevant
WebKit JS12JavaScript JITNot relevant
AnyReg13Dynamic register allocationJIT compilers
PreserveMost14Caller saves almost nothingInterrupt handlers
PreserveAll15Caller saves nothingContext switches
Swift16Swift self/error registersNot relevant
CXX_FAST_TLS17C++ TLS access optimizationTLS wrappers
X86_StdCall64Windows stdcallNot relevant
X86_FastCall65Windows fastcallNot relevant
X86_RegCall95Register-based callingPerformance-critical code
X86_INTR83x86 interrupt handlerIDT handlers
Win6479Windows x64 calling conventionNot relevant

System V AMD64 ABI (The Default for capOS)

On x86_64, the System V AMD64 ABI (CC 0, “C”) is the standard:

  • Integer args: RDI, RSI, RDX, RCX, R8, R9
  • Float args: XMM0-XMM7
  • Return: RAX (integer), XMM0 (float)
  • Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
  • Callee-saved: RBX, RBP, R12-R15
  • Stack alignment: 16-byte at call site
  • Red zone: 128 bytes below RSP (unavailable in kernel mode)

capOS already uses this convention – the syscall handler in kernel/src/arch/x86_64/syscall.rs maps syscall registers to System V registers before calling syscall_handler.

Customizing for a New OS Target

For a custom OS, calling convention customization is usually minimal:

  1. Kernel code: Disable the red zone (capOS already does this via x86_64-unknown-none which sets "disable-redzone": true). The red zone is unsafe in interrupt/syscall contexts.

  2. Userspace code: Standard System V ABI is fine. The red zone is safe in userspace.

  3. Syscall convention: This is an OS design choice, not an LLVM CC. capOS uses: RAX=syscall number, RDI-R9=args (matching System V for easy dispatch). Linux uses a slightly different register mapping (R10 instead of RCX for arg4, because SYSCALL clobbers RCX).

  4. Interrupt handlers: Use X86_INTR (CC 83) or manual save/restore. capOS currently uses manual asm stubs.

Cross-Language Interop Implications

LanguagesConventionNotes
Rust <-> RustRust ABI (unstable)Internal to a crate, not stable across crates
Rust <-> Cextern "C" (System V)Stable, well-defined. Used for libcapos API
Rust <-> GoComplex (see Section 6)Go has its own internal ABI (ABIInternal)
C <-> Goextern "C" via cgoGo’s cgo bridge, heavy overhead
Any <-> KernelSyscall conventionRegister-based, OS-defined, not a CC

Key point: The System V AMD64 ABI is the lingua franca. All languages can produce extern "C" functions. capOS should standardize on System V for all cross-language boundaries and capability invocations.

Go’s internal ABI (ABIInternal, using R14 as the g register) is different from System V. Go functions called from outside Go must go through a trampoline. This is handled by the Go runtime, not something capOS needs to solve at the LLVM level.


3. Relocations

LLVM Relocation Models

ModelFlagDescription
static-relocation-model=staticAll addresses resolved at link time. No GOT/PLT.
pic-relocation-model=picPosition-independent code. Uses GOT for globals, PLT for calls.
dynamic-no-pic-relocation-model=dynamic-no-picLike static but with dynamic linking support (macOS legacy).
ropi-relocation-model=ropiRead-only position-independent (ARM embedded).
rwpi-relocation-model=rwpiRead-write position-independent (ARM embedded).
ropi-rwpi-relocation-model=ropi-rwpiBoth ROPI and RWPI (ARM embedded).

Code Models (x86_64)

ModelFlagAddress RangeUse Case
small-code-model=small0 to 2GBUserspace default
kernel-code-model=kernelTop 2GB (negative 32-bit)Higher-half kernel
medium-code-model=mediumCode in low 2GB, data anywhereLarge data sets
large-code-model=largeNo assumptionsMaximum flexibility, worst performance

What capOS Currently Uses

From .cargo/config.toml:

[target.x86_64-unknown-none]
rustflags = ["-C", "link-arg=-Tkernel/linker-x86_64.ld", "-C", "code-model=kernel", "-C", "relocation-model=static"]
  • Kernel: code-model=kernel + relocation-model=static. Correct for a higher-half kernel at 0xffffffff80000000. All kernel symbols are in the top 2GB of virtual address space, so 32-bit sign-extended addressing works.

  • Init/demos/capos-rt/shell/libcapos/libcapos-posix/capos-wasm userspace: All standalone userspace crates build against targets/x86_64-unknown-capos.json (checked in at that path) via the build-*-capos Cargo aliases in .cargo/config.toml. The target sets code-model = "small", relocation-model = "static", os = "capos", has-thread-local = true, and tls-model = "local-exec". The pinned nightly toolchain is nightly-2026-04-20; verify the effective LLVM version with rustc --version --verbose against that toolchain date.

Kernel vs. Userspace Requirements

Kernel:

  • Static relocations, kernel code model.
  • No PIC overhead needed – the kernel is loaded at a known address.
  • The linker script places everything in the higher half.
  • This is the correct and standard approach (Linux kernel does the same).

Userspace (current – static binaries):

  • Static relocations. A future custom userspace target should choose the small code model explicitly.
  • Simple, no runtime relocator needed.
  • Binary is loaded at a fixed address (0x200000).
  • Works perfectly for single-binary-per-address-space.

Userspace (future – if shared libraries or ASLR desired):

  • PIE (Position-Independent Executable) = PIC + static linking.
  • Requires a dynamic loader or kernel-side relocator.
  • Enables ASLR (Address Space Layout Randomization) for security.
  • Adds GOT indirection overhead (typically < 5% performance impact).

Position-Independent Code in a Capability Context

PIC/PIE is relevant to capOS for several reasons:

  1. ASLR: PIE enables loading binaries at random addresses, making ROP attacks harder. Even in a capability system, defense-in-depth matters.

  2. Shared libraries: If capOS ever supports shared objects (e.g., a shared libcapos.so), PIC is required for the shared library.

  3. WASI/Wasm: Not relevant – Wasm has its own memory model.

  4. Multiple instances: With static linking, two instances of the same binary can share read-only pages (text, rodata) if loaded at the same address. PIC/PIE allows sharing even at different addresses (copy-on-write for the GOT).

Recommendation for capOS: Keep static relocation for now. Consider PIE for userspace when implementing ASLR (after threading and IPC are stable). The kernel should remain static forever.


4. TLS (Thread-Local Storage) Models

LLVM TLS Models

LLVM supports four TLS models, in order from most dynamic to most constrained:

ModelDescriptionRuntime RequirementPerformance
general-dynamicAny module, any timeFull __tls_get_addr via dynamic linkerSlowest (function call per access)
local-dynamicSame module, any time__tls_get_addr for module base, then offsetSlow (one call per module per thread)
initial-execOnly modules loaded at startupGOT slot populated by dynamic linkerFast (one memory load)
local-execMain executable onlyDirect FS/GS offset, known at link timeFastest (single instruction)

How TLS Works on x86_64

On x86_64, TLS is accessed via the FS segment register:

  1. The OS sets the FS base address for each thread (via MSR_FS_BASE or arch_prctl(ARCH_SET_FS)).
  2. TLS variables are accessed as offsets from FS base:
    • local-exec: mov %fs:OFFSET, %rax (offset known at link time)
    • initial-exec: mov %fs:0, %rax; mov GOT_OFFSET(%rax), %rcx; mov %fs:(%rcx), %rdx
    • general-dynamic: call __tls_get_addr (returns pointer to TLS block)

Which Model for capOS?

Kernel:

  • The kernel does not use compiler TLS. Current TLS support is for loaded userspace ELF images only.
  • For SMP: per-CPU data via GS segment register (the standard approach). Set MSR_GS_BASE on each CPU to point to a PerCpu struct. swapgs on kernel entry switches between user and kernel GS base.
  • Kernel TLS model: Not applicable (per-CPU data is accessed via GS, not the compiler’s TLS mechanism).

Userspace (static binaries, no dynamic linker):

  • local-exec is the only correct choice. There’s no dynamic linker to resolve TLS relocations, so general-dynamic and initial-exec won’t work.
  • Implemented for the current single-threaded process model: the ELF parser records PT_TLS, the loader maps a Variant II TLS block plus TCB self pointer, and the scheduler saves/restores FS base on context switch.
  • Implemented for the current execution context: ThreadControl.setFsBase gives a runtime a capability-authorized equivalent to arch_prctl(ARCH_SET_FS).
  • ThreadControl.setFsBase affects only the current thread or execution context. There is no process-global FS-base mutation.
  • Still missing for future threading and full Go: per-thread TLS state and independently settable FS bases for each user thread.
  • Future thread creation must allocate or receive a distinct TLS block and FS base per ThreadRef; treating TLS as process-global would break Rust #[thread_local], Go g state, and any C runtime that assumes per-thread TLS.
  • Current-process/current-thread FS-base operations are useful for the single-thread runtime checkpoint, but they are not the final threading ABI. True multi-threaded Go or C/POSIX-like runtime support requires per-ThreadRef TLS allocation, per-thread FS-base ownership, and context switches that save/restore FS base as thread state.

Userspace (with dynamic linker, future):

  • initial-exec for the main executable and preloaded libraries.
  • general-dynamic for dlopen()-loaded libraries.
  • Requires implementing __tls_get_addr in the dynamic linker.

TLS Initialization Sequence

For a statically-linked userspace binary with local-exec TLS:

1. Kernel creates thread
2. Kernel allocates TLS block (size from ELF TLS program header)
3. Kernel copies .tdata (initialized TLS) into TLS block
4. Kernel zeros .tbss (uninitialized TLS) in TLS block
5. Kernel sets FS base = TLS block address (writes MSR_FS_BASE)
6. Thread starts executing; %fs:OFFSET accesses TLS directly

The ELF file contains two TLS sections:

  • .tdata (PT_TLS segment, initialized thread-local data)
  • .tbss (zero-initialized thread-local data, like .bss but per-thread)

The PT_TLS program header tells the loader:

  • Virtual address and file offset of .tdata
  • p_memsz = total TLS size (including .tbss)
  • p_filesz = size of .tdata only
  • p_align = required alignment

FS/GS Base Register Usage Plan

RegisterUsed ByPurpose
FSUserspace threadsThread-local storage (set per-thread by kernel)
GSKernel (via swapgs)Per-CPU data (set per-CPU during boot)

This is the standard Linux convention and what Go expects (Go uses arch_prctl(ARCH_SET_FS) to set the FS base for each OS thread).

What capOS Has and Still Needs

  1. Implemented: parse PT_TLS in capos-lib/src/elf.rs.
  2. Implemented: allocate/map a TLS block during process image load in kernel/src/spawn.rs.
  3. Implemented: copy .tdata, zero .tbss, and write the TCB self pointer for the current Variant II static TLS layout.
  4. Implemented: save/restore FS base through kernel/src/sched.rs and kernel/src/arch/x86_64/tls.rs.
  5. Implemented for the current process execution context: ThreadControl.getFsBase and ThreadControl.setFsBase.
  6. Still needed: per-thread FS-base state for future multi-threaded userspace.

5. Rust Target Specification

How Custom Targets Work

Rust supports custom targets via JSON specification files. The workflow:

  1. Create a <target-name>.json file
  2. Pass it to rustc: --target path/to/x86_64-unknown-capos.json
  3. Use with cargo via -Zbuild-std to build core/alloc/std from source

Target lookup priority:

  1. Built-in target names
  2. File path (if the target string contains / or .json)
  3. RUST_TARGET_PATH environment variable directories

The Rust target JSON schema is explicitly unstable. Generate examples from the pinned compiler with rustc -Z unstable-options --print target-spec-json and validate against that same compiler’s target-spec-json-schema before checking in a target file.

Viewing Existing Specs

# Print the JSON spec for a built-in target:
rustc +nightly -Z unstable-options --target=x86_64-unknown-none --print target-spec-json

# Print the JSON schema for all available fields:
rustc +nightly -Z unstable-options --print target-spec-json-schema

Example: x86_64-unknown-capos Kernel Target

Based on the current x86_64-unknown-none target, with capOS-specific adjustments. This is a sketch; regenerate from the pinned rustc schema before using it.

{
    "llvm-target": "x86_64-unknown-none-elf",
    "metadata": {
        "description": "capOS kernel (x86_64)",
        "tier": 3,
        "host_tools": false,
        "std": false
    },
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "cpu": "x86-64",
    "target-endian": "little",
    "target-pointer-width": 64,
    "target-c-int-width": 32,
    "os": "none",
    "env": "",
    "vendor": "unknown",
    "linker-flavor": "gnu-lld",
    "linker": "rust-lld",
    "pre-link-args": {
        "gnu-lld": ["-Tkernel/linker-x86_64.ld"]
    },
    "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
    "disable-redzone": true,
    "panic-strategy": "abort",
    "code-model": "kernel",
    "relocation-model": "static",
    "rustc-abi": "softfloat",
    "executables": true,
    "exe-suffix": "",
    "has-thread-local": false,
    "position-independent-executables": false,
    "static-position-independent-executables": false,
    "plt-by-default": false,
    "max-atomic-width": 64,
    "stack-probes": { "kind": "inline" }
}

Example: x86_64-unknown-capos Userspace Target

{
    "llvm-target": "x86_64-unknown-none-elf",
    "metadata": {
        "description": "capOS userspace (x86_64)",
        "tier": 3,
        "host_tools": false,
        "std": false
    },
    "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
    "arch": "x86_64",
    "cpu": "x86-64",
    "target-endian": "little",
    "target-pointer-width": 64,
    "target-c-int-width": 32,
    "os": "capos",
    "env": "",
    "vendor": "unknown",
    "linker-flavor": "gnu-lld",
    "linker": "rust-lld",
    "pre-link-args": {
        "gnu-lld": ["-Tinit/linker.ld"]
    },
    "features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
    "disable-redzone": false,
    "panic-strategy": "abort",
    "code-model": "small",
    "relocation-model": "static",
    "rustc-abi": "softfloat",
    "executables": true,
    "exe-suffix": "",
    "has-thread-local": true,
    "position-independent-executables": false,
    "static-position-independent-executables": false,
    "max-atomic-width": 64,
    "plt-by-default": false,
    "stack-probes": { "kind": "inline" },
    "tls-model": "local-exec"
}

Key JSON Fields

FieldPurposeTypical Values
llvm-targetLLVM triple for code generationx86_64-unknown-none-elf (reuse existing backend)
osOS name (affects cfg(target_os = "..."))"none", "capos", "linux"
archArchitecture name"x86_64", "aarch64"
data-layoutLLVM data layout stringCopy from same-arch target
linker-flavorWhich linker to use"gnu-lld", "gcc", "msvc"
linkerLinker binary"rust-lld", "ld.lld"
featuresCPU features to enable/disableDisable SIMD/FPU until context switching saves that state
disable-redzoneDisable System V red zonetrue for kernel, false for userspace
code-modelLLVM code model"kernel", "small"
relocation-modelLLVM relocation model"static", "pic"
panic-strategyHow to handle panics"abort", "unwind"
has-thread-localEnable #[thread_local]true for userspace now that PT_TLS/FS base works
tls-modelDefault TLS model"local-exec" for static binaries
max-atomic-widthLargest atomic type (bits)64 for x86_64
pre-link-argsArguments passed to linker before user argsLinker script path
position-independent-executablesGenerate PIE by defaultfalse for now
exe-suffixExecutable file extension"" for ELF
stack-probesStack overflow detection mechanism{"kind": "inline"} in the current freestanding x86_64 spec

The SIMD/FPU-disabled userspace target is a temporary runtime constraint, not a long-term property of x86_64-unknown-capos. It is acceptable only while the kernel lacks full FPU/SIMD context switching and language runtimes are confined to the current static no_std subset. Before Go, C, or full Rust std support, validate the target against each runtime’s amd64 codegen assumptions; mainstream amd64 runtimes may assume SSE2/FPU state even when application code does not explicitly use vector types.

Do not let the custom userspace target accidentally ossify a weaker ABI solely because early kernel context switching does not yet save full FPU/SIMD state. The final language-runtime target must be selected after the kernel’s amd64 context-switch state and the runtime’s codegen assumptions are both reviewed.

no_std vs std Support Path

Current state: capOS uses no_std + alloc. This works with any target, including x86_64-unknown-none.

Path to std support (what Redox, Hermit, and Fuchsia did):

  1. Phase 1: Custom target with os: "capos" (current report). Use -Zbuild-std=core,alloc to build core and alloc. No std.

  2. Phase 2: Add capOS to Rust’s std library. This requires:

    • Adding mod capos under library/std/src/sys/ with OS-specific implementations of: filesystem, networking, threads, time, stdio, process spawning, etc.
    • Each of these maps to capOS capabilities
    • Use cfg(target_os = "capos") throughout std
    • Build with -Zbuild-std=std
  3. Phase 3: Upstream the target (optional). Submit the target spec and std implementations to the Rust project. Requires sustained maintenance.

What Redox did: Redox implemented a full POSIX-like userspace (relibc) and added std support by implementing the sys module in terms of relibc syscalls. This made Redox a Tier 2 target with pre-built std artifacts.

What Hermit did: Hermit is a unikernel, so std is implemented directly in terms of Hermit’s kernel-level APIs. Tier 3, community maintained.

What Fuchsia did: Fuchsia implemented std using Fuchsia’s native zircon syscalls (handles, channels, VMOs – similar in spirit to capabilities). Tier 2.

Recommendation for capOS: Stay on no_std + alloc with the custom target JSON. std support is a large effort that should wait until the syscall surface is stable and threading works. When the time comes, Fuchsia’s approach (std over native capability syscalls) is the best model, since Fuchsia’s handle-based API is conceptually close to capOS’s capabilities.

Other OS Projects Reference

OSTargetTierstdApproach
Redoxx86_64-unknown-redox2Yesrelibc (custom libc) over Redox syscalls
Hermitx86_64-unknown-hermit3Yesstd directly over kernel API
Fuchsiax86_64-unknown-fuchsia2Yesstd over zircon handles (capability-like)
Theseusx86_64-unknown-theseusN/ANoCustom JSON, no_std, research OS
Blog OSCustom JSONN/ANoBased on x86_64-unknown-none
MOROSCustom JSONN/ANoSimple hobby OS

6. Go Runtime Requirements

Go’s Runtime Architecture

Go’s runtime is essentially a userspace operating system. It manages goroutine scheduling, garbage collection, memory allocation, and I/O multiplexing. The runtime interfaces with the actual OS through a narrow set of functions that each GOOS must implement.

Minimum OS Interface for a Go Port

Based on analysis of runtime/os_linux.go, runtime/os_plan9.go, and runtime/os_js.go, here is the minimum interface:

Tier 1: Absolute Minimum (single-threaded, like GOOS=js)

These functions are needed for “Hello, World!”:

func osinit()                                    // OS initialization
func write1(fd uintptr, p unsafe.Pointer, n int32) int32  // stdout/stderr output
func exit(code int32)                            // process termination
func usleep(usec uint32)                         // sleep (can be no-op initially)
func readRandom(r []byte) int                    // random data (for maps, etc.)
func goenvs()                                    // environment variables
func mpreinit(mp *m)                             // pre-init new M on parent thread
func minit()                                     // init new M on its own thread
func unminit()                                   // undo minit
func mdestroy(mp *m)                             // destroy M resources

Plus memory management (in runtime/mem_*.go):

func sysAllocOS(n uintptr) unsafe.Pointer        // allocate memory (mmap)
func sysFreeOS(v unsafe.Pointer, n uintptr)       // free memory (munmap)
func sysReserveOS(v unsafe.Pointer, n uintptr) unsafe.Pointer  // reserve VA range
func sysMapOS(v unsafe.Pointer, n uintptr)        // commit reserved pages
func sysUsedOS(v unsafe.Pointer, n uintptr)       // mark as used
func sysUnusedOS(v unsafe.Pointer, n uintptr)     // mark as unused (madvise)
func sysFaultOS(v unsafe.Pointer, n uintptr)      // remove access
func sysHugePageOS(v unsafe.Pointer, n uintptr)   // hint: use huge pages

Tier 2: Multi-threaded (real goroutines)

func newosproc(mp *m)                            // create OS thread (clone)
func exitThread(wait *atomic.Uint32)             // exit current thread
func futexsleep(addr *uint32, val uint32, ns int64)  // futex wait
func futexwakeup(addr *uint32, cnt uint32)        // futex wake
func settls()                                     // set FS base for TLS
func nanotime1() int64                            // monotonic nanosecond clock
func walltime() (sec int64, nsec int32)           // wall clock time
func osyield()                                    // sched_yield

Tier 3: Full Runtime (signals, profiling, network poller)

func sigaction(sig uint32, new *sigactiont, old *sigactiont)
func signalM(mp *m, sig int)                      // send signal to thread
func setitimer(mode int32, new *itimerval, old *itimerval)
func netpollopen(fd uintptr, pd *pollDesc) uintptr
func netpoll(delta int64) (gList, int32)
func netpollBreak()

Linux Syscalls Used by Go Runtime (Complete List)

From runtime/sys_linux_amd64.s:

Syscall#Go WrappercapOS Equivalent
read0runtime.readStore cap
write1runtime.write1Console cap
close3runtime.closefdCap drop
mmap9runtime.sysMmapVirtualMemory cap
munmap11runtime.sysMunmapVirtualMemory.unmap
brk12runtime.sbrk0VirtualMemory cap
rt_sigaction13runtime.rt_sigactionSignal cap (future)
rt_sigprocmask14runtime.rtsigprocmaskSignal cap (future)
sched_yield24runtime.osyieldsys_yield
mincore27runtime.mincoreVirtualMemory.query
madvise28runtime.madviseFuture VirtualMemory decommit/query semantics, or unmap/remap policy
nanosleep35runtime.usleepTimer cap
setitimer38runtime.setitimerTimer cap
getpid39runtime.getpidProcess info
clone56runtime.cloneThread cap
exit60runtime.exitsys_exit
sigaltstack131runtime.sigaltstackNot needed initially
arch_prctl158runtime.settlsThreadControl.setFsBase
gettid186runtime.gettidThread info
futex202runtime.futexParkSpace compact CAP_OP_PARK / CAP_OP_UNPARK
sched_getaffinity204runtime.sched_getaffinityCPU info
timer_create222runtime.timer_createTimer cap
timer_settime223runtime.timer_settimeTimer cap
timer_delete226runtime.timer_deleteTimer cap
clock_gettime228runtime.nanotime1Timer cap
exit_group231runtime.exitsys_exit
tgkill234runtime.tgkillThread signal (future)
openat257runtime.openNamespace cap
pipe2293runtime.pipe2IPC cap

Go’s TLS Model

Go uses arch_prctl(ARCH_SET_FS, addr) to set the FS segment base for each OS thread. The convention:

  • FS base points to the thread’s m.tls array
  • Goroutine pointer g is stored at -8(FS) (ELF TLS convention)
  • In Go’s ABIInternal, R14 is cached as the g register for performance
  • On signal entry or thread start, g is loaded from TLS into R14

Go does NOT use the compiler’s TLS mechanisms (no __thread or thread_local!). It manages TLS entirely in its own runtime via the FS register.

For capOS, this means the kernel needs:

  1. arch_prctl(ARCH_SET_FS) equivalent capability method
  2. The kernel must save/restore FS base on context switch
  3. Each thread’s FS base must be independently settable

Adding GOOS=capos to Go

Files that need to be created/modified in a Go fork:

src/runtime/
    os_capos.go           // osinit, newosproc, futexsleep, etc.
    os_capos_amd64.go     // arch-specific OS functions
    sys_capos_amd64.s     // syscall wrappers in assembly
    mem_capos.go          // sysAlloc/sysFree/etc. over VirtualMemory cap
    signal_capos.go       // signal stubs (no real signals initially)
    stubs_capos.go        // misc stubs
    netpoll_capos.go      // network poller (stub initially)
    defs_capos.go         // OS-level constants
    vdso_capos.go         // VDSO stubs (no VDSO)

src/syscall/
    syscall_capos.go      // Go's syscall package
    zsyscall_capos_amd64.go

src/internal/platform/
    (modifications to supported.go, zosarch.go)

src/cmd/dist/
    (modifications to add capOS to known OS list)

Estimated: ~2000-3000 lines for Phase 1 (single-threaded).

Feasibility Assessment

FeatureDifficultyBlocked On
Hello World (write + exit)EasyConsole capability plus exit syscall
Memory allocator (mmap)MediumVirtualMemory capability exists; Go glue and any missing query/decommit semantics remain
Single-threaded goroutines (M=1)MediumVirtualMemory and Timer capabilities exist; Go runtime glue remains
Multi-threaded (real threads)Hardcapos-rt thread/park clients, Go newosproc and futexsleep/futexwake glue, per-ThreadRef TLS ownership, GC/runtime coordination
Network pollerHardAsync cap invocation, networking stack
Signal-based preemptionHardSignal delivery mechanism
Full stdlibVery HardPOSIX layer or native cap wrappers

7. Relevance to capOS

Practical Scope of Work

Phase 1: Custom Target JSON (done)

What: A targets/x86_64-unknown-capos.json target spec is checked into the repo. All userspace crates (init, demos, shell, capos-rt, libcapos, libcapos-posix, capos-wasm) build against it via Cargo aliases in .cargo/config.toml. The kernel stays on x86_64-unknown-none.

Why: Enables cfg(target_os = "capos"), sets code-model = "small" and tls-model = "local-exec" explicitly, and removes the dependency on per-crate rustflag overrides.

Recurring maintenance: Rust target JSON fields are not stable; validate the checked-in file against rustc -Z unstable-options --print target-spec-json-schema when upgrading the pinned nightly.

Phase 2: TLS Support (mostly landed, required for Go)

What: Parse PT_TLS from ELF, allocate per-thread TLS blocks, set FS base on context switch, add arch_prctl-equivalent syscall.

Why: Required for Go runtime (Go’s settls() sets FS base), for Rust #[thread_local] in userspace, and for C’s __thread.

Current state: PT_TLS parsing, static TLS mapping, FS-base context-switch state, runtime-controlled current FS-base updates, and Rust #[thread_local] smokes are implemented. Process-local thread lifecycle also exists. Remaining work is allocating and owning distinct TLS blocks and FS-base state per ThreadRef for Go’s multi-thread runtime path.

Blockers: per-ThreadRef TLS ownership rules and Go newosproc integration for the multi-threaded case.

Phase 3: VirtualMemory Capability (implemented baseline, required for Go)

What: Implement the VirtualMemory capability interface. The current schema has map, unmap, and protect; Go may need decommit/query semantics later.

Why: Go’s memory allocator (sysAlloc, sysReserve, sysMap, etc.) needs mmap-like functionality. This is the single biggest kernel-side requirement for Go.

Current state: VirtualMemoryCap implements map/unmap/protect over the existing page-table code with ownership tracking and quota checks. Go-specific work still has to map runtime sysAlloc/sysReserve/sysMap expectations onto that interface.

Blockers: None for the baseline capability. Useful Go still needs runtime glue for VirtualMemory/Timer, capos-rt park clients, Go futex glue, Go thread integration, and address-space generation cleanup for reusable private park words outside the landed explicit unmap/decommit paths.

Phase 4: ParkSpace Go Futex Glue (Low-medium effort, required for Go threading)

What: map Go’s futex(WAIT) and futex(WAKE) runtime hooks onto the implemented ParkSpace compact wait/wake operations.

Why: Go’s runtime synchronization (lock_futex.go) is built on futexes. The entire goroutine scheduler depends on futex-based sleeping.

Effort: the compact park ABI already exists as CAP_OP_PARK and CAP_OP_UNPARK; Go futex glue should target that ParkSpace contract instead of inventing a parallel wait namespace.

Private futex authority and keying rules: use ParkSpace as the normative design. Private futex keys are generation-bearing address-space keys:

#![allow(unused)]
fn main() {
ParkKey::Private {
    address_space_id,
    address_space_generation,
    uaddr,
}
}
  • WAIT validates that the address is mapped readable in the caller’s current address space and that the expected value still matches under the same page-table stability rules used for process-buffer validation.
  • The value check and waiter insertion are one atomic kernel operation with respect to WAKE, unmap, process exit, and address-space teardown.
  • WAKE for a private futex can only wake waiters with the same address_space_id and address_space_generation; a raw virtual address is never a cross-process sync key.
  • Unmap, revoke, or address-space teardown drains or fails waiters for the old key before the virtual address can be reused as unrelated state.
  • A future shared-futex design must use ParkKey::Shared with memory_object_id, memory_object_generation, and aligned object offset, not raw user virtual address.

The authority boundary stays the caller’s ParkSpace capability for private parks and a future SharedParkSpace for MemoryObject-derived shared parks. Do not introduce a global futex namespace or a generation-less duplicate key shape.

Blockers: capos-rt park clients, Go futexsleep/futexwake glue, and full multi-thread runtime integration.

Phase 5: Go Thread Runtime Integration (High effort, required for Go GOMAXPROCS>1)

What: connect Go’s newosproc, TLS ownership, futex glue, and GC coordination to the implemented process-local thread lifecycle and private ParkSpace wait/wake substrate.

Why: Go’s newosproc() creates OS threads via clone(). Without real threads, Go is limited to GOMAXPROCS=1.

Effort: still high, but the kernel substrate is no longer a blank scheduler extension. The remaining work is capos-rt clients, Go runtime glue, per-ThreadRef TLS ownership, and validation under Go’s scheduler.

Blockers: capos-rt thread and park clients, newosproc glue, futexsleep/futexwake glue, per-ThreadRef TLS ownership rules, GC coordination across kernel threads, address-space generation cleanup for reusable private park-word memory outside explicit unmap/decommit paths, and shared park words for future cross-process futexes. Per-CPU data and SMP are later blockers for multi-core scaling, not for the first single-CPU Go thread integration.

Biggest Blockers for Go

In priority order after the 2026-04-24 TLS, VirtualMemory, Timer, ThreadControl, single-thread runtime-checkpoint, process-local thread lifecycle, and private ParkSpace work:

  1. Go park/futex glue – Go’s M:N scheduler depends on futex-shaped sleeping/waking. The kernel has private ParkSpace wait/wake; the Go port still needs capos-rt clients and futexsleep/futexwake integration.

  2. Go thread integration – Required for GOMAXPROCS > 1. The kernel has process-local thread lifecycle; the Go port still needs newosproc, per-ThreadRef TLS ownership, and GC coordination across those threads.

  3. Go runtime port glue – the capOS capability side now has a single-thread checkpoint for VirtualMemory and Timer, but a real Go fork still needs to map sysAlloc/write1/exit/random/env/time to capOS runtime and capabilities.

Biggest Blockers for C

C is much simpler than Go:

  1. Linker and toolchain setup – Need a cross-compilation toolchain targeting capOS (Clang with the custom target, or GCC cross-compiler).
  2. libcapos.a with C headers – Rust library with extern "C" API.
  3. musl integration (optional) – For full libc, replace musl’s __syscall() with capability invocations.
1. Custom userspace target JSON          [done: targets/x86_64-unknown-capos.json]
     |
2. VirtualMemory capability              [done: baseline map/unmap/protect]
     |
3. TLS support (PT_TLS, FS base)         [done: static ELF + ThreadControl]
     |
4. ParkSpace compact wait/wake           [done: private path; clients open]
     |
5. Timer capability (monotonic clock)    [done: monotonic now/sleep]
     |
6. Go Phase 1: minimal GOOS=capos       [checkpoint done; Go fork remains]
     |
7. Kernel threading for Go runtime       [partial thread lifecycle; Go integration open]
     |
8. Go Phase 2: multi-threaded           [GOMAXPROCS>1, concurrent GC]
     |
9. C toolchain + libcapos               [parallel with Go work]
     |
10. Go Phase 3: network poller          [depends on networking stack]

Steps 1-5 are kernel prerequisites. Step 6 is the Go fork. Steps 7-10 are incremental improvements that can proceed in parallel.

Key Architectural Decisions for capOS

  1. Keep x86_64-unknown-none for kernel, x86_64-unknown-capos for userspace. The kernel does not benefit from a custom OS target (it’s freestanding). Userspace benefits from cfg(target_os = "capos").

  2. Use local-exec TLS model for static binaries. No dynamic linker means no general-dynamic or initial-exec TLS. local-exec is zero-overhead.

  3. Implement FS base save/restore early. Both Go and Rust #[thread_local] need it. It’s a small addition to context switch code.

  4. VirtualMemory cap stays on the Go critical path. The baseline exists; the Go port still needs exact runtime allocator semantics and any missing query/decommit behavior.

  5. Futex is the synchronization primitive. Both Go and any future pthreads implementation need futex-shaped wait/wake. The capOS authority surface is ParkSpace, using compact CAP_OP_PARK / CAP_OP_UNPARK transport rather than generic Cap’n Proto method dispatch on the hot path.

  6. Signals can be deferred. Go can start with cooperative-only preemption (no SIGURG). Signal delivery is complex and can come much later.

Used By

Research: Linux Sandboxes And Virtualization For Workloads

capOS needs a credible way to run Linux-native software before every useful application, language runtime, package manager, development workflow, and desktop or server tool has a native capOS port. Users may want a familiar Linux environment. Agents may need a bounded place to run build systems, interpreters, package managers, browsers, command-line tools, scientific software, or model-generated code. Operators may need a compatibility bridge while capOS-native services are still emerging.

This note separates the available Linux isolation choices and records how they should map to generic capOS capability services. Scientific tooling is one important consumer of this substrate, but the substrate itself should be a general Linux workload sandbox.

The important distinction is between compatibility wrappers and isolation boundaries. Namespaces, cgroups, seccomp, Landlock, User-Mode Linux, containers, gVisor, and KVM microVMs all run “Linux things”, but they do not provide the same boundary, timing behavior, device model, or operational cost.

Source Baseline

External sources checked:

Local grounding:

Isolation Layers

Namespaces, cgroups, seccomp, And Landlock

The basic Linux sandbox stack is:

  • namespaces for separate views of process ids, mounts, users, networks, IPC, UTS names, time, and related global resources;
  • cgroup v2 for resource accounting, placement, and limits;
  • seccomp-BPF for syscall filtering;
  • Landlock for unprivileged filesystem access restriction;
  • rlimits and ordinary Unix credentials for process-local bounds.

This stack is useful for trusted or semi-trusted tools that need quick startup and native Linux performance. It is not a hard boundary against all kernel attack surface: a namespaced process still talks to the host Linux kernel through syscalls, page faults, filesystem code, networking, and device interfaces. For capOS, a namespace/cgroup/seccomp/Landlock sandbox is a good early backend for trusted batch tools, shell commands, build steps, formatters, language package commands, and scientific-base tools such as PARI/GP, Z3, cvc5, HiGHS, or Lean when the tools and inputs are trusted by the same operator.

The capOS wrapper should generate the sandbox policy from capability grants: read-only input directories, a scratch/output directory, optional loopback or egress network, CPU/memory/pids/io quotas, and a syscall profile. The policy is an implementation detail; the capOS-visible object is still a typed command, job, shell, build, solver, proof, CAS, notebook, or application capability.

bubblewrap, nsjail, systemd-nspawn, And OCI Runtimes

bubblewrap is a low-level unprivileged sandboxing tool used by Flatpak-style systems. It is appropriate for single-process or small interactive tools where the desired policy is mostly mount and namespace shaping.

nsjail combines namespaces, cgroups, rlimits, and seccomp-BPF policies with a compact configuration format. It is a strong fit for early batch jobs, command-wrapper services, solver/proof-checker tasks, package commands, and agent tool calls because it already models the same inputs capOS cares about: uid/gid, chroot/root, mounts, network mode, time limits, memory limits, cgroups, and syscall policy.

systemd-nspawn is better for booting or debugging a full Linux userspace tree than for narrow per-tool sandboxing. It is useful for stateful development images and package-build roots, but it should not be the default tool executor because its shape encourages broad OS-in-container authority.

OCI runtimes and images are valuable for supply-chain compatibility. capOS should be able to import OCI image metadata and run image contents through a chosen sandbox backend, but it should not treat “OCI container” as a security claim. The security claim depends on the runtime and host policy.

User-Mode Linux

User-Mode Linux is a Linux kernel port that runs as a normal Linux process and talks to the host kernel instead of hardware. It is useful as a compatibility, debugging, and low-privilege Linux-kernel experiment path. It can contain a guest Linux userspace without requiring hardware virtualization.

UML is not the same category as a hardware-backed Linux guest. It does not give the same boundary as KVM/microVM execution because the UML kernel and guest work ultimately run as host Linux processes and depend heavily on the host kernel surface. For capOS Linux workload execution, UML can be a convenient developer backend when /dev/kvm is unavailable, but it should not be the default answer for untrusted multi-tenant sessions, model-generated code, networked tools, or package-build execution.

gVisor

gVisor moves many host-kernel-facing interfaces into a per-sandbox application kernel and exposes an OCI runtime, runsc. This is an attractive middle tier: it keeps container-like resource behavior and tooling while reducing direct host kernel exposure for many syscalls.

The tradeoff is compatibility and performance. General Linux workloads can exercise native runtimes, dynamic loaders, filesystems, signals, threading, shared memory, networking, debuggers, browser sandboxes, package managers, and sometimes GPU/device paths. gVisor should be treated as a backend to test per workload class, not assumed compatible with every developer tool, package manager, browser, desktop app, scientific stack, proof assistant, or solver.

Hardware-Backed Linux Guests

For stronger isolation, use a Linux guest under hardware virtualization: QEMU/KVM, Firecracker, Cloud Hypervisor, or Kata Containers.

QEMU/KVM is the broadest compatibility target. It can run a full Linux guest with familiar device models, disks, networking, and debugging hooks. It is the right default for compatibility breadth, reproducibility, and complex package systems that expect a normal Linux distribution.

Firecracker is a narrow microVM monitor designed for serverless-style workloads. Its reduced device model and operational focus are attractive for batch jobs, command execution workers, stateless build/test workers, solver workers, and proof-check workers where the rootfs, network, block devices, and API surface can be kept small.

Kata Containers runs container workloads inside lightweight VMs and integrates with container orchestration. It is a good reference for mapping container workload semantics onto VM isolation. capOS does not need to import the full Kubernetes/Kata stack, but the pod-as-VM-sandbox idea maps well to an LinuxWorkloadVm, AgentJobVm, or other specialized Linux workload service.

Hardware-backed Linux guests should be the default for:

  • untrusted interactive Linux shells or familiar Linux workspaces;
  • untrusted notebook execution;
  • model-generated code that may exploit native extensions;
  • package builds from untrusted recipes;
  • network-enabled data processing;
  • multi-tenant hosted agent jobs;
  • browser, GUI, or desktop-like Linux application sessions;
  • workflows that need a full Linux distribution but should not share the host kernel attack surface.

Dedicated Host Isolation

VM and microVM boundaries reduce direct host-kernel sharing, but they do not remove every shared-hardware or operator-domain risk. Dedicated hosts, single-tenant nodes, or separately owned external hardware are appropriate when the workload has unusually high tenant risk, handles sensitive data, requires GPU or device passthrough, runs long-lived browser/GUI sessions with large attack surface, or must limit the blast radius of a VMM, firmware, driver, or VM-escape failure.

Dedicated hardware should be modeled as a deployment and tenancy property, not as a different Linux API. A QemuKvmVm or FirecrackerMicroVm running on a single-tenant host still exposes the same guest workload interface, but its security and scheduler evidence should record that the host was not shared with unrelated tenants. Conversely, a hardware-backed guest on a shared host is still a VM boundary, but it is not the strongest isolation class capOS can offer.

Virtualized Workloads And capOS Auto Full-NOHZ

For capOS scheduling design, Linux sandboxes are modeled as host-visible workloads when making native Tickless and Realtime Scheduling decisions. VMs, microVMs, UML processes, gVisor sandboxes, external sidecars, and VMM helper threads affect capOS through the host-visible set of runnable work, timers, IRQs, polling loops, and housekeeping obligations.

For capOS-native auto full-nohz scheduling:

  • capOS policy applies to the outer capOS-scheduled entity: VMM processes, vCPU threads, I/O helper threads, proxy processes, and native capOS services.
  • Guest Linux scheduler state is opaque. Guest CONFIG_NO_HZ_IDLE, nohz_full, cpuidle, and halt-poll settings may be recorded for diagnostics or benchmark interpretation, but they do not grant capOS CPU-isolation authority.
  • Ordinary Linux sandboxes should run as ordinary scheduled workloads unless the capOS-visible outer backend receives an explicit low-noise placement lease.
  • A sandbox descriptor must not set capOS auto full-nohz, CPU isolation, or exclusive CPU placement by itself. Those are scheduler-authority decisions with global cost.

Idle behavior still needs backend research because it determines whether an “idle” guest is actually idle from the host scheduler’s perspective. Linux CONFIG_NO_HZ_IDLE stops the guest scheduling-clock tick when a guest CPU is idle, which reduces guest-generated timer interrupts and vCPU wakeups. That does not enable capOS host tick suppression by itself. It only helps by making the VMM’s host-visible vCPU thread block more often and wake less often.

KVM prior art shows the boundary clearly. When a guest vCPU halts, the host may block the vCPU thread or poll briefly for a wakeup. Host-side KVM halt polling trades latency for CPU use, and large polling intervals can turn idle guest time into host kernel time. Guest-side halt polling makes the guest vCPU poll before halting and can run even when other host tasks are runnable. A capOS backend intended for low-noise placement therefore needs explicit accounting for VMM/vCPU polling, helper threads, virtio event loops, host timers, and IRQ placement.

The validation target is backend quietness, not Linux nohz integration:

  • idle vCPUs should block or halt instead of forcing periodic outer work;
  • one-shot guest timer deadlines should wake the vCPU correctly without a host periodic tick dependency;
  • VMM helper threads, block/network event loops, and virtio queues should be visible to capOS placement and accounting;
  • halt-polling or busy guest kernel threads should make the outer workload ineligible for low-noise placement rather than silently degrading a capOS scheduler claim;
  • benchmark reports should distinguish guest Linux tickless state from capOS outer scheduler state.

capOS Linux Workload Service Model

The capOS-visible service should hide the backend without hiding the security claim:

LinuxWorkloadSandbox {
  backend: NamespaceSandbox | GVisor | UserModeLinux | QemuKvmVm |
           FirecrackerMicroVm | KataVm | NativeCapos;
  isolationClass: Compatibility | ProcessSandbox | SyscallSandbox |
                  ApplicationKernel | HardwareVm | DedicatedHost;
  deployment: ExternalLinuxHost | CaposScheduledProxy |
              CaposScheduledVmm | DedicatedExternalHost | NativeCapos;
  workloadClass: InteractiveShell | BatchCommand | BuildJob |
                 PackageInstall | BrowserBackend | Notebook |
                 ScientificJob | AgentTool | ServiceDaemon;
  trustClass: SameOperator | UntrustedCode | MultiTenant | FamiliarWorkspace;
  placement: Ordinary | AutoNoHzEligible | CpuIsolationLease;
  packageClosure: PackageClosureId;
  inputCaps: ArtifactId[] | NamespaceGrant[];
  outputCaps: ArtifactSinkId[] | NamespaceGrant[];
  networkPolicy: None | Loopback | BrokeredEgress;
  resourceEnvelope: CpuMemoryIoPidGpuLimits;
  auditPolicy: ProvenanceRequired;
}

The wrapper should record:

  • backend and version;
  • kernel, rootfs, image, and package closure hashes;
  • seccomp/Landlock/cgroup/namespace policy or VM device model;
  • deployment location, distinguishing external Linux-host policy from capOS-scheduled proxy/VMM/native state;
  • CPU affinity, cgroup CPU quota or VM vCPU placement, capOS NoHzEligibility/NoHzActivation state, and outer housekeeping CPU set when the workload is capOS-scheduled;
  • external host CPU/isolation/nohz metadata when the workload runs outside capOS, recorded as host evidence rather than capOS scheduler proof;
  • guest tickless/nohz state when a Linux guest is used, recorded separately from the capOS outer scheduler state;
  • network and block-device grants;
  • input and output artifact ids;
  • exit status, signal, timeout, OOM, or backend failure.

Recommendation

Use a tiered sidecar strategy:

  1. Namespace sandbox tier. Use nsjail or bubblewrap for trusted commands, package steps, build/test tools, and scientific-base batch tools, with cgroup v2 quotas, seccomp, Landlock where available, read-only inputs, and immutable output capture.
  2. gVisor tier. Test high-risk but container-compatible Linux workloads where syscall mediation is useful and full VM overhead is not justified.
  3. Hardware VM tier. Use QEMU/KVM for broad compatibility and Firecracker or Kata-style microVMs for repeated batch jobs. This is the default for untrusted familiar Linux workspaces, notebooks, model-generated code, package builds, networked tools, and multi-tenant agent work.
  4. Dedicated host tier. Use single-tenant nodes or separately owned external hosts for high-risk tenants, sensitive data, GPU/device passthrough, long-lived browser/GUI workloads, side-channel-sensitive jobs, and cases where VM escape or VMM compromise must have a smaller blast radius.
  5. UML tier. Keep User-Mode Linux as a developer/debug/compatibility fallback when KVM is unavailable, not as the primary strong-isolation backend.
  6. Native capOS tier. Migrate stable, small, well-understood services into native capOS userspace after the capability interfaces are proven.

The first serious hardware-backed proof should run a Linux guest workload under QEMU/KVM, expose a narrow Cap’n Proto capability proxy to capOS, and execute a mix of familiar Linux commands plus one or two specialized workloads with artifact capture. Good first cases are a shell/build job, a package-manager or compiler invocation, and a scientific batch job such as PARI/GP, Z3/cvc5, HiGHS, or Lean. A later Firecracker proof can optimize startup and attack surface for stateless command, solver, proof-check, and agent-tool workers.

For browser use, this service is only a possible backend behind the BrowserSession capability. It must not expose a parallel browser authority model: origins, profiles, downloads, uploads, automation, and audit still belong to the browser capability surface, even if the actual browser engine runs in a Linux sandbox or hardware-backed Linux guest.

Research: Out-of-Kernel Scheduling

Survey of whether capOS can move CPU scheduler implementation out of the kernel, which parts are normally kept privileged, and which policy has been moved to user-space services or loadable policy modules in prior systems.

Scope

“User-space scheduler” is an overloaded term. The question here is narrower than language/runtime scheduling: can the OS CPU scheduler itself be moved out of the kernel?

This report separates the relevant models:

ModelSchedulesKernel seesExamples
User-controlled kernel schedulingKernel threads / scheduling contextsPrivileged mechanism plus user policy inputsL4 user-level scheduling, seL4 MCS, ARINC partition schedulers on seL4
Dynamic in-kernel policyKernel threadsPolicy loaded from user space but executed in kernelLinux sched_ext, Ekiben, Bossa
Whole-machine core arbitrationCores granted to applications/runtimesKernel threads pinned, parked, or revokedArachne, Shenango, Caladan
In-process M:N runtimeGoroutines, virtual threads, fibers, async tasksA smaller set of OS threadsGo, Java virtual threads, Erlang, Tokio
User-level thread packageUser-level threads or taskletsOne or more kernel execution contextsCapriccio, Argobots
Kernel-assisted two-level runtime schedulingUser threads plus kernel eventsVirtual processors / activationsScheduler activations, Windows UMS

The common boundary in prior systems is: the kernel allocates protected execution resources, handles blocking and preemption, and enforces isolation. User space supplies domain policy: which goroutine, actor, task, request, or coroutine runs next.

Feasibility Assessment

Moving the entire scheduler out of the kernel is not viable in a protected, preemptive system if “scheduler” means the code that runs on timer interrupts, chooses an immediately runnable kernel thread, saves/restores CPU state, changes page tables, updates per-CPU state, and enforces CPU-time isolation. That mechanism is part of the CPU protection boundary.

Moving scheduler policy out of the kernel is viable. A capOS-like kernel can act as a small CPU driver that enforces runnable-state invariants, capability-authorized scheduling contexts, budgets, priorities, CPU affinity, timeout faults, and IPC donation. A privileged user-space scheduler service can own admission control, budgets, priorities, placement, CPU partitioning, and service-specific policy.

The design point supported by the surveyed systems is not “no scheduler in kernel.” It is “minimal kernel dispatch and enforcement, user-space policy.”

Executive Conclusions

  1. The next-thread dispatch path is normally kept in kernel mode. It runs when the current user process may be untrusted, blocked, faulting, or out of budget.
  2. User space can own policy if the kernel exposes scheduling contexts as capability-controlled CPU-time objects. Thread creation and thread handles should follow the same capability-first model.
  3. Consulting a user-space scheduler server on every timer tick adds context switches to the hottest path and creates a bootstrap problem when the scheduler server itself is not runnable.
  4. seL4 MCS is the most directly comparable model: scheduling contexts are explicit objects, budgets are enforced by the kernel, and passive servers can run on caller-donated scheduling contexts.
  5. L4 user-level scheduling experiments show that user-directed scheduling is possible, with reported overhead from 0 to 10 percent compared with a pure in-kernel scheduler for their workload. That is plausible for policy changes, not for every dispatch decision.
  6. seL4 user-mode partition schedulers show the downside: a prototype partitioned scheduler measured substantial overhead because each scheduling event crosses the user/kernel boundary.
  7. sched_ext and Ekiben are useful evidence for pluggable scheduler policy, but they still execute policy in or near the kernel. They do not prove that the dispatch mechanism can be a normal user process.
  8. Whole-machine core arbiters such as Arachne, Shenango, and Caladan support a different split: the kernel still schedules threads, while a privileged control plane grants, revokes, and places cores at coarser granularity.
  9. Direct-switch IPC and scheduling-context donation reduce the priority inversion and dispatch-overhead risks that appear when capability servers are scheduled only by per-process priorities.
  10. Pure M:1 user-level threads are insufficient for capOS as the only threading story. They are fast, but one blocking syscall, page fault wait, or long CPU loop can stall unrelated user threads unless every blocking operation is converted to async form.
  11. M:N runtimes need a small OS contract: capability-created kernel threads, TLS/FS-base state, capability-authorized futex-style wait/wake, monotonic timers, async I/O/event notification, and a way to detect or avoid kernel blocking.
  12. Scheduler activations solved the right conceptual problem but exposed a complicated upcall contract. A capability OS can get most of the benefit with simpler primitives: async capability rings, notification objects, futexes, and explicit thread objects.
  13. Work-stealing with per-worker local queues is the dominant general-purpose runtime design. It gives locality and scale, but it needs explicit fairness guards and I/O polling integration.
  14. SQPOLL-style polling is a scheduling decision. It trades a core for lower submission latency and depends on SMP plus explicit CPU ownership. Full-nohz for that poller should be treated as a CPU-isolation lease with housekeeping and accounting constraints, not as an automatic timer optimization; see NO_HZ, SQPOLL, and realtime scheduling.
  15. A generic language scheduler in the kernel is a separate design from out-of-kernel CPU policy. Go, Rust async, actor runtimes, and POSIX layers need kernel mechanisms that let them implement their own policy.

Privileged Mechanisms

The following responsibilities are mechanism, not policy. Moving them to a normal user process either breaks protection or puts a user/kernel round trip on the critical path:

  • Save and restore CPU register context.
  • Switch page tables / address spaces.
  • Update per-CPU current-thread state, kernel stack, TSS/RSP0, and syscall stack state.
  • Handle timer interrupts and IPIs.
  • Maintain a safe runnable/blocked/exited state machine.
  • Enforce CPU budgets and preempt a thread that exceeds its budget.
  • Choose an emergency runnable thread when the policy owner is dead, blocked, or malicious.
  • Run idle and halt safely when no runnable work exists.
  • Integrate scheduling with blocking syscalls, page faults, futex waits, and IPC wakeups.
  • Preserve invariants under SMP races.

These are exactly the parts currently concentrated in kernel/src/sched.rs and the x86 context-switch path. They can be simplified and made more generic, but they remain required somewhere privileged.

Policy Surface

The following are policy examples that can be owned by a privileged user-space service once scheduling contexts exist:

  • Admission control: which process/thread is allowed to consume CPU time.
  • Priority assignment and dynamic priority changes.
  • Budget/period selection for temporal isolation.
  • CPU affinity and CPU partitioning decisions.
  • Core grants for SQPOLL, device polling, network stacks, and latency-sensitive services.
  • Overload handling policy.
  • Per-service or per-tenant fair-share policy.
  • Instrumentation-driven tuning.
  • Runtime-specific hints, such as “latency-sensitive”, “batch”, “driver”, or “poller”.

This split gives a capOS-like system policy freedom while preserving a small, auditable kernel CPU mechanism.

Viable Architectures

1. Minimal Kernel Scheduler Plus User Policy Service

This is one capOS-compatible design point.

The kernel implements:

  • Thread states and per-CPU run queues.
  • Priority/budget-aware dispatch.
  • Scheduling-context objects.
  • Timer-driven budget accounting.
  • Timeout faults or notifications.
  • Capability-checked operations to bind/unbind scheduling contexts to threads.
  • Emergency fallback policy.

A user-space sched service implements:

  • System policy loaded from the boot manifest.
  • Resource partitioning between services.
  • Priority/budget updates.
  • CPU pinning and SQPOLL grants.
  • Diagnostics and policy reload.

The policy service is invoked on configuration changes and timeout faults, not on every context switch.

2. seL4-MCS-Style Scheduling Contexts

seL4 MCS makes CPU time a first-class kernel object. A thread needs a scheduling context to run. A scheduling context carries budget, period, and priority. The kernel enforces the budget with a sporadic-server model. Passive servers can block without their own scheduling context; callers donate their scheduling context through synchronous IPC, and the context returns on reply.

This maps directly to capOS:

SchedContext {
    budget_ns
    period_ns
    priority
    cpu_mask
    remaining_budget
    timeout_endpoint
}

Kernel responsibilities:

  • Enforce budget and period.
  • Dispatch runnable threads with eligible scheduling contexts.
  • Donate and return contexts across direct-switch IPC.
  • Notify user space on timeout or depletion.

User-space responsibilities:

  • Create and distribute scheduling-context capabilities.
  • Decide budgets and priorities.
  • Build passive service topologies.
  • React to timeout faults.

This moves scheduling policy out without moving the hot dispatch mechanism out.

3. Hierarchical User-Level Scheduler

L4 research evaluated exporting scheduling to user level through a hierarchical user-level scheduling architecture. The reported application overhead was 0 to 10 percent compared with a pure in-kernel scheduler in their evaluation, and the design enabled user-directed scheduling.

This is possible, but the cost model is sensitive:

  • Every policy decision that requires a scheduler-server round trip is expensive.
  • The scheduler server needs guaranteed CPU time, or the system can deadlock.
  • Faults and interrupts still need kernel fallback.
  • SMP multiplies races around run queues, CPU ownership, and migration.

This architecture is viable for coarse-grained partition scheduling, VM scheduling, or policy control. As a first general dispatch path, it has higher latency and bootstrap risk than an in-kernel dispatcher.

4. Dynamic In-Kernel Policy

Linux sched_ext lets user space load BPF scheduler programs, but the policy runs inside the kernel scheduler framework. The kernel preserves integrity by falling back to the fair scheduler if the BPF scheduler errors or stalls runnable tasks. Ekiben similarly targets high-velocity Linux scheduler development with safe Rust policies, live upgrade, and userspace debugging.

This model is a later-stage option for dynamic scheduler experiments, but it is not “scheduler in user space.” It also adds verifier/runtime complexity.

5. Core Arbiter / Resource Manager

Arachne, Shenango, and Caladan move high-level core allocation decisions out of the ordinary kernel scheduler path. Applications or runtimes know which cores they own, while an arbiter grants and revokes cores based on load or interference.

This model is useful for capOS after SMP:

  • grant cores to NIC drivers, network stacks, or SQPOLL workers;
  • revoke poller cores under CPU pressure;
  • isolate latency-sensitive services from batch work;
  • expose CPU ownership through capabilities.

It does not remove the kernel dispatcher. It changes the granularity of policy from “which thread next” to “which service owns this CPU budget.”

Classic Problem: Kernel Threads vs User Threads

The scheduler activations paper is still the cleanest statement of the core problem: kernel threads have integration with blocking and preemption, while user-level threads have cheaper context switching and better policy control. The failure mode of user-level threads layered naively on kernel threads is that kernel events are hidden from the runtime. A kernel thread can block in the kernel while runnable user threads exist, and the kernel can preempt a kernel thread without telling the runtime which user thread was stopped.

Scheduler activations address this by giving each address space a “virtual multiprocessor.” The kernel allocates processors to address spaces and vectors events to the user scheduler when processors are added, preempted, blocked, or unblocked. The activation is both an execution context and a notification vehicle.

The lesson for capOS is not to copy the full activation API. The durable idea is the split:

  • Kernel owns physical CPU allocation, protection, preemption, and blocking.
  • Runtime owns which application-level work item runs on a granted execution context.
  • Kernel-visible blocking must create a runtime-visible event, or it must be avoided by making the operation async.

For capOS, async capability rings already avoid many blocking syscalls. The remaining hard cases are futex waits, page faults that require I/O, synchronous IPC, and preemption of long-running runtime tasks.

Runtime Schedulers in Practice

Go

Go uses an M:N scheduler with three central concepts:

  • G: goroutine.
  • M: worker thread.
  • P: processor token required to execute Go code.

The Go runtime distributes runnable goroutines over worker threads, keeps per-P queues for scalability, uses global queues and netpoller integration for fairness and I/O, and parks/unparks OS threads conservatively to avoid wasting CPU. Its own source comments call out why centralized state and direct handoff were rejected: centralization hurts scalability, while eager handoff hurts locality and causes thread churn.

Preemption is mixed. Go has synchronous safe points and asynchronous preemption using OS mechanisms such as signals. The runtime can only safely stop a goroutine at points where stack and register state can be scanned.

Implications for capOS:

  • Initial GOOS=capos can run with GOMAXPROCS=1 and cooperative preemption, but useful Go requires kernel threads, futexes, FS-base/TLS, a monotonic timer, and an async network poller.
  • A signal clone is not strictly required if capOS provides a runtime-visible timer/preemption notification and the Go port accepts cooperative-first behavior.
  • The kernel must schedule threads, not processes, before Go can use multiple cores.

Java Virtual Threads

JDK virtual threads use M:N scheduling: many virtual threads are mounted on a smaller number of platform threads. The default scheduler is a FIFO-mode work-stealing ForkJoinPool; the platform thread currently carrying a virtual thread is called its carrier.

The design is intentionally not pure cooperative scheduling from the application’s perspective: most JDK blocking operations unmount the virtual thread, freeing the carrier. But some operations pin the virtual thread to the carrier, notably native calls and some synchronized regions. The JEP also notes that the scheduler does not currently implement CPU time-sharing for virtual threads.

Implications for capOS:

  • “Blocking” compatibility requires library/runtime cooperation, not just a scheduler. The runtime needs blocking operations to yield carriers.
  • Native calls and pinned regions remain a general M:N hazard. capOS cannot make that disappear in the kernel.

Tokio and Rust Async Executors

Tokio represents the async executor model rather than stackful green threads. Tasks run until they return Poll::Pending, so fairness depends on cooperative yield points and wakeups. Tokio’s multi-thread scheduler uses one global queue, per-worker local queues, work stealing, an event interval for I/O/timer checks, and a LIFO slot optimization for locality.

Implications for capOS:

  • A capos-rt async executor can integrate capability-ring completions, notification objects, and timers as wake sources.
  • A cooperative budget is mandatory. A future that never awaits can monopolize a worker until kernel preemption takes the whole OS thread away.
  • A single global CQ per process can become an executor bottleneck if many worker threads consume completions. Per-thread or sharded wake queues are likely needed after SMP.

Erlang/BEAM

BEAM schedulers run lightweight Erlang processes on scheduler threads. The runtime exposes scheduler count and binding controls, and Erlang processes are preempted by reductions rather than OS timer slices. This shows a different point in the design space: the language VM owns fairness because it controls execution of bytecode.

Implications for capOS:

  • Managed runtimes can implement stronger fairness than native async libraries because they control instruction dispatch or compiler-inserted safe points.
  • Native Rust/C userspace cannot rely on that unless the compiler/runtime inserts yield or safe-point checks.

Capriccio and Argobots

Capriccio showed that a user-level thread package can scale to very high concurrency by combining cooperative user-level threads, asynchronous I/O, O(1) thread operations, linked stacks, and resource-aware scheduling. The important lesson is that the thread abstraction can survive high concurrency when the runtime controls stacks and blocking.

Argobots generalizes lightweight execution units into user-level threads and tasklets over execution streams. It is designed as a substrate for higher-level systems such as OpenMP and MPI, with customizable schedulers. This is directly relevant to capOS because it argues for low-level runtime mechanisms, not one global scheduling policy.

Lithe

Lithe targets composition of parallel libraries. Its thesis is that a universal task abstraction or one global scheduler does not compose well when multiple parallel libraries are nested. Instead, physical hardware threads are shared through an explicit resource interface, while each library keeps its own task representation and scheduling policy.

Implications for capOS:

  • Avoid oversubscription by making CPU grants visible to user space.
  • A future CpuSet or scheduling-context capability could let runtimes know how much parallelism they are actually allowed to use.
  • Nested runtimes benefit from the ability to donate or yield execution resources without going through a process-global policy singleton.

Kernel Interfaces That Matter

Futexes

Futexes are the standard split-lock design: user space does the uncontended fast path with atomics, and the kernel only participates to sleep or wake threads. Linux also has priority-inheritance futex operations for cases where the kernel must manage lock-owner priority propagation.

For capOS:

  • Implement futex as a capability-authorized primitive. Do not assume generic Cap’n Proto method encoding is acceptable for the hot path; measure it against a compact operation before fixing the ABI.
  • Key futex wait queues by (address_space, user_virtual_address) for private futexes. Shared-memory futexes eventually need a memory-object identity plus offset.
  • Support timeout against monotonic time first. Requeue and PI futexes can wait.

Restartable Sequences

Linux rseq lets user space maintain per-CPU data without heavyweight atomics and lets a thread cheaply read its current CPU/node. The current kernel docs also describe scheduler time-slice extensions for short critical sections.

For capOS:

  • rseq-style current-CPU access becomes useful after SMP and per-CPU run queues.
  • It is not a first threading prerequisite. Futex, TLS, and kernel threads come first.
  • If added, expose a small per-thread ABI page with cpu_id, node_id, and an abort-on-migration critical-section protocol.

io_uring SQPOLL

SQPOLL moves submission from syscall-driven to polling-driven. A kernel thread polls the submission queue and submits work as soon as userspace publishes SQEs. This reduces submission latency and syscall overhead for sustained I/O, but it burns CPU and needs careful affinity.

capOS already has an io_uring-inspired capability ring, so the analogy is direct:

  • Current tick-driven ring processing is correct for a toy system but couples invocation latency to timer frequency.
  • A kernel-side SQ polling thread interacts badly with single-CPU systems. On a single CPU it competes with the application it is supposed to accelerate.
  • Make SQPOLL a scheduling/capability decision: the process donates or is granted a CPU budget for the poller.
  • Completion handling remains a separate problem. A runtime still needs to poll CQs or block on notifications.

sched_ext

Linux sched_ext is not a normal user-level thread scheduler. It is a scheduler class whose behavior is defined by BPF programs loaded from user space. The kernel docs emphasize that sched_ext can be enabled and disabled dynamically, can group CPUs freely, and falls back to the default scheduler if the BPF scheduler misbehaves. The docs also warn that the scheduler API has no stability guarantee.

For capOS:

  • The relevant idea is safe, dynamically replaceable policy with kernel integrity fallback.
  • Copying the BPF ABI is not required. capOS can get a smaller version through privileged scheduler-policy capabilities later.
  • Keep early scheduling policy in kernel Rust until the invariants are clear.

Whole-Machine User-Space/Core Schedulers

Arachne

Arachne is a user-level thread system for very short-lived threads. It is core-aware: applications know which cores they own and control placement of work on those cores. A central arbiter reallocates cores among applications. The published results report strong memcached and RAMCloud improvements, and the implementation requires no Linux kernel modifications.

Takeaway: user-level scheduling gets much better when the runtime has explicit core ownership. Blindly creating more kernel threads and hoping the OS scheduler does the right thing is a weaker contract.

Shenango

Shenango targets datacenter services with microsecond-scale tail-latency goals. It uses kernel-bypass networking and an IOKernel on a dedicated core to steer packets and reallocate cores across applications every 5 microseconds. The key policy is rapid core reallocation based on whether queued work is waiting long enough to imply congestion.

Takeaway: a dedicated scheduling/control core can be worthwhile when latency SLOs are tighter than normal kernel scheduling reaction times. It is expensive and only justified for sustained latency-sensitive workloads.

Caladan

Caladan extends the idea from load to interference. It uses a centralized scheduler core and kernel module to monitor and react to memory hierarchy and hyperthread interference at microsecond scale. Its main claim is that static partitioning of cores, caches, and memory bandwidth is neither necessary nor sufficient for rapidly changing workloads.

Takeaway: CPU scheduling is not only “which runnable thread next.” On modern machines it is also placement relative to caches, sibling SMT threads, memory bandwidth, and bursty workload phase changes.

Design Axes

AxisOptionsPractical conclusion
Stack modelStackless tasks, segmented/growing stacks, fixed stacksRust async uses stackless futures; Go/Java need runtime-managed stacks; POSIX threads need fixed or growable user stacks
PreemptionCooperative, safe-point, signal/upcall, timer-forced OS preemptionKernel preemption alone protects the system; runtime fairness needs safe points or cooperative budgets
BlockingConvert all operations to async, add carriers, kernel upcallsAsync caps reduce blocking; Go/POSIX still need kernel threads and futexes
QueueingGlobal queue, per-worker queues, work stealing, priority queuesPer-worker queues plus stealing are the default; add global fairness escape hatches
CPU ownershipInvisible OS scheduling, affinity hints, explicit CPU grantsExplicit grants matter for high-performance runtimes and SQPOLL
Cross-process callsQueue through scheduler, direct switch, scheduling donationDirect switch and scheduling-context donation reduce sync IPC overhead and inversion
IsolationBest-effort fairness, priorities, budget/period contextsCloud-oriented capOS eventually needs budget/period scheduling contexts

capOS Design Options

Option: Minimal Kernel Mechanism Plus User Policy

This option keeps dispatch and enforcement in the kernel, replaces the current round-robin process scheduler with a minimal kernel CPU mechanism, and moves policy to user space through scheduling-context capabilities.

The kernel side covers:

  • dispatching the next runnable thread on each CPU;
  • enforcing budget/period/priority invariants;
  • handling interrupts, blocking, wakeups, and exits;
  • direct-switch IPC and scheduling-context donation;
  • an emergency fallback policy.

The user-space scheduler service covers:

  • policy configuration from the manifest;
  • per-service budgets, periods, priorities, and CPU masks;
  • admission control for new processes and threads;
  • SQPOLL/core grants;
  • response to timeout faults and overload telemetry.

This gives a capOS-like system the exokernel/microkernel benefit of policy freedom without putting a user-space server on the context-switch fast path.

Possible Implementation Sequence

  1. Thread scheduler in kernel. Convert from process scheduling to thread scheduling, with per-thread kernel stack, saved registers, FS base, and shared process address space/cap table.
  2. Scheduling contexts. Add kernel objects that carry budget, period, priority, CPU mask, and timeout endpoint. Initially assign one default context per thread.
  3. ThreadSpawner and ThreadHandle capabilities. Expose thread creation and lifecycle through capabilities from the start. Bootstrap grants init the initial authority; init or a scheduler service delegates it under quota.
  4. Scheduling-context donation for IPC. Baseline direct-switch IPC handoff exists for blocked Endpoint receivers; add budget/priority donation and return once scheduling contexts exist.
  5. User-space policy service. Let init or a sched service create and update scheduling contexts via capabilities.
  6. SMP core ownership. After per-CPU run queues and TLB shootdown exist, allow the scheduler service to manage CPU masks and SQPOLL/poller grants.
  7. Optional dynamic policy. Much later, consider sched_ext-like policy modules if Rust/verifier infrastructure exists. This is not a prerequisite.

Minimal Kernel API Sketch

interface SchedulerControl {
    createContext @0 (budgetNs :UInt64, periodNs :UInt64, priority :UInt16)
        -> (context :SchedulingContext);
    setCpuMask @1 (context :SchedulingContext, mask :Data) -> ();
    bind @2 (thread :ThreadHandle, context :SchedulingContext) -> ();
    unbind @3 (thread :ThreadHandle) -> ();
    setTimeoutEndpoint @4 (context :SchedulingContext, endpoint :Endpoint) -> ();
    stats @5 (context :SchedulingContext) -> (consumedNs :UInt64, throttled :Bool);
}

interface SchedulingContext {
    yieldTo @0 (thread :ThreadHandle) -> ();
    consumed @1 () -> (consumedNs :UInt64);
}

interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        context :SchedulingContext,
        flags :UInt64
    ) -> (thread :ThreadHandle);
}

interface ThreadHandle {
    join @0 (timeoutNs :UInt64) -> (status :Int64);
    exitCode @1 () -> (exited :Bool, status :Int64);
    bind @2 (context :SchedulingContext) -> ();
}

The hot path does not invoke these methods; they are control-plane operations.

Dependency: In-Process Threading

Kernel threads inside a process are a dependency for sophisticated user-level thread support:

  • Thread object with saved registers, per-thread kernel stack, user stack pointer, FS base, state, and parent process reference.
  • Scheduler runs threads, not processes.
  • Process owns address space and cap table; threads share both.
  • Process context switch saves/restores FS base today; thread scheduling must make that state per-thread.
  • Thread creation is exposed first as a ThreadSpawner capability; bootstrap grants initial authority to init, and later policy delegates it through the capability graph.
  • Thread exit reclaims the thread stack and wakes joiners if join exists.

This directly unblocks Go phase 2, POSIX pthread compatibility, native thread-local storage, and any multi-worker Rust async executor.

Dependency: Park (Linux futex analogue) and Timer

A minimal capability-authorized park primitive has this shape:

park(park_space, uaddr, expected, timeout_ns) -> Result
unpark(park_space, uaddr, max_count) -> usize

Required semantics:

  • park checks that *uaddr == expected while holding the park wait-lock equivalent, then blocks the current thread.
  • unpark makes up to max_count waiters runnable.
  • Timeouts use monotonic ticks or a timer wheel/min-heap.
  • Return values must distinguish woken, timed out, interrupted, and value mismatch.

The authority should be capability-based from the start, for example through a ParkSpace, SharedParkSpace, or memory-object-derived capability. Pre-thread measurement with the benchmark-only ParkBench cap favors a compact capability-authorized operation over generic Cap’n Proto methods for failed wait and empty wake. The blocked/resume path still needs measurement after threads exist because the primitive sits on the runtime parking path.

Measure this before fixing the ABI:

  • CAP_OP_NOP: ring validation plus CQE post, with no cap lookup or capnp.
  • Empty and small NullCap calls through normal cap lookup, method dispatch, capnp param decode, and capnp result encode.
  • Futex-shaped compact operation carrying cap_id, uaddr, expected, and timeout/max_count, initially returning without blocking.
  • Generic ParkBench.wait / ParkBench.wake Cap’n Proto methods for the same pre-thread failed-wait and empty-wake cases.
  • Later, real blocking paths: failed wait, wake with no waiters, wait-to-block, wake-to-runnable, and wake-to-resume.

The useful decision is not “capability or syscall”; it is “generic capnp method or compact capability-authorized scheduler primitive.” Authority remains in the capability model either way.

Near Term: Runtime Event Integration

For capos-rt, design the executor around kernel completion sources:

  • Capability-ring CQ entries wake tasks waiting on cap invocations.
  • Notification objects wake tasks waiting on interrupts, timers, or service events.
  • Futex wakes resume parked worker threads.
  • Timers can be integrated as wakeups instead of periodic polling.

The executor policy can start simple:

  • One worker per kernel thread.
  • Local FIFO queue per worker.
  • One global injection queue.
  • Work stealing when local and global queues are empty.
  • Cooperative operation budget, then requeue.

Stage 6: IPC Scheduling

For synchronous IPC, direct switch has been introduced before priority scheduling:

  • If client A calls server B and B is blocked in receive, switch A -> B directly without picking an unrelated runnable thread. This is implemented for the current single-CPU Endpoint path.
  • Mark A blocked on reply.
  • Future fastpath work can transfer a small message inline; use shared buffers for large data.

Scheduling-context donation then adds the budget/priority transfer:

  • The server runs the request using the caller’s scheduling context.
  • The caller’s budget covers client + server work.
  • Passive servers can exist without independent CPU budget and only run when a caller donates one.

This avoids priority inversion through the capability graph and matches the service architecture better than per-process priorities alone.

Stage 7: SMP and Core Ownership

Once per-CPU scheduler queues exist, these become policy surfaces:

  • CPU affinity depends on correct migration and TLB shootdown.
  • A CpuSet or SchedulingContext capability can describe allowed CPUs, budget, period, and priority.
  • Cheap current-CPU exposure depends on a stable per-thread ABI page.
  • SQPOLL can be gated on available CPU budget to avoid unlimited poller creation.

Risks and Failure Modes

  • M:1 green threads do not provide Go or POSIX compatibility by themselves.
  • A normal user-space process choosing the next thread on every timer tick puts a context-switch round trip on the hot path.
  • Recovery from scheduler-service failure cannot depend solely on the scheduler service being runnable.
  • A Go-like G/M/P scheduler in the kernel couples language runtime policy to the kernel.
  • Generic Cap’n Proto capability calls may be too heavy for every synchronization primitive. Measure generic calls against compact capability-authorized operations before fixing the futex ABI.
  • sched_ext-like dynamic policy loading depends on mature scheduler invariants and verifier/runtime machinery.
  • SQPOLL on a single-core system can compete with the application it is meant to accelerate.

Open Questions

  1. Does capOS need scheduler-activation-style upcalls? Async caps and notification objects cover many of the same cases with less machinery.
  2. How can runtime preemption work without Unix signals? Options are cooperative-only, timer notification to a runtime handler, or a kernel forced safe-point ABI. Cooperative-only is one first-support option for Go.
  3. How are shared-memory futex keys represented? Private futexes can key on address space and virtual address. Shared futexes need memory-object identity and offset.
  4. How large is the blocked/resume overhead once threads exist? The pre-thread failed-wait and empty-wake measurement already favors compact operations, but 4.5.5 still needs the contended path before freezing the final ABI.
  5. How much policy belongs in the boot manifest versus a long-running sched service? Static embedded systems can use manifest policy. Cloud or developer systems need runtime policy updates.
  6. What is the emergency fallback if the scheduler service exits? Options are a tiny kernel round-robin fallback for privileged recovery threads, a pinned immortal scheduler thread, or panic. The first is the only robust development choice.

Source Notes

  • Anderson et al., “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (SOSP 1991): https://polaris.imag.fr/vincent.danjean/papers/anderson.pdf
  • “Towards Effective User-Controlled Scheduling for Microkernel-Based Systems” (L4 user-level scheduling): https://os.itec.kit.edu/21_738.php
  • Asberg and Nolte, “Towards a User-Mode Approach to Partitioned Scheduling in the seL4 Microkernel”: https://www.es.mdh.se/pdf_publications/2641.pdf
  • Kang et al., “A User-Mode Scheduling Mechanism for ARINC653 Partitioning in seL4”: https://link.springer.com/chapter/10.1007/978-981-10-3770-2_10
  • L4Re overview: https://l4re.org/doc/l4re_intro.html
  • Liedtke, “On micro-kernel construction”: https://elf.cs.pub.ro/soa/res/lectures/papers/lietdke-1.pdf
  • seL4 MCS tutorial: https://docs.sel4.systems/Tutorials/mcs.html
  • seL4 design principles: https://microkerneldude.org/2020/03/11/sel4-design-principles/
  • Linux kernel sched_ext documentation: https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
  • Arun et al., “Agile Development of Linux Schedulers with Ekiben”: https://arxiv.org/abs/2306.15076
  • Williams, “An Implementation of Scheduler Activations on the NetBSD Operating System” (USENIX 2002): https://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html
  • Microsoft, “User-Mode Scheduling”: https://learn.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling
  • Go runtime scheduler source: https://go.dev/src/runtime/proc.go
  • Go preemption source: https://go.dev/src/runtime/preempt.go
  • OpenJDK JEP 444, “Virtual Threads”: https://openjdk.org/jeps/444
  • Tokio runtime scheduling documentation: https://docs.rs/tokio/latest/tokio/runtime/
  • von Behren et al., “Capriccio: Scalable Threads for Internet Services” (SOSP 2003): https://web.stanford.edu/class/archive/cs/cs240/cs240.1046/readings/capriccio-sosp-2003.pdf
  • Argobots paper page: https://www.anl.gov/argonne-scientific-publications/pub/137165
  • Argobots project: https://www.argobots.org/
  • Pan et al., “Lithe: Enabling Efficient Composition of Parallel Libraries” (HotPar 2009): https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/pan/pan_html/
  • Linux futex(2) manual: https://man7.org/linux/man-pages/man2/futex.2.html
  • Linux kernel restartable sequences documentation: https://docs.kernel.org/userspace-api/rseq.html
  • io_uring_sqpoll(7) manual: https://manpages.debian.org/testing/liburing-dev/io_uring_sqpoll.7.en.html
  • Qin et al., “Arachne: Core-Aware Thread Management” (OSDI 2018): https://www.usenix.org/conference/osdi18/presentation/qin
  • Ousterhout et al., “Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads” (NSDI 2019): https://www.usenix.org/conference/nsdi19/presentation/ousterhout
  • Fried et al., “Caladan: Mitigating Interference at Microsecond Timescales” (OSDI 2020): https://www.usenix.org/conference/osdi20/presentation/fried

Completion Rings And Threaded Runtimes

This note grounds the capOS ring/threading roadmap in existing completion I/O and futex designs. The question is not whether a shared CQ can be made to work with many waiting threads; it can. The question is which ownership model keeps the kernel ABI stable once capOS runs multiple process threads on multiple CPUs.

Sources Checked

  • Linux io_uring_enter(2) documents the aggregate wait shape: with IORING_ENTER_GETEVENTS, the syscall waits until min_complete completion events are available.
  • Linux io_uring_setup(2) documents SQPOLL, CQ sizing, and single-issuer-oriented task-run modes.
  • Linux io_uring_register(2) documents registered wait regions.
  • Jens Axboe’s io_uring paper explains the core ring design as a pair of shared rings with single producer/single consumer ownership on each side and user_data copied from request to completion for matching.
  • Linux futex(2) and futex(7) document futexes as a kernel-assisted blocking path for synchronization objects whose uncontended state lives in user memory.
  • Microsoft I/O completion ports document the port model: threads wait on a completion port and dequeue completion packets, rather than each thread waiting directly on one specific operation’s storage slot.

Consequences For capOS

The current process-wide capOS ring matches the early io_uring shape: one SQ, one CQ, and user_data for completion matching. That shape is efficient when userspace serializes submission and completion consumption through one runtime owner. It becomes the wrong primitive for full SMP if multiple kernel-scheduled threads in the same process concurrently enter the kernel, because the ring turns into a multi-producer/multi-consumer coordination problem.

Waiting for a raw CQ slot is not a good abstraction. CQ slots are circular buffer storage and are reused. Stable wait identities are request user_data, kernel answer ids, completion packets, or a completion queue/lane chosen at submission time.

The clean full-SMP target is per-thread completion ownership. Each thread gets its own capability ring endpoint: a complete SQ/CQ pair, even if multiple endpoints are packed into one larger mapping. The existing cap_enter(min_complete, timeout_ns) semantics can then remain aggregate: min_complete counts completions available on the current thread’s CQ. Runtime code still matches individual operations by user_data, but two sibling threads no longer race to consume the same process CQ.

The Windows IOCP model is a useful counterpoint: a shared completion port works when the abstraction is explicitly a packet queue consumed by a worker pool. That is a runtime/service scheduling model, not the same thing as multiple threads blocking on one raw process CQ while each expects a private answer.

Current Implementation State

The kernel dispatches six SQE opcodes today: CALL, RECV, RETURN, RELEASE, CANCEL, and NOP. FINISH is reserved for the future system capnp transport and completes with an unsupported-opcode error. PARK and UNPARK (capability- authorized futex-style thread-park operations) are also dispatched. Only CALL opcodes are gated to syscall context (via call_requires_syscall_dispatch); the other dispatched opcodes, PARK and UNPARK included, are processed in both syscall and timer-interrupt contexts. PARK_BENCH is measurement-only and dispatched only when the kernel is built with the measure feature.

Per-process resource limits are enforced via ResourceProfile, a quota struct carried on each Process and resolved at spawn time. Two fields directly bound the ring’s resource use: ring_scratch_limit_bytes caps the input and output buffer capacity of the per-process ring scratch allocator (narrowing the kernel-side ceilings MAX_PARAMS and MAX_RESULT); in_flight_call_limit and endpoint_queue_limit cap the per-Endpoint in-flight CALL count and the queued (parked) CALL queue depth respectively, each clamped by a kernel structural maximum of 32.

SQPOLL on the per-process ring has landed: a process can hold a kernelSqpoll lease whose bound ring transitions into SQPOLL mode, with the kernel acting as sole SQ consumer for that ring. This is the SQPOLL foundation for the full-SMP per-thread ring target described below, not the target itself. Generic full-nohz for explicitly budgeted compute leases and SQPOLL nohz for explicitly leased caller-thread rings have landed; broader userspace-poller/device-queue issuance remains future work.

  1. Keep the current process ring as the bootstrap and compatibility surface.
  2. Add runtime reactor/demux support as an interim path for multithreaded runtimes that still use one process ring.
  3. Make the full SMP ABI a per-thread ring model:
    • each Thread owns one ring endpoint with a complete SQ/CQ pair;
    • cap_enter operates on the current thread’s ring;
    • SQPOLL, when enabled, is the sole kernel SQ consumer for that ring;
    • result-cap transfers still mutate the process cap table;
    • endpoint, timer, process-wait, thread-join, and futex completions post to the waiting ThreadRef’s ring.
  4. Consider shared completion ports only as a userspace runtime/service abstraction above per-thread rings, not as the kernel’s first full-SMP ring ABI.

References

x2APIC and APIC Virtualization

Research note for the SMP Phase C LAPIC/IPI decision. The goal is to decide how x2APIC should fit after the current LAPIC/IPI implementation work and to record which virtualization facts affect that choice.

Status note (2026-06-06): The x2APIC backend has landed in kernel/src/arch/x86_64/lapic.rs: the BSP checks CPUID.01H:ECX.x2APIC at boot and prefers x2APIC MSR access when available, falling back to xAPIC MMIO. AP initialization follows the BSP-selected mode. The selected-mode QEMU proof is make run-interrupt-grant-x2apic, which forces +x2apic, asserts LapicMode::X2Apic, and reuses the routed Interrupt.wait / Interrupt.acknowledge path. The proof is a bounded QEMU backend-selection proof, not high-core hardware readiness.

Existing Local Research

Before adding this note, docs/research/ contained:

  • capnp-error-handling.md
  • completion-ring-threading.md
  • eros-capros-coyotos.md
  • genode.md
  • ix-on-capos-hosting.md
  • llvm-target.md
  • os-error-handling.md
  • out-of-kernel-scheduling.md
  • pingora.md
  • plan9-inferno.md
  • sel4.md
  • small-llm-survey.md
  • zircon.md

None of those files directly cover APIC/x2APIC or KVM APIC virtualization.

Sources Checked

Local verification:

  • Host command qemu-system-x86_64 --version reported QEMU 8.2.2.
  • Host command qemu-system-x86_64 -cpu help listed x2apic as a recognized CPUID feature.
  • The current capOS LAPIC implementation has both xAPIC MMIO and x2APIC MSR backends. The BSP selects x2APIC when CPUID or firmware state makes it available and otherwise falls back to xAPIC MMIO.
  • make run-interrupt-grant-x2apic uses -cpu qemu64,+smep,+smap,+rdrand,+x2apic, asserts the selected LapicMode::X2Apic backend, and proves the routed interrupt waiter / deferred-EOI acknowledgement path still works in that mode.

x2APIC Findings

x2APIC is still the forward-looking LAPIC backend for later hardware and VM coverage:

  • It avoids mapping the local APIC MMIO page and uses architectural MSRs for local APIC register access.
  • It supports wider APIC IDs than xAPIC’s 8-bit destination model, which keeps the CPU-id/LAPIC-id split introduced by the SMP proposal relevant on larger systems and VMs.
  • Intel’s current public guidance says x2APIC is required above 255 cores, newer Intel client families default to x2APIC, and legacy xAPIC can become unavailable or locked out after firmware or system software enters x2APIC.
  • The local capOS dependency set already has x86_64 MSR access, and the implemented x2APIC backend covers EOI, ICR/IPI, spurious vector, LVT timer, timer initial count, divide config, and current APIC ID without adding another architecture crate.

The implementation shape is:

  1. Keep the xAPIC MMIO LAPIC timer/IPI foundation as the fallback for older hardware and VM configurations that only expose xAPIC.
  2. Select x2APIC when CPUID.01H:ECX.x2APIC is available or when firmware has already enabled/locked x2APIC.
  3. Keep TLB shootdown, timer, EOI, and device-vector paths on the architectural LAPIC interface rather than on KVM paravirtual APIC helpers.
  4. Treat larger-APIC-ID and high-core hardware validation as future hardware evidence; the current selected-mode QEMU proof covers backend selection and the routed waiter/ack path only.

Virtualization Findings

Virtualization is relevant to validation and future performance, not to the guest-visible correctness contract:

  • QEMU/KVM can expose x2APIC through CPU model feature selection. capOS tests should make that explicit by extending the current QEMU model to -cpu qemu64,+smep,+smap,+rdrand,+x2apic, or by using another named CPU model with +x2apic, instead of relying on the host or accelerator default.
  • KVM exposes APIC state through its own API and has x2APIC-specific handling for 32-bit APIC IDs. That matters to the VMM, but a capOS guest should use the architectural x2APIC interface.
  • QEMU/KVM paravirtual features such as kvm-pv-eoi, kvm-pv-ipi, and kvm-pv-tlb-flush are optional accelerations. They should not be part of the first LAPIC/IPI or TLB-shootdown proof because they would make correctness depend on a Linux/KVM-specific host contract.
  • APIC virtualization features such as APICv or AMD AVIC are VMM-side acceleration mechanisms. capOS should not require or detect them before it has a stable architectural x2APIC path.

The practical QEMU proof targets are therefore:

  1. Boot the current xAPIC MMIO LAPIC implementation with -smp 2.
  2. Prove LAPIC timer ticks on vector 48 and IPI delivery on vector 49.
  3. Keep KVM paravirtual APIC/TLB/IPI features disabled or ignored for the first correctness proof.
  4. Run make run-interrupt-grant-x2apic as the selected-mode x2APIC proof, using -cpu qemu64,+smep,+smap,+rdrand,+x2apic and asserting the selected backend plus the routed interrupt wait/ack path.

capOS Recommendation

Keep x2APIC as the preferred backend when CPUID or firmware state exposes it, with xAPIC MMIO as the fallback. Keep correctness on the architectural LAPIC timer, IPI, EOI, and device-vector paths; KVM paravirtual APIC/TLB/IPI features remain optional accelerations rather than proof dependencies. Do not treat the selected-mode QEMU proof as high-core hardware readiness.

IOMMU Remapping Grounding

This note records primary-source facts for IOMMU/remapping work. The Intel VT-d path has landed under #[cfg(feature = "qemu")] in kernel/src/iommu.rs as a QEMU q35 smoke (make run-iommu-remapping); AMD-Vi table programming remains future work. DMAPool has manager-owned domain identity and mapping-lifecycle preflight records. For the QEMU Intel IOMMU path, real VT-d table programming, hardware-DMA translation proof, two-phase invalidation/IOTLB-flush revocation, and IOMMU-backed hostile stale-DMA smokes have all landed (see ddf-iommu-qemu-intel-remapping-smoke). For QEMU shapes without intel-iommu, the kernel-owned bounce-buffer fallback remains active (remapping_tables=not-programmed, hostile_hardware_isolation=not-claimed). AMD-Vi table programming and a bounce-buffer policy for non-IOMMU devices remain open.

Sources

  • Intel, Intel Virtualization Technology for Directed I/O Architecture Specification, content ID 671081. Intel page metadata on 2026-05-12 listed Date 2022-06-02 and Version 5.1 (Latest). Sections used: 6.2.2 “Context-Cache”, 6.2.4 “IOTLB”, 6.5.1 “Register-based Invalidation Interface”, 6.5.2 “Queued Invalidation Interface”, 6.5.3 “IOTLB Invalidation Considerations”, 6.6 “Set Root Table Pointer Operation”, 6.8 “Write Buffer Flushing”, 7.10 “Software Steps to Drain Page Requests & Responses”, 8.3 “DMA Remapping Hardware Unit Definition Structure”, 8.3.1 “Device Scope Structure”, 9.1 “Root Entry”, 9.3 “Context Entry”, 9.4 “Scalable-Mode Context-Entry”, and 11.4.5-11.4.9 covering the root-table-address, invalidation, fault, protected-memory-range, and invalidation-queue registers.
  • AMD, AMD I/O Virtualization Technology (IOMMU) Specification 48882, 48882-PUB Rev 3.10, February 2025. Sections used: 2.2 device table, device-table entry, I/O page table, and interrupt-remapping material; 2.4 “Commands”; 2.5 “Event Logging”; 3.4 “IOMMU MMIO Registers”; IVRS/device-table/page-table, command-buffer, completion-wait, invalidation, and event-log material.
  • QEMU, qemu-manpage entries for -device intel-iommu, -device amd-iommu, and -device virtio-iommu-pci; and QEMU PCI developer documentation for PCI IOMMU and IOTLB notifier APIs. These are current-master QEMU docs, not a frozen release manual; the qemu-manpage and PCI developer pages observed on 2026-05-12 were generated for QEMU version 11.0.50.

Intel VT-d Grounding

Intel VT-d identifies DMA request sources through PCI requester/source IDs and resolves them through DMA remapping hardware units described by DMAR DRHD structures. The table path is rooted at a root table and context tables. Root entries select context tables, context entries bind a source to a translation type, domain identifier, address width, and second-level page-table root, and scalable-mode context entries extend that context format. The landed QEMU smoke (kernel/src/iommu.rs, cfg(qemu)) uses exactly this path: DRHD unit, PCI segment and BDF/source ID, domain ID, aw-bits=39 address width, and a 3-level second-level page-table root. Scalable-mode context entries, 48-bit IOVA space, interrupt remapping, and multi-device domains remain out of scope for the current slice.

Invalidation is part of the mapping lifetime, not a diagnostic detail. Intel’s register-based and queued invalidation interfaces cover context-cache, IOTLB, device-TLB, interrupt-entry-cache, and wait/completion descriptors. The landed smoke uses register-based context-cache invalidation (CCMD.ICC global granularity) and domain-selective IOTLB invalidation (IOTLB.IVT, CAP.IRO-decoded offset), both with bounded completion-bit polling. Page reuse is ordered strictly after invalidation completion; a poll exhausted without observing completion fails closed and does not free the backing pages. Queued invalidation (GCMD.QIE) is not set in the current slice. Fault-reporting registers (FSTS.PPF, FRCD[0].F) are the minimum diagnostic surface for translation failures and protection faults, and are exercised by the unmapped-IOVA and stale-DMA hostile proofs.

QEMU’s intel-iommu documentation is useful for focused emulator smokes but should not be treated as hardware coverage. It is q35-only in QEMU current master. Relevant options include intremap, caching-mode, device-iotlb, and aw-bits=39|48; QEMU documents 39-bit IOVA space for 3-level IOMMU page tables and 48-bit IOVA space for 4-level tables.

AMD-Vi Grounding

AMD-Vi uses a different vocabulary and table root. Device requests are keyed by DeviceID and resolved through a Device Table Entry. A DTE carries validity, translation, interrupt-remapping, DomainID, mode/page-table-depth, and page-table-root information. Future shared capOS abstractions can name the logical domain and IOVA lifetime generically, but AMD-specific code should not pretend it is programming Intel root/context tables.

AMD invalidation and completion are command-buffer operations. The future mapping lifetime must include command-buffer invalidation commands, completion wait, and event-log handling. The event log is the basic hardware-facing diagnostic record for malformed requests, page faults, and table errors; the MMIO register set covers control/status, command and event pointers, event-log state, alternate event-log buffers, device-table segment bases, and extended features.

QEMU’s amd-iommu documentation is also q35-only in current master. The documented options include dma-remap for DMA address translation and permission checking and intremap for interrupt remapping. Treat these as emulator smoke inputs until capOS has separate hardware or provider evidence.

QEMU Test Surface

QEMU provides the emulator-level test surface for IOMMU smokes:

  • intel-iommu on q35 with aw-bits=39 (3-level second-level page tables) is the shape used by the landed make run-iommu-remapping smoke, pinned to QEMU 8.2.2. The smoke asserts table programming, hardware-DMA translation (mapped_iova_translated=hardware-dma), unmapped-IOVA fault observation (unmapped_iova_fault=observed), two-phase invalidation/IOTLB-flush, and IOMMU-backed hostile stale-DMA proofs.
  • amd-iommu on q35 with DMA remapping enabled is grounded here for a future AMD-Vi table-programming slice.
  • virtio-iommu-pci on q35 x86_64 or virt ARM covers a portable virtio-IOMMU frontend if selected later.
  • PCI IOMMU/IOTLB notifier APIs in QEMU developer docs describe how emulated devices observe translation changes; they are not guest architectural requirements.

QEMU citations in the Sources section are current-master documentation observed on 2026-05-12. Tests pin the local qemu-system-x86_64 --version, machine type, and full device option string in the smoke evidence.

Implementation Status and Future Slices

Intel VT-d QEMU smoke (landed, cfg(qemu)):

  • DMAR/DRHD discovery, MMIO/fault-status diagnostics, and disabled IOVA ledger preflight records: landed as prerequisites.
  • kernel/src/iommu.rs real VT-d legacy-mode entry programming, RTAR write, GCMD/GSTS SRTP-then-TE handshake, hardware-DMA translation proof via virtio-rng, unmapped-IOVA fault observation via FSTS/FRCD, two-phase invalidation/IOTLB-flush revocation, and IOMMU-backed hostile stale-DMA smokes: all landed as of 2026-05-14 (slices A1/A2/B/C). See ddf-iommu-qemu-intel-remapping-smoke.
  • IOVA export stays disabled for this slice (iova_export=disabled-this-slice); hostile_hardware_isolation=not-claimed in all evidence.

Future slices (not yet started):

  • AMD-Vi table programming: separate source grounding and evidence; AMD-specific DTE, DeviceID, command-buffer, and event-log names must not be conflated with Intel root/context tables.
  • Source-grounding refresh for AMD or additional Intel features (48-bit IOVA, scalable-mode context entries, interrupt remapping, device-IOTLB) when a real branch selects them.
  • Bounce-buffer policy for QEMU shapes without intel-iommu: an explicit decision on IOMMU/remapping or an explicit bounce-buffer policy for non-IOMMU devices remains open.
  • Trusted multi-device sharing groups, production NIC or storage driver ownership, and moving the live virtio-net path off bounce buffers are not in scope for the current slice.

DMA User-Space Driver Isolation

This note records the DMA-addressing and isolation consequences capOS must use when planning user-space storage and NIC drivers. It is intentionally about authority boundaries, not about a particular NVMe or virtio implementation.

Address Spaces And Trust Boundaries

A DMA-capable device does not use a process virtual address. It consumes a device-visible address carried in descriptors, queue-base registers, PRP/SGL entries, or an equivalent protocol field.

On a bare host with an IOMMU:

user VA --CPU MMU--> host physical address
device IOVA --IOMMU--> host physical address

On a guest VM:

guest user VA --guest MMU--> guest physical address --EPT/NPT--> host physical address

With a virtual or assigned IOMMU, a guest can additionally reason about:

guest device IOVA --vIOMMU or paravirt grant layer--> guest physical address

The host still owns the real host IOMMU or equivalent hypervisor translation. A guest-programmable vIOMMU is useful because it gives the guest kernel a guest-internal DMA authority boundary; it is not direct control of the host IOMMU.

Host User-Space Driver Pattern

A safe host user-space driver resembles the VFIO/IOMMUFD split:

  • The kernel owns PCI discovery, BAR assignment, PCI configuration mediation, IOMMU domain creation, DMA map/unmap, page pinning, interrupt or MSI-X routing, reset, hotplug, and revocation.
  • The user-space driver owns protocol logic: queue formats, descriptor contents, device-specific register sequencing, doorbells, polling, completion handling, and command construction.
  • The driver may receive a domain-scoped IOVA for a live buffer only when the kernel has installed and can revoke the IOMMU mapping for that device.
  • The driver must not receive unrestricted host physical addresses.

UIO-style “map a BAR and deliver interrupts” is not a complete security model for a DMA-capable PCI device. If a user-space process can program a DMA engine through MMIO, then DMA isolation requires either an IOMMU domain or a stricter broker that prevents raw device-address publication.

Guest Microkernel Pattern

Host isolation and guest isolation are different claims.

For an assigned PCI device or SR-IOV VF without a guest-visible IOMMU, the host can still protect itself by mapping the device only to the VM’s memory. That does not protect the guest kernel from an untrusted guest user-space driver: from the guest’s perspective the device can still DMA to arbitrary guest physical pages.

Virtual devices have the same guest-internal issue in a different form. If an untrusted driver can put arbitrary guest physical addresses into virtqueue descriptors, the host backend can write into guest kernel memory while still staying inside the VM boundary. The host remains protected; the guest kernel is not.

A guest microkernel that wants untrusted user-space drivers therefore needs one of these guest-visible authorization layers:

  • a vIOMMU or virtio-iommu path where the guest kernel controls guest IOVA to guest physical mappings;
  • a paravirtual grant-table model where descriptors carry grant identifiers instead of raw guest physical addresses;
  • a trusted mediation service that owns descriptor/device-address fields and lets the untrusted driver submit only typed commands, buffer capabilities, or opaque handles.

The invariant is:

Never let an untrusted guest driver provide a raw guest physical address to a
device or backend unless a guest-visible DMA authorization layer validates it.

BAR, MSI-X, And DMA Are Separate Authority Surfaces

BAR/MMIO controls CPU-to-device register access. DMA controls device-to-memory access. MSI/MSI-X controls device-to-interrupt-controller messages. A safe user-space driver interface needs all three mediated.

  • Mapping a BAR is not enough; a BAR write can enable bus mastering or ring a doorbell that makes descriptors visible to the device.
  • MSI-X tables often live inside a BAR. A driver must not get arbitrary write access to MSI-X message address/data entries unless the kernel or hypervisor can mediate interrupt remapping.
  • IOMMU memory remapping does not by itself protect BAR register semantics or interrupt routing.

For capOS, DeviceMmio, DMAPool/DMABuffer, and Interrupt must remain separate capabilities with a single device-manager ledger tying them to the same owner generation and teardown state.

No-IOMMU Bounce-Buffer Consequences

On a shape without guest-programmable remapping, a real PCI device’s device-visible address is the host physical or bus address the controller uses for DMA. A bounce buffer can keep the data path manager-owned, but it does not magically create an untrusted-driver-safe IOVA namespace.

The no-IOMMU fallback can preserve no-host-physical-exposure only if userspace does not author raw device-address fields. The kernel or a trusted device manager must instead:

  • allocate and pin the device-visible bounce pages;
  • program queue-base registers and PRP/SGL or virtqueue address fields, or translate typed driver requests into those fields;
  • copy between device-visible bounce pages and non-device memory when the selected backend requires it;
  • quiesce outstanding DMA before revoke or page reuse;
  • scrub bounce pages before reuse;
  • keep hostile_hardware_isolation=not-claimed.

The costs are direct: extra copies, higher latency, CPU/cache pressure, bounded pool exhaustion risk, more teardown bookkeeping, and no hostile-hardware memory isolation claim. These costs are the price of not exposing host physical addresses when no guest-programmable remapping exists.

GCP And QEMU Implications

The GCE probes in Cloud DMA Provider Evidence Inventory show no guest-programmable IOMMU on the sampled GCP shapes: no usable DMAR/IVRS/IORT tables or IOMMU groups, and SWIOTLB software bounce buffering in the Linux guest. Host-side or provider-side isolation may still exist, but capOS cannot program or validate it from inside the guest.

The practical split is:

  • QEMU run-iommu-remapping remains the right local proof lane for direct-remapping behavior: domain-scoped IOVA export, per-device domains, invalidation, faults, and stale-DMA behavior.
  • GCP storage and NIC driver planning must treat the probed shapes as no-IOMMU/bounce-buffer targets until a future runtime probe observes a guest-programmable remapping unit.
  • A design that requires the provider to write device-visible queue-base or PRP/SGL addresses is valid only on a verified direct-remapping/vIOMMU path, or after capOS implements a separate synthetic address namespace that the kernel translates before hardware sees it.
  • On the current GCP/no-IOMMU path, the recommended storage design is brokered: userspace owns protocol decisions and buffer capabilities, while the kernel or device manager materializes the device-visible DMA addresses.

Use three explicit modes in planning and task acceptance:

ModeWhen it appliesUser-space device-address exposure
direct-remappingcapOS discovers, programs, and validates a guest-visible IOMMU/vIOMMU domain.Domain-scoped IOVA only, labeled as meaningless outside that domain.
brokered-bounceNo usable guest IOMMU, but a manager-owned bounce path can safely support the device.None: provider passes buffer caps, grant IDs, or typed commands; kernel writes device-visible addresses.
unsupportedObservations are contradictory, unsafe, or no safe brokered path exists.None: device stays unbound or disabled.

For GCP today, brokered-bounce is the only credible storage/NIC driver target on the probed shapes. direct-remapping remains a QEMU proof lane and a future cloud/hardware lane only after runtime evidence shows guest-programmable remapping.

Cloud DMA Provider Evidence Inventory

This note is the research substrate for the cloud DMA backend decision. It records official AWS, Azure, and Google Compute Engine device-surface facts, defines the evidence-matrix schema that the backend policy fills, specifies the live guest-probe checklist a later credentialed cloud-run task captures, and fixes the classification rules that separate a DMA-capable surface from guest-programmable remapping authority.

It makes no backend selection and no per-VM-shape safety claim. It does not launch a cloud VM, require provider credentials, or assert that any instance shape is safe for direct DMA. Selecting a backend and asserting bounce-buffer safety or IOMMU coverage for a specific shape require attended sign-off and are out of scope here; that work is cloud-dma-backend-selection. The model this note feeds is docs/proposals/dma-assurance-model-proposal.md; the local QEMU/IOMMU grounding it builds on is docs/research/iommu-remapping.md.

How These Facts Were Collected

Provider facts are from official provider documentation and API/CLI references only, retrieved on the dates recorded below. A “fact” here is a statement the provider document makes directly. Where a property is read from an API field rather than stated in prose, it is marked as an inference from API field. No statement in this note comes from running a cloud instance; the live-probe checklist exists precisely because a guest cannot prove provider-side isolation from documentation alone.

Provider Official Facts

AWS EC2

Source: ec2:DescribeInstanceTypes API reference (InstanceTypeInfo, NetworkInfo, EbsInfo), retrieved 2026-05-24. The matching CLI is aws ec2 describe-instance-types --instance-types <type>.

  • Network surface. networkInfo.enaSupport reports Elastic Network Adapter (ENA) support with values unsupported | supported | required. networkInfo.efaSupported (boolean) and networkInfo.efaInfo report Elastic Fabric Adapter presence. networkInfo.enaSrdSupported (boolean) reports ENA Express (Scalable Reliable Datagram). networkInfo.encryptionInTransitSupported (boolean) reports automatic in-transit encryption between instances.
  • EBS/NVMe surface. ebsInfo.nvmeSupport reports NVMe support for EBS with values unsupported | supported | required. ebsInfo.ebsOptimizedSupport reports EBS-optimized behavior (unsupported | supported | default).
  • Instance store. instanceStorageSupported (boolean) and instanceStorageInfo report local instance-store NVMe disks.
  • Accelerators. gpuInfo, fpgaInfo, inferenceAcceleratorInfo, neuronInfo, and mediaAcceleratorInfo describe GPU/FPGA/inference/Neuron/ media accelerator surfaces when present.
  • Hypervisor. hypervisor reports nitro | xen. Modern Nitro instances report nitro; the Nitro system is where ENA and NVMe EBS exposure originate.

Inference from API field: an instance type with enaSupport=required and ebsInfo.nvmeSupport=required exposes a DMA-capable NIC and NVMe block surface. This identifies a DMA-capable surface; it is not evidence of guest-programmable remapping authority.

Azure Virtual Machines

Source: Azure Accelerated Networking overview (page ms.date 2026-02-05, last updated 2026-05-05) and az vm list-skus, retrieved 2026-05-24.

  • Network surface. Accelerated Networking enables single-root I/O virtualization (SR-IOV) on supported VM sizes, providing a host-bypass data path. The underlying SR-IOV hardware is one of NVIDIA/Mellanox ConnectX-3, ConnectX-4 Lx, ConnectX-5, or the Microsoft Azure Network Adapter (MANA).
  • Capability query. A VM size’s Accelerated Networking capability is read from az vm list-skus as the AcceleratedNetworkingEnabled capability value. Most general-purpose and compute-optimized sizes with two or more vCPUs support it (four or more on hyperthreaded sizes); NC and NV sizes appear in output but do not support it.
  • VF dynamic binding and revocation. The document states the SR-IOV virtual function (VF) is dynamically revoked and restored across host maintenance and live migration. Guest images must bind to the synthetic hv_netvsc device, not the VF, to keep connectivity, and must mark mana | mlx4_core | mlx5_core SR-IOV devices unmanaged so the synthetic/VF bond is transparent.
  • Driver delivery. Azure does not update the Mellanox or MANA in-guest drivers; the guest kernel/distribution provides them.

Inference from API field: AcceleratedNetworkingEnabled=True identifies a DMA-capable SR-IOV NIC surface whose VF can appear and disappear at runtime. The documented VF revoke/restore behavior is a driver-lifecycle constraint, not remapping evidence.

Google Compute Engine

Source: Use Google Virtual NIC (gVNIC) and About Local SSD disks, retrieved 2026-05-24.

  • Network surface. Third-generation and later machine series (excluding bare metal) support only gVNIC for the virtual network interface (no virtio-net). First- and second-generation machines must use gVNIC when on Arm CPU platforms, when configured as Confidential VM, or when requiring network speeds between 50 and 100 Gbps, and otherwise still support VirtIO-Net. Custom images declare gVNIC support through the GVNIC guest OS feature (--guest-os-features=GVNIC, or guestOsFeatures:[{type:"GVNIC"}]).
  • Local SSD surface. Local SSD is attached over either the NVMe or SCSI interface; the NVMe interface is required for peak performance, and some machine series support only one of the two interfaces. The interface is chosen by the disk interface field (NVME or SCSI).
  • Storage transport. Persistent Disk attaches as virtio-scsi on machine families that expose it, while newer families expose NVMe; the exact transport is a per-machine-family property to be captured per shape rather than assumed.

Inference from API field: a third-generation-or-later GCE machine type exposes a gVNIC NIC surface and may expose NVMe Local SSD/Persistent Disk. This identifies DMA-capable NIC/storage surfaces; it is not remapping evidence.

Evidence-Matrix Schema

The backend policy fills one row per observed (provider, shape, image) tuple. Provider-fact columns come from documentation/API; observation columns come from the live-probe checklist; the last two columns are derived classifications, not provider claims.

ColumnMeaning
Provideraws / azure / gcp.
Region/zoneThe region or zone the observation was taken in.
Instance typeProvider instance type / VM size / machine type.
Image/kernelBoot image identifier and guest kernel version.
Source command or URLThe exact API/CLI command or official doc URL.
Retrieval dateDate the source was read or the probe was captured.
Visible PCI/storage/network devicesDevices the guest enumerates (lspci, block/net inventory).
Visible IOMMU tables/groupsACPI DMAR/IVRS/IORT presence and /sys/kernel/iommu_groups.
Provider-side isolation notesDocumented host-side isolation (support-policy assumption, not proof).
Guest-programmable remapping observationsWhether the guest can discover, program, and validate a remapping authority.
Runtime backend inferred by capOSThe backend capOS would select from observations (see classification rules).
Support-policy statusCoarse advertised-target roll-up: Direct-remapping / Labeled-bounce-buffer / Unsupported, pending attended sign-off.

Seed Rows (docs/API-derived, no safety claim)

These rows are seeded from documentation and API fields only. Observation and backend columns are intentionally blank because no instance was probed; they are filled by a later credentialed cloud-run task. No row asserts that any shape is safe for direct DMA.

ProviderExample shapeDocumented NIC surfaceDocumented storage surfaceRemapping observationBackend
awsNitro instance, enaSupport=required, nvmeSupport=requiredENA (SR-IOV)NVMe EBS + optional instance-store NVMenot yet probednot yet selected
azureSize with AcceleratedNetworkingEnabled=TrueSR-IOV VF (MANA/ConnectX) bonded to synthetic hv_netvscManaged disk (transport per shape)not yet probednot yet selected
gcp3rd-gen+ machine type (e.g. C3)gVNIC onlyNVMe Local SSD / PD per familyprobed 2026-05-24: IOMMU disabled, SWIOTLB (see GCE Live Probe Results)labeled bounce-buffer
gcp1st/2nd-gen, x86, non-Confidential, under 50 GbpsVirtIO-Net or gVNICvirtio-scsi PD / Local SSD (NVMe or SCSI)probed 2026-05-24: IOMMU disabled, SWIOTLB (see GCE Live Probe Results)labeled bounce-buffer

GCE Live Probe Results (2026-05-24)

These rows replace the GCE “not yet probed” placeholders with live guest observations. Four representative shapes were booted on Google Compute Engine (stock Debian 12, kernel 6.1.0-47-cloud-amd64) in a dedicated test project, each running a /sys- and /proc-only probe delivered through instance metadata and read back over the serial console. Every instance booted with no external IP, no service account, and was deleted immediately after its probe output was captured.

Machine typeClassNIC driverStorageGuest IOMMU / DMARDMA path
n1-standard-11st-genvirtio_netvirtio-scsi (sda)intel_iommu=off, DMAR: IOMMU disabled, no DMAR table, empty iommu_groupsSWIOTLB software bounce buffering
e2-small2nd-genvirtio_netvirtio-scsi (sda)same: IOMMU disabled, no DMAR, no groupsSWIOTLB
c3-standard-43rd-gen Intelgvnicnvme Local SSD (Google vendor 0x1ae0)sameSWIOTLB
n2d-standard-2 ConfidentialAMD SEVgvnicnvmesame; additionally Memory Encryption Features active: AMD SEVSWIOTLB forced (512 MB)

Verbatim kernel evidence common to all four shapes:

  • the boot command line carries intel_iommu=off;
  • DMAR: IOMMU disabled;
  • PCI-DMA: Using software bounce buffering for IO (SWIOTLB);
  • /sys/kernel/iommu_groups is empty, and no DMAR, IVRS, or IORT table is present under /sys/firmware/acpi/tables/.

The Confidential (SEV) shape additionally logs software IO TLB: Memory encryption is active and system is using DMA bounce buffers, confirming that bounce buffering is enforced by memory encryption, not merely by configuration.

Classification. No probed GCE shape – neither the older virtio surface nor the modern gVNIC/NVMe surface – exposes a guest-programmable IOMMU that capOS could discover, program, and validate. By the classification rules this rules out the direct-remapping backend and selects the labeled bounce-buffer fallback for the cloud path on these shapes. On the Confidential VM the bounce-buffer path is a hardware invariant: the device cannot reach encrypted guest memory directly. This is a fail-closed observation, not a hostile-hardware isolation claim; the binding backend selection and any “supported shape” advertisement remain attended sign-off work in cloud-dma-backend-selection.

Design implication for GCP storage/NIC drivers. A provider-side or hypervisor-side IOMMU may still protect Google infrastructure, but that is not guest-programmable remapping authority for capOS. On the probed GCE shapes a capOS userspace storage or NIC provider must therefore be planned as a no-IOMMU, brokered-bounce design: userspace receives buffer capabilities, grant IDs, or typed commands, while the kernel or device manager materializes the device-visible queue-base, descriptor, PRP/SGL, or virtqueue address fields. The direct-remapping lane remains valid for QEMU run-iommu-remapping and for future cloud/hardware shapes that expose a guest-programmable remapping unit; it is not a GCP premise today. The generic design consequences are recorded in DMA User-Space Driver Isolation.

Runtime Probe Protocol

A later credentialed cloud-run task captures the following from the guest, with the region/zone, image, kernel, and retrieval date recorded for each command. Capture the verbatim command output as evidence; do not summarize it.

  • lspci -nnk -D – PCI topology with full domain:bus:device.function, vendor/ device IDs, and bound kernel driver per function (NIC, storage controller, accelerator identity).
  • ls /sys/kernel/iommu_groups (and per-group devices/) – whether the guest sees IOMMU groups at all, and how devices are grouped.
  • ACPI table presence: DMAR (Intel VT-d), IVRS (AMD-Vi), IORT (Arm SMMU) under /sys/firmware/acpi/tables/. Absence is itself evidence.
  • Kernel log IOMMU/SWIOTLB lines (dmesg | grep -iE 'iommu|dmar|ivrs|iort|swiotlb') – whether the kernel enabled an IOMMU, fell back to software bounce (SWIOTLB), or found no remapping unit.
  • Network driver identity: ethtool -i <iface> and the bound driver (ena, mana/mlx5_core, gve, virtio_net).
  • Block transport identity: lsblk -o NAME,TRAN,MODEL and controller driver (nvme, virtio_blk, virtio_scsi).
  • NVMe inventory: nvme list and nvme id-ctrl <dev> for controller identity where NVMe is present.

A probe result is only usable evidence if capOS could perform the equivalent discovery from its own ACPI/PCI enumeration; the Linux commands above stand in for that discovery during the research phase.

Classification Rules

These rules are deliberately fail-closed and feed the runtime backend inferred by capOS and support-policy status columns.

  • SR-IOV, a virtual NIC (ENA, gVNIC, MANA, virtio-net), a GPU, an accelerator, or local NVMe identifies a DMA-capable or DMA-adjacent surface. This is the presence of a device that does or could bus-master; it is not a safety claim.
  • A direct-remapping classification requires guest-programmable remapping authority that capOS can discover, program, and validate – a usable Intel VT-d, AMD-Vi, or Arm SMMU unit the guest controls, with translation, fault, and invalidation behavior matching docs/research/iommu-remapping.md. A DMA-capable surface alone never implies this.
  • Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data- path bypass, hypervisor-side IOMMU) are support-policy assumptions, not proof that capOS can safely use direct DMA from inside the guest.
  • Ambiguous, contradictory, or unvalidated observations select Unsupported. This matches the assurance model: unknown or contradictory observations select Unsupported, not an optimistic default.

These map onto the three backend candidates in the assurance model (docs/proposals/dma-assurance-model-proposal.md): a direct remapping domain, a labeled bounce-buffer fallback (direct_dma=blocked, all device-visible memory manager-owned, no host physical address exposed, hostile-hardware isolation not claimed), or Unsupported.

Relationship to Backend Selection

cloud-dma-backend-selection consumes this inventory: it maps each backend candidate to the assurance-model invariants, fills the evidence matrix per cloud VM shape, and drafts the downstream-contract scaffolding (which device-manager policy fields a driver declares – direct_dma, trusted_domain, bounce_buffer – and which stale-handle/stale-completion/teardown/ no-host-physical-exposure gates each candidate must satisfy). That task already declares this inventory as a dependency. The binding backend selection and any per-shape safety assertion remain attended-sign-off work and are not made here.

Relevant Research and Grounding

  • docs/research/iommu-remapping.md – primary-source Intel VT-d/AMD-Vi/QEMU remapping grounding the direct-DMA classification depends on.
  • docs/proposals/dma-assurance-model-proposal.md – the model objects, invariants, and backend-candidate matrix this evidence feeds.
  • docs/dma-isolation-design.md – the manager-owned DMA isolation contract and bounce-buffer fallback the labeled-fallback candidate must satisfy.
  • docs/proposals/cloud-deployment-proposal.md – the cloud deployment context for the usable-instance milestone.
  • docs/tasks/cloud-dma-backend-selection.md – the backend decision that consumes this inventory.

Research: Future Scheduler Architecture

This note records the prior art checked for future capOS scheduling work after the first SMP and per-thread ring milestones exposed that scheduler structure, not only timer programming, will decide whether capOS scales.

Local Grounding

Existing capOS documents already cover part of the answer:

External Sources Checked

Findings

Fair General-Purpose Scheduling

Linux CFS established the now-common model that ordinary tasks should be ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps the fair-scheduler lineage but chooses the eligible task with the earliest virtual deadline, using request size and lag to improve latency and fairness.

The capOS consequence is not “import Linux CFS.” It is:

  • ordinary best-effort work should use virtual-time accounting (Phase D WFQ is now the active policy; the earlier FIFO round-robin was the bootstrap);
  • latency-sensitive best-effort work should have bounded, policy-visible request sizes or weights rather than hidden scheduler magic;
  • per-CPU run queues are a prerequisite before any EEVDF-like policy matters at SMP scale.

EEVDF is the strongest candidate for the next capOS best-effort policy evolution after WFQ. It should follow WFQ rather than replace it immediately because it depends on accurate runtime charging, per-CPU runnable ownership, and migration accounting that are not yet in place.

Per-CPU Run Queues and Topology

Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit. FreeBSD ULE additionally exposes CPU topology and affinity as first-class placement concerns. This matches the current capOS scaling evidence: one global scheduler lock and one global run queue make every CPU contend on the same state even after per-thread rings remove the process-wide CQ bottleneck.

The near-term capOS scheduling architecture should split:

  • per-CPU current thread and run queue ownership;
  • cross-CPU wakeup and migration paths;
  • shared process/thread metadata protected by narrower locks;
  • placement policy from dispatch mechanism;
  • diagnostic counters for lock hold/spin time, migration, steals, and IPIs.

Realtime and Temporal Isolation

Linux SCHED_DEADLINE uses EDF plus Constant Bandwidth Server-style budget, deadline, and period parameters. Its key lesson for capOS is admission: deadline scheduling without bandwidth control is only a priority policy, not a guarantee.

seL4 MCS is the more capability-native precedent. CPU time is represented by scheduling-context objects. Passive servers can run on caller-donated CPU time, avoiding priority inversion across synchronous IPC. This maps directly to capOS endpoint services and direct IPC handoff.

The capOS split should remain:

  • SQE.deadline_ns: request freshness and propagation metadata;
  • SchedulingContext: spendable CPU-time authority;
  • donation: temporary transfer of CPU budget/deadline along a synchronous capability path;
  • RealtimeIsland: admitted bundle of scheduling contexts, memory/device reservations, communication paths, and overrun policy.

Tickless, Isolation, and Housekeeping

Linux NO_HZ and CPU isolation reinforce that tick suppression is not one feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract that also needs housekeeping CPUs, accounting, timer migration, deferred work placement, and revocation latency policy.

For capOS, this grounding shaped the implementation order: automatic nohz activation for the narrow single-runnable-entity window and SQPOLL-driven auto-nohz for ring-coupled leases are now implemented (Phase F), both tied to the CpuIsolationLease with housekeeping, deferred-work placement, clockevent deadline substrate, one-SQ-consumer ownership, and fail-closed rollback prerequisites satisfied first. Generic full-nohz for explicitly budgeted compute leases, timeout-based auto-revoke, and SQPOLL nohz for explicitly leased caller-thread rings have since landed. Remaining future work includes:

  • full-nohz tied to policy-service issuance and durable monitoring telemetry;
  • SQPOLL nohz beyond the current caller-thread ring-coupled lease shape;
  • realtime island nohz after admission proves unrelated work, IRQs, deferred frees, and timers are excluded or bounded.

Pluggable and User-Space Policy

Linux sched_ext and ghOSt show that fast scheduler experimentation is useful, but they also preserve privileged dispatch and enforcement. sched_ext runs BPF inside the kernel scheduler framework with fallback; ghOSt delegates policy to user-space agents while retaining kernel mechanisms for safety and preemption.

For capOS, the safe architecture is:

  • keep dispatch, budget enforcement, interrupt handling, idle, and fallback in the kernel;
  • expose policy knobs through capabilities;
  • let a privileged scheduler-policy service own admission, budget selection, CPU partitioning, isolation leases, and tuning;
  • call the policy service on configuration changes, depletion/timeout faults, and coarse placement events, not on every context switch.

Dynamic policy loading is a later experiment. It should not become the first way to make basic SMP scheduling scale.

Datacenter Runtime Schedulers

Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service latency by managing cores, preempting long request handlers, and separating fast user-level scheduling from coarser kernel control. They are useful because capOS will host services, agent runtimes, and network stacks that want low tail latency.

The shared lessons are:

  • core grants are different from CPU-time budgets;
  • user-level worker schedulers need kernel-visible blocking and preemption boundaries;
  • tail-latency policies need request-level telemetry, not only thread-level CPU shares;
  • cross-core coordination must be cheap enough that it does not dominate the service latency it tries to reduce.

For capOS this argues for scheduler hints and policy capabilities above the kernel mechanism, not a datacenter-specific kernel scheduler as the default.

Stateful Work Graphs

The stateful task/job graph proposal is related at the workload layer. A graph node can carry assignment metadata such as priority, deadline, budget, queue, and lease, and a domain coordinator can decide which node attempt is runnable inside that graph. That is not the same authority as kernel CPU scheduling.

The scheduler consequence is a clean boundary:

  • graph/node priority is domain policy until translated by an authorized scheduler policy service;
  • graph budgets reference resource profiles or scheduling contexts, but do not mint CPU time by themselves;
  • graph deadlines may create request deadlines or admission inputs, but do not bypass scheduler admission;
  • build, init, agent, and operator graph coordinators should lease work and consume scheduler primitives rather than owning a global CPU run queue;
  • scheduler telemetry should be attachable to graph runs as artifacts, so a failed or slow job can explain whether it waited on authority, CPU budget, dependency state, I/O, or policy.
  1. (Done: Phase D/E/F) Finish the current thread-scale evidence before larger policy changes. Phase D WFQ, Phase E SchedulingContext, and Phase F CpuIsolationLease / auto-nohz / SQPOLL-coupled nohz have landed.
  2. Split scheduler state into per-CPU runnable ownership and bounded cross-CPU wake/migration. (Per-CPU queues remain future work; Phase F.5.)
  3. Add precise CPU accounting and scheduler attribution before changing the default policy. (Attribution guardrails landed in Phase A; full per-CPU accounting is Phase F.5 follow-on.)
  4. Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy after accounting and per-CPU queues exist. (WFQ is current; EEVDF is a follow-on evaluation deferred until per-CPU queues exist.)
  5. Keep SCHED_DEADLINE/EDF-CBS and seL4 MCS as the precedent for admitted realtime work, but express CPU authority as capOS SchedulingContext capabilities. (SchedulingContext is implemented; RealtimeIsland admission is Phase G future work.)
  6. Keep user-space scheduler policy coarse-grained and capability-authorized; do not consult a user process on every timer interrupt or dispatch.
  7. Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with housekeeping and revocation constraints. (Ring-coupled SQPOLL nohz and generic full-nohz for explicitly budgeted compute leases are implemented; policy-service issuance remains future work.)
  8. Keep runtime schedulers above per-thread rings, futex/park/notification primitives, timers, and explicit thread objects.

The resulting target is a layered scheduler:

  • Kernel dispatch/enforcement: per-CPU queues, context switch, idle, accounting, budget enforcement, timeout faults, direct IPC donation, and cross-CPU wake/migration.
  • Kernel policy primitives: weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, and realtime-island admission hooks.
  • Userspace policy: profiles, admission, budget selection, service/runtime hints, placement, diagnostics, and policy reload.
  • Userspace runtimes: work stealing, actor queues, async reactors, service request schedulers, and language-specific M:N scheduling.

Open Questions

The questions below have been answered by Phase D/E/F implementation; they are kept for record and context:

  • Answered (Phase D): WFQ is the first virtual-time policy; EEVDF is deferred until per-CPU queues and runtime charging exist.
  • Answered (Phase D): The thread-scale milestone did not require a per-CPU queue split; WFQ on a global queue with per-thread weight sufficed.
  • Answered (Phase E): The initial SchedulingContext ABI uses SchedulingContextSpec (weight, latency class, budget, period, overrun policy) and SchedulingContextInfo for info-only read access; SchedulingContext.info() is method id 0 for stability. Donation/return through endpoints is implemented; realtime island admission is future work.
  • Answered (Phase F): CpuIsolationLease revocation interacts with session logout and process exit through lease-generation staling, which is the load-bearing rollback trigger; service replacement and process exit cleanup go through the same generation-staling path.

Remaining open questions:

  • What is the minimum per-CPU queue split that closes the full-SMP scalability milestone (Phase F.5) without prematurely designing the full fair scheduler?
  • How should policy-service issuance select and renew generic full-nohz and SQPOLL nohz leases beyond the current explicit local proofs?
  • Which scheduler telemetry belongs in the always-on kernel and which belongs behind the benchmark-only measure feature?
  • What is the right RealtimeIsland admission shape for admitted scheduling contexts, memory/device reservations, and overrun policy (Phase G)?

Research: NO_HZ, SQPOLL, and Realtime Scheduling

This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.

Local Grounding

Relevant local docs:

  • Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, CPL0 idle-thread paths, and Phase F nohz/SQPOLL activation state machine.
  • SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
  • Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
  • Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
  • Multimedia pipeline latency: admitted realtime island model for media graphs.
  • Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
  • x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.

External Sources Checked

NO_HZ Findings

Linux separates three timer policies:

  • periodic scheduler ticks;
  • tick suppression only while a CPU is idle (NO_HZ_IDLE);
  • adaptive tick suppression for CPUs with one runnable task (NO_HZ_FULL).

The first capOS target should match the conservative shape of NO_HZ_IDLE, not Linux NO_HZ_FULL. The Linux docs explicitly call idle tick suppression common/default-useful, while NO_HZ_FULL is specialized for realtime and HPC loads and requires at least one non-adaptive CPU for timekeeping. That maps to capOS because the current scheduler tick still performs too much work: timeout expiry, waiter wakeup, run-queue rotation, timer-side ring dispatch, and transitional network polling.

Linux also records a cost: dyntick-idle adds instructions on idle entry/exit and may require expensive clockevent reprogramming. capOS should therefore add counters before changing behavior and should retain a runtime ForcedPeriodic fallback.

Timekeeping Findings

Linux’s timer stack distinguishes:

  • clock sources: monotonic timeline counters;
  • clock events: hardware devices that interrupt at selected future times;
  • scheduler ticks: one user of clock events, not the timebase itself.

This split is the important design point for capOS. Current TICK_COUNT style timekeeping is adequate for periodic scheduling but becomes the wrong owner once the scheduler can stop the tick. capOS should introduce a monotonic now_ns clocksource layer before enabling tickless idle.

Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:

  • waiters should be stored by absolute expiry time, not by periodic tick count;
  • time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.

capOS already bounds waiter counts, so the first implementation can use a small ordered array, BTreeMap, or heap. The security property is bounded, non-allocating interrupt-path expiry, not a specific data structure.

CPU Isolation and Housekeeping Findings

Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.

For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:

isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only

The same rule applies whether the isolated entity is a kernel SQPOLL worker, a userspace poller, or a future admitted realtime loop. CpuIsolationLease names the owner, allowed CPU set, allowed mode, accounting target, and revocation policy. It performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (Phase F closed), and a ring-coupled kernelSqpoll lease suppresses ticks while its bound ring is in SQPOLL running/sleeping mode with a live owner (SQPOLL-driven auto-nohz closed). Without a CpuIsolationLease, a latency-sensitive hint must not grant exclusive CPU access. Generic full-nohz for explicitly budgeted compute threads, a generic SQPOLL nohz state machine for explicitly leased caller-thread rings, and timeout-based auto-revoke have since landed. Broader userspace-poller/device-queue issuance remains future work.

io_uring SQPOLL Findings

Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission queue. While it remains active, applications can publish SQEs and observe CQEs without entering the kernel on each submission. When the poller sleeps after its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must call io_uring_enter(..., IORING_ENTER_SQ_WAKEUP) or let liburing do that wake.

The capOS consequence is not “copy io_uring”. It is an ownership rule:

SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.

This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.

SQPOLL full-nohz required: per-thread rings; a ring mode bit and quiescent mode transitions; per-CPU scheduler ownership and reschedule IPIs; a housekeeping CPU; removal or explicit placement of scheduler-tick-polled networking. Those prerequisites are now closed (Phase F one-SQ-consumer, bounded SQPOLL ring mode, housekeeping/deferred-work placement, per-CPU idle thread). SQPOLL-driven nohz activation is implemented for explicitly leased caller-thread kernelSqpoll rings, including producer wake, bounded service progress, rollback, and stale-owner rejection. Broad userspace-poller/device-queue policy issuance remains future work.

Realtime Findings

Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and depends on admission/bandwidth management. Its documentation is explicit that without admission control, no scheduling guarantee follows. That directly separates per-request deadline metadata from CPU budget authority.

PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.

seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.

For capOS:

  • SQE.deadline_ns is request freshness metadata.
  • SchedulingContext is CPU-time authority.
  • RealtimeIsland is the admission object for a whole graph/loop.
  • Scheduling-context donation is how timing survives synchronous capability calls through passive services.
  • SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.

capOS Design Consequences

  1. Implement tickless idle before full-nohz.
  2. Split clocksource from clockevent before stopping periodic ticks.
  3. Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
  4. Replace user-mode idle with kernel/per-CPU idle before real tickless idle. Done: the scheduler idle path is a CPL0 per-CPU kernel idle thread; the user-mode idle process is removed.
  5. Keep periodic preemption while there is runnable contention.
  6. Keep networking in ForcedPeriodic or move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs. Network-polling placement is landed as a fail-closed admission gate; placement routing for arbitrary network-active CPUs remains future work.
  7. Treat full-nohz as a CPU lease and housekeeping design, not a standalone timer optimization. CpuIsolationLease is now implemented, generic full-nohz is landed for explicitly budgeted compute leases, and policy-service issuance remains future work.
  8. Add SQPOLL only after per-thread rings and per-CPU scheduler ownership. Done: one-SQ-consumer ring ownership, bounded SQPOLL ring mode, and SQPOLL-driven auto-nohz activation are all closed.
  9. Require one SQ consumer per ring mode. Done: enforced by the Phase F one-SQ-consumer ring ownership gate.
  10. Use SQE.deadline_ns only for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy in SchedulingContext.
  11. Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.

Research: Time and Clock Authority in Operating Systems

This note records verified external grounding for capOS’s time and clock authority design. It covers Linux clock IDs and privilege model, time namespaces, NTP/chrony discipline, PTP/IEEE-1588, Fuchsia’s UTC clock object, and leap-second handling. Findings feed directly into the WallClock / ClockDiscipline / ClockProvenance design in Time and Clock.


1. Linux: Clock IDs and the Read/Discipline Split

Clock IDs

Linux exposes multiple clock IDs through clock_gettime(2):

  • CLOCK_REALTIME — settable system-wide wall clock. Measures seconds since the Unix epoch. Can jump forward or backward when disciplined by settimeofday or NTP. Requires CAP_SYS_TIME to set.
  • CLOCK_MONOTONIC — non-settable system-wide monotonic clock. Counts from an unspecified boot-adjacent point. Cannot jump; unaffected by NTP steps; responds to frequency adjustments only. Does not include suspend time.
  • CLOCK_BOOTTIME — identical to CLOCK_MONOTONIC but includes suspended time. Non-settable. Useful for suspend-aware timers without CLOCK_REALTIME jump exposure.
  • CLOCK_TAI — non-settable clock based on wall time but counting leap seconds (TAI = International Atomic Time). Unlike CLOCK_REALTIME, it has no discontinuity on leap second insertion.

The CAP_SYS_TIME Privilege

CAP_SYS_TIME gates all operations that modify the kernel clock: settimeofday(2), stime(2), adjtimex(2)/clock_adjtime(2) when modes != 0, and setting the hardware RTC. Reading the clock — including a read-only adjtimex call with modes = 0 — requires no privilege. The clock_adjtime(2) variant (added in Linux 2.6.39) accepts an additional clk_id argument so callers can target a specific clock rather than only the system-wide realtime clock.

Concretely: any process can call clock_gettime(CLOCK_REALTIME, &ts) without privilege; only a privileged NTP daemon calls adjtimex() or clock_settime(CLOCK_REALTIME, &ts).

Lesson for capOS

This is the direct prior art for splitting WallClock (read-only cap, granted to ordinary processes) from ClockDiscipline (stronger cap, held only by the designated sync service). The Linux CAP_SYS_TIME flag is a coarse ambient privilege bit; capOS encodes the same split as two distinct capability types, with no ambient privilege required and no escalation path between them.


2. Linux Time Namespaces

What Is Namespaced

Linux time namespaces (added in Linux 5.6) let processes inside a namespace observe different values for CLOCK_MONOTONIC and CLOCK_BOOTTIME than the host. The per-namespace offsets are written to /proc/pid/timens_offsets before any process enters the namespace; once the first process has entered, writes return EACCES. The format is:

<clock-id> <offset-secs> <offset-nanosecs>

CLOCK_REALTIME is deliberately not namespaced: the kernel documentation cites “reasons of complexity and overhead” — in practice, CLOCK_REALTIME is already settable and the step/slew machinery is not per-namespace.

The offsets are pure integers (seconds + nanoseconds); there is no per-namespace frequency correction or NTP discipline within the namespace. This feature is primarily used for container checkpoint/restore (CRIU) where the monotonic clock must appear consistent before and after migration.

Lesson for capOS

Time is not an ambient global fact — it can be a per-context offset applied to a shared monotonic base. capOS’s WallClock cap fits this shape directly: the cap object holds the offset from the kernel monotonic timeline to the wall epoch, and different processes can hold caps with different offsets (timezone, test-clock injection, container clock virtualization). Freezing offsets at namespace creation maps to the capOS invariant that WallClock cannot be retroactively shifted by the holder — only ClockDiscipline can adjust the shared reference.


3. NTP Discipline: chrony and ntpd

Step vs. Slew

NTP daemons correct clock drift using two mechanisms:

  • Slew (gradual): adjust the clock frequency to converge slowly. Linux adjtime(3) / adjtimex(ADJ_OFFSET) implements slew. Default rate is bounded to 500 ppm; corrections over 0.5 seconds are clamped. This preserves monotonicity.
  • Step (abrupt): directly set the clock to the reference value. Breaks timestamp ordering for any process comparing consecutive readings across the step.

chrony makestep: makestep threshold limit allows stepping if the offset exceeds threshold seconds, but only within the first limit clock updates. For example, makestep 1.0 3 steps for offsets over 1 second during the first three updates, then slews only thereafter. A negative limit removes the update-count restriction entirely. After an initial step, chrony reverts to pure slew to protect running applications from abrupt clock changes.

Leap Second Handling (leapsecmode)

chrony supports four modes for the UTC leap second insertion:

  • system (default): the kernel steps the clock at the UTC boundary.
  • step: chronyd performs the step rather than delegating to the kernel.
  • slew: the leap second is absorbed by slewing (~12 seconds of correction at the default 500 ppm rate on Linux).
  • ignore: no automatic correction; the offset is absorbed during normal tracking.

For servers distributing time to clients unaware of leap seconds, chrony combines leapsecmode slew with smoothtime to smear the correction outward over up to 17 hours 34 minutes (when limiting slew to 1000 ppm).

Sync State Exposure

chronyc tracking reports the reference source, stratum, system time offset, frequency error, and RMS offset. chronyc sourcestats shows per-source statistics. These are the client-visible trust/sync signals that a capOS ClockProvenance would encode — the binary ntpSynced or ptpSynced flag plus an error bound.

Lesson for capOS

ClockDiscipline.step() and ClockDiscipline.slew() as distinct cap methods are justified by this split: an NTP daemon that calls step() at startup but only slew() at steady state exposes its policy at the capability boundary. Callers that need monotonic-safe time can check ClockProvenance to distinguish a recently-stepped clock from a stably-slewed one.


4. PTP / IEEE-1588: Hardware Timestamping

What PTP Provides

IEEE 1588 Precision Time Protocol synchronizes clocks using timestamps captured by NIC hardware at the Media Independent Interface (MII) boundary, typically within 100 ns of frame ingress/egress. This eliminates software scheduling jitter that limits NTP to millisecond accuracy. With hardware support, PTP achieves sub-microsecond accuracy.

Linux implements PTP through ptp4l (PTP daemon managing the protocol state machine) and phc2sys (synchronizing the hardware PTP clock to the system clock). ptp4l can configure a system as an Ordinary Clock (single port) or Boundary Clock (multi-port).

Use Cases vs. NTP

NTP is adequate for general server synchronization (sub-10 ms, typically 1–10 ms LAN, sub-ms with GPS). PTP is used where sub-microsecond accuracy is required: industrial automation, 5G RAN timing, financial trading, and audio/video bridging (AVB/TSN). The distinction is hardware timestamping support in the NIC and a local Grandmaster or GNSS-disciplined boundary clock.

Lesson for capOS

Provenance is not binary (synced vs. unsynced). The ptpSynced vs ntpSynced distinction in ClockProvenance is justified: a process requiring microsecond timestamps for audio-visual synchronization or hardware scheduling needs to distinguish PTP discipline from NTP discipline. A cap validator checking ClockProvenance before accepting a timestamp for a hard real-time claim should require ptpSynced and an error bound below the application’s tolerance.


5. Fuchsia / Zircon: UTC Clock Objects

Clock as a Kernel Object

Fuchsia models UTC time as a first-class kernel object (zx_clock_t), not as a syscall or global variable. A clock is a one-dimensional affine transformation of the monotonic reference timeline, maintained atomically and observed through typed operations.

Rights Model

Zircon clock handles carry typed rights:

  • ZX_RIGHT_READ: permits zx_clock_read() (read current time) and zx_clock_get_details() (read transformation parameters and error bound).
  • ZX_RIGHT_WRITE: permits zx_clock_update() — adjusting the clock’s absolute value, frequency (in ppm), and error bound (in nanoseconds).

Any process holding ZX_RIGHT_WRITE acts as a clock maintainer. There is no separate “maintain” right; the write right IS the maintain authority.

Monotonic option: clocks created with ZX_CLOCK_OPT_MONOTONIC reject any zx_clock_update() that would cause the clock to go backward.

Continuous option: clocks created with ZX_CLOCK_OPT_CONTINUOUS allow setting the absolute value only on the first update; subsequent absolute-value changes are rejected, allowing only frequency adjustments.

UTC Maintainer Service

All components started by Fuchsia’s Component Manager receive a UTC clock handle with read-only rights. Only the Timekeeper service receives the write handle. Timekeeper synchronizes against an RTC or a network time source and calls zx_clock_update() to discipline the UTC clock.

The UTC clock has a “backstop” guarantee: it never reports a time earlier than the timestamp of the latest build commit (the backstop value). Before Timekeeper first synchronizes, the clock may be in a fixed state (stopped at backstop) or running-but-unsynchronized state. Fuchsia documents that the UTC clock “is neither monotonic nor continuous” — Timekeeper may step it backward when corrections are needed. Callers needing a reliable timestamp must query the clock details to determine whether the clock has been synchronized.

Lesson for capOS

This is the closest capability-native precedent for capOS’s design. The mapping:

Fuchsia/ZirconcapOS
Clock kernel object with ZX_RIGHT_READ handleWallClock capability (read-only)
Clock handle with ZX_RIGHT_WRITE held by TimekeeperClockDiscipline capability (init-granted)
zx_clock_get_details() error bound and sync signalClockProvenance label on WallClock
Backstop guarantee (never before build timestamp)Provenance downgrades on suspend/resume or loss of sync
ZX_CLOCK_OPT_MONOTONIC flagThe invariant that Timer.now() monotonic base is never adjusted

The Fuchsia UTC design confirms that the right model is: one strong-authority maintainer, many read-only observers, with a typed signal for trust state. capOS extends this by making provenance an explicit labeled field on the cap rather than a query-on-demand operation.


6. Leap Seconds and Clock Steps: Smearing vs. Stepping

The Problem

UTC inserts or deletes leap seconds at irregular intervals, decided by the International Earth Rotation and Reference Systems Service (IERS). Inserting a leap second means UTC has a second labeled 23:59:60 before rolling to midnight, creating a discontinuity in POSIX time (which counts seconds without leap seconds). Deleting a leap second would mean skipping a second.

For software:

  • Stepping: CLOCK_REALTIME jumps by ±1 second at the UTC boundary. Any application comparing two CLOCK_REALTIME readings across the boundary sees a negative elapsed time (on insert) or a missing second (on delete). CLOCK_MONOTONIC must not step; it continues forward through the leap second unaffected.
  • Slewing/Smearing: the correction is distributed over a window. No discontinuity occurs, but CLOCK_REALTIME temporarily deviates from true UTC during the smear window.

Industry Smear Practice

Google has applied a 24-hour linear smear (noon-to-noon UTC) since 2008: each second in the smear window is ~11.6 µs longer than an SI second. AWS’s Amazon Time Sync Service applies the same 24-hour noon-to-noon linear smear automatically. Both services suppress the leap second indicator on their NTP responses so clients do not attempt their own step.

The smear approach means that any client synchronized to Google Public NTP or Amazon Time Sync is not tracking true UTC during the smear window — it tracks “smeared UTC”, which is coordinated but not the same as civil UTC. This is a design choice accepting brief inaccuracy for availability of monotonic-safe time.

CLOCK_MONOTONIC Must Not Jump

CLOCK_MONOTONIC is specifically designed to be immune to steps. Linux documents it as “nonsettable” — no process can set it; only frequency adjustments are permitted. The rationale: timers, timeouts, and scheduling deadlines depend on monotonic ordering. Any step in the monotonic timeline would silently break all in-flight waiters.

Lesson for capOS

The monotonic timeline (Timer.now()) must be the invariant substrate. WallClock is a separate, disciplinable offset layered on top. A ClockDiscipline.step() call adjusts the wall-clock offset without touching the monotonic base — ensuring in-flight ring timeouts and scheduler deadlines are never invalidated. The ClockProvenance.lastStep timestamp lets an auditor see when the wall clock was last stepped, so validators can reject timestamps taken during or shortly after a step if their use case requires continuity.


Applicability to capOS

Read vs. Discipline Authority

Every system surveyed maintains a hard split between reading time (no privilege required, granted to all processes) and adjusting time (strong authority, held by one designated service):

  • Linux: clock_gettime (unprivileged) vs adjtimex/CAP_SYS_TIME (privileged)
  • Fuchsia: ZX_RIGHT_READ handle (distributed to all components) vs ZX_RIGHT_WRITE handle (held only by Timekeeper)
  • chrony/ntpd: any client queries sync state; only the daemon calls adjtimex

capOS should encode this as: WallClock (read-only cap, grantable and attenuable) and ClockDiscipline (separate stronger cap, init-granted at boot, not transferable through normal cap-grant paths).

Clock Provenance as a Typed Signal

Fuchsia’s per-clock error bound and sync signal, and chrony’s tracking command, both expose metadata about trust state alongside the time value itself. capOS’s ClockProvenance label on WallClock captures this: a validator that needs trustworthy time checks provenance rather than relying on the presence of the cap alone.

The ptpSynced / ntpSynced distinction maps directly to the PTP vs NTP accuracy gap: hardware timestamping is a stronger claim than software NTP, and an OS-level audit trail needs to encode which applies.

Wall Clock as a Granted, Attenuable Cap

Linux time namespaces demonstrate that clock offsets can be virtualized per-context rather than being a single global ambient fact. capOS takes this further: WallClock is a capability object, not a process-wide environment variable. A test harness can inject a fake WallClock; a container process can receive a WallClock with a different UTC offset (timezone) without any global state change; a WASI host adapter can supply a per-instance WallClock to each wasm module without sharing a mutable global.

Step vs. Slew as Distinct Cap Methods

chrony’s makestep and leapsecmode options distinguish step (abrupt correction) from slew (rate adjustment). capOS should expose these as distinct ClockDiscipline methods so the discipline policy is explicit at the capability boundary — a sync service can be audited for whether it steps or only slews, and the ClockProvenance.lastStep field makes a step visible to downstream validators.

Monotonic Invariant Is Non-Negotiable

Every surveyed system — Linux CLOCK_MONOTONIC, Fuchsia ZX_CLOCK_OPT_MONOTONIC, chrony slew-only mode — treats monotonic ordering as inviolable. Any step in the monotonic timeline breaks in-flight timers, scheduling deadlines, and ring timeouts. capOS’s Timer.now() monotonic base must never be adjusted; only the wall-clock offset layered above it is disciplinable.

Audit Timestamps and Trusted Time

Audit log entries in capOS will carry timestamps. The ClockProvenance label on the WallClock used to generate those timestamps becomes the evidence of timestamp trustworthiness: an audit consumer can reject entries generated while provenance was unsynchronized or stepped (within a recency window after a step), rather than silently accepting timestamps of unknown reliability.

WASI Realtime Clock Mapping

WASI Preview 1 clock_time_get(CLOCKID_REALTIME) maps naturally to WallClock.wallTime(). A per-instance WASI WallClock cap — granted at module instantiation — means a wasm module receives the same read-only, provenance-labeled time view that native capOS services receive, with no special privilege and no ambient global.


Sources

Research: HPC Parallel Patterns

This note grounds the capOS proposal for generic parallel processing pattern coverage. It is not a request to port full HPC suites immediately. The point is to classify which algorithm shapes capOS benchmarks should eventually cover so future SMP, threading, runtime, storage, network, and multi-node claims do not rest only on embarrassingly parallel worker loops.

Source Set

Consequences For capOS

The current capOS CPU-scaling benchmarks are necessary but narrow. They exercise static worker partitioning, final result verification, and a small amount of spawn/join or process-wait coordination. That covers one important HPC pattern: independent tasks with a final reduction. It does not cover:

  • structured grids and stencil/halo exchange;
  • dense tiled matrix work;
  • sparse matrix and irregular memory access;
  • FFT/transposes and global all-to-all style communication;
  • graph frontier expansion and high-fanout irregular queues;
  • task graphs with dependency scheduling and cancellation;
  • collectives as first-class operations;
  • multi-node communication and authority boundaries.

The benchmark plan should therefore treat “parallel processing” as a matrix of patterns rather than a single scaling demo. A useful capOS coverage target is:

Pattern familySource precedentcapOS evidence it should force
Static map/reduceOpenMP loop/reduction, NAS EPlow-overhead thread/process creation, result aggregation, no hot-path syscalls
Dynamic task graphOpenMP tasks, Berkeley composition pointwork queues, cancellation, dependency fan-in/fan-out, scheduler fairness under uneven tasks
Stencil and halo exchangeNAS MG/BT/SP/LUshared buffers, neighbor exchange, barriers, cache locality, future network transport
Dense tiled linear algebraHPL/LINPACKcompute locality, tile scheduling, reductions, optional SIMD/library runtime support
Sparse iterative solverHPCG, NAS CGirregular memory access, sparse matrix-vector work, global dot-product reductions
FFT/transposeNAS FTall-to-all movement, temporary buffers, memory pressure, future multi-node transpose
Sort/partitionNAS ISall-to-all buckets, prefix/scan, allocator and queue pressure
Graph frontierGraph500irregular frontier queues, atomic-like visited updates, high fanout, load imbalance
Collective communicationMPI collectivesbarrier, broadcast, scatter/gather, reduce/allreduce, all-to-all semantics
Pipeline/streamBerkeley composition point, future service graphsbounded queues, backpressure, stage-local authority, telemetry

The near-term capOS subset should stay CPU-only and single-node until the selected in-process threading milestone is closed. The first expansion should add pattern kernels that reuse existing userspace/runtime mechanisms, then let future networking and storage milestones add multi-node and data-intensive variants.

Cap’n Proto Error Handling: Research Notes

Research on how Cap’n Proto handles errors at the protocol, schema, and Rust crate levels. Used as input for the capOS error handling proposal.


1. Protocol-Level Exception Model (rpc.capnp)

The Cap’n Proto RPC protocol defines an Exception struct used in three positions: Message.abort, Return.exception, and Resolve.exception.

struct Exception {
  reason @0 :Text;
  type @3 :Type;
  enum Type {
    failed @0;        # deterministic bug/invalid input; retrying won't help
    overloaded @1;    # temporary lack of resources; retry with backoff
    disconnected @2;  # connection to necessary capability was lost
    unimplemented @3; # server doesn't implement the method
  }
  obsoleteIsCallersFault @1 :Bool;
  obsoleteDurability @2 :UInt16;
  trace @4 :Text;     # stack trace from the remote server
}

The four exception types describe client response strategy, not error semantics:

TypeClient response
failedLog and propagate. Don’t retry.
overloadedRetry with exponential backoff.
disconnectedRe-establish connection, retry.
unimplementedFall back to alternative methods.

2. Rust capnp Crate (v0.25.x)

Core error types

#![allow(unused)]
fn main() {
pub type Result<T> = ::core::result::Result<T, Error>;

#[derive(Debug, Clone)]
pub struct Error {
    pub kind: ErrorKind,
    pub extra: String,  // human-readable description (requires `alloc`)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum ErrorKind {
    // Four RPC-mapped kinds (match Exception.Type)
    Failed,
    Overloaded,
    Disconnected,
    Unimplemented,

    // Wire format validation errors (~40 more variants)
    BufferNotLargeEnough,
    EmptyBuffer,
    MessageContainsOutOfBoundsPointer,
    MessageIsTooDeeplyNested,
    ReadLimitExceeded,
    TextContainsNonUtf8Data(core::str::Utf8Error),
    // ... etc
}
}

Constructor functions: Error::failed(s), Error::overloaded(s), Error::disconnected(s), Error::unimplemented(s).

The NotInSchema(u16) type handles unknown enum values or union discriminants.

std::io::Error mapping

When std feature is enabled, From<std::io::Error> maps:

  • TimedOut -> Overloaded
  • BrokenPipe/ConnectionRefused/ConnectionReset/ConnectionAborted/NotConnected -> Disconnected
  • UnexpectedEof -> PrematureEndOfFile
  • Everything else -> Failed

3. capnp-rpc Rust Crate Error Mapping

Bidirectional conversion between wire Exception and capnp::Error:

Sending (Error -> Exception):

#![allow(unused)]
fn main() {
fn from_error(error: &Error, mut builder: exception::Builder) {
    let typ = match error.kind {
        ErrorKind::Failed => exception::Type::Failed,
        ErrorKind::Overloaded => exception::Type::Overloaded,
        ErrorKind::Disconnected => exception::Type::Disconnected,
        ErrorKind::Unimplemented => exception::Type::Unimplemented,
        _ => exception::Type::Failed,  // all validation errors -> Failed
    };
    builder.set_type(typ);
    builder.set_reason(&error.extra);
}
}

Receiving (Exception -> Error): Maps exception::Type back to ErrorKind, preserving the reason string.

Server traits return Promise<(), capnp::Error>. Client gets Promise<Response<Results>, capnp::Error>.

4. Cap’n Proto Error Handling Philosophy

From KJ library documentation and Kenton Varda:

“KJ exceptions are meant to express unrecoverable problems or logistical problems orthogonal to the API semantics; they are NOT intended to be used as part of your API semantics.”

“In the Cap’n Proto world, ‘checked exceptions’ (where an interface explicitly defines the exceptions it throws) do NOT make sense.”

Exceptions: infrastructure failures (network down, bug, overload). Application errors: should be modeled in the schema return types.

5. Schema Design Patterns for Application Errors

Generic Result pattern

struct Error {
    code @0 :UInt16;
    message @1 :Text;
}

struct Result(Ok) {
    union {
        ok @0 :Ok;
        err @1 :Error;
    }
}

interface MyService {
    doThing @0 (input :Text) -> (result :Result(Text));
}

Constraint: generic type parameters bind only to pointer types (Text, Data, structs, lists, interfaces), not primitives (UInt32, Bool). So Result(UInt64) doesn’t work – need a wrapper struct.

Per-method result unions

interface FileSystem {
    open @0 (path :Text) -> (result :OpenResult);
}

struct OpenResult {
    union {
        file @0 :File;
        notFound @1 :Void;
        permissionDenied @2 :Void;
        error @3 :Text;
    }
}

Unions must be embedded in structs (no free-standing unions). This allows adding new fields later without breaking compatibility.

6. How Other Cap’n Proto Systems Handle Errors

Sandstorm

Uses the exception mechanism for infrastructure errors. Capabilities report errors through disconnection. The grain.capnp schema does not define explicit error types. util.capnp documents errors as “It will throw an exception if any error occurs.”

Cloudflare Workers (workerd)

Uses Cap’n Proto for internal RPC. JavaScript Error.message and Error.name are preserved across RPC; stack traces and custom properties are stripped. Does not model errors in capnp schema – relies on exception propagation.

OCapN (Open Capability Network)

Adopted the same four-kind exception model for cross-system compatibility. Diagnostic information is non-normative. Security concern: exception objects may leak sensitive information (stack traces, paths) at CapTP boundaries.

Kenton Varda expressed reservations about unimplemented (ambiguity about whether the direct method or callees failed) and disconnected (requires catching at specific stack frames for meaningful retry).

7. Relevance to capOS

capOS uses the capnp crate but not capnp-rpc. Manual dispatch goes through CapObject::call() with caller-provided params/result buffers. Current error handling:

  • capnp::Error::failed() for semantic errors
  • capnp::Error::unimplemented() for unknown methods
  • ? for deserialization errors (naturally produce capnp::Error)
  • Transport errors become negative CQE result codes (CAP_ERR_INVALID_REQUEST, CAP_ERR_INVALID_PARAMS_BUFFER, CAP_ERR_INVALID_RESULT_BUFFER, CAP_ERR_INVOKE_FAILED, CAP_ERR_UNSUPPORTED_OPCODE, CAP_ERR_TRANSFER_NOT_SUPPORTED, CAP_ERR_TRANSFER_ABORTED, etc.).
  • Kernel-produced CapException values are serialized into result buffers for capability-level failures (CAP_ERR_APPLICATION_EXCEPTION) and decoded by capos-rt. If the result buffer is too small to hold the serialized CapException, the CQE result is CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED instead. The per-process ringScratchLimitBytes manifest field bounds the kernel-side scratch allocation and makes this truncated path reachable for tightly constrained process profiles.

capOS extends the standard four-kind ExceptionType with a fifth variant, invalidArgument, for capability-level argument validation failures. This fifth kind has no capnp-rpc equivalent; it maps to Failed when converting back to capnp::ErrorKind for logging.

The normative schema-author rule now lives in Error Handling: CQE status is for ring/transport/kernel dispatch failure, CapException is for capability-level infrastructure failure, and schema result unions are for normal application/domain outcomes.

The capnp::Error type carries the information needed for CapException: kind maps to ExceptionType, and extra maps to message.


Sources

  • Cap’n Proto RPC Protocol: https://capnproto.org/rpc.html
  • Cap’n Proto C++ RPC: https://capnproto.org/cxxrpc.html
  • Cap’n Proto Schema Language: https://capnproto.org/language.html
  • Cap’n Proto FAQ: https://capnproto.org/faq.html
  • KJ exception.h: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/kj/exception.h
  • rpc.capnp schema: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/capnp/rpc.capnp
  • OCapN error handling discussion: https://github.com/ocapn/ocapn/issues/10
  • Cap’n Proto usage patterns: https://github.com/capnproto/capnproto/discussions/1849
  • capnp-rpc Rust crate: https://crates.io/crates/capnp-rpc
  • Cloudflare Workers RPC errors: https://developers.cloudflare.com/workers/runtime-apis/rpc/error-handling/
  • Sandstorm util.capnp: https://docs.rs/crate/sandstorm/0.0.5/source/schema/util.capnp

Research: Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web

This file summarizes Kenton Varda’s Cloudflare work and Cloudflare’s Cap’n Proto-derived RPC stack, with capOS design consequences.

capOS alignment note (2026-05-16): capOS currently uses capnp v0.25 for serialization only (wire format, no capnp-rpc). The capOS kernel is planned to become a capnp-rpc router (Design Principle 5), but capnp-rpc is not yet in use. Implications in this file that reference “remote capability proofs” or “typed Cap’n Proto RPC” describe planned/future work, not current state.

Executive Summary

Cloudflare is the most important modern production context for Cap’n Proto. Kenton Varda, the creator of Cap’n Proto, is the lead engineer for Cloudflare Workers, and Cloudflare’s Workers team is now the primary maintainer of the main C++ Cap’n Proto/KJ implementation. Cloudflare uses Cap’n Proto/KJ in the Workers runtime, Durable Objects, sandbox/supervisor and cross-machine communication, internal service bindings, and Workers RPC. Cap’n Web is a separate JavaScript-native sibling protocol inspired by Cap’n Proto rather than a Cap’n Proto/KJ-based runtime system.

The main capOS lessons are:

  • Typed Cap’n Proto RPC is a practical production bridge for systems written in Go/Rust/C++ and JavaScript, not merely historical prior art.
  • Object-capability RPC can be a normal developer-facing API, not just a kernel/protocol mechanism. Workers RPC and Cap’n Web both expose object references, functions, promise pipelining, and capability-style security.
  • Production systems distinguish the core runtime from the full security product. workerd is open source and capability-shaped, but Cloudflare warns that it is not by itself a complete secure sandbox.
  • Cap’n Proto RPC remains resource-exhaustion-sensitive. capOS must add its own quota/resource-ledger discipline at every remote-capability boundary.
  • Cap’n Web shows a separate web-facing branch of the same design family: schema-free, JSON-based, TypeScript-friendly, HTTP/WebSocket/postMessage transports, but still object-capability RPC with promise pipelining.

Source Map

Primary sources read:

  • Kenton Varda’s Cloudflare author archive, used as the inventory of his posts.
  • Cap’n Proto FAQ and Cap’n Proto 0.9 release notes.
  • Cloudflare blog posts:
    • “Durable Objects in Dynamic Workers: Give each AI-generated app its own database”
    • “Sandboxing AI agents, 100x faster”
    • “Code Mode: the better way to use MCP”
    • “Introducing workerd: the Open Source Workers runtime”
    • “We’ve added JavaScript-native RPC to Cloudflare Workers”
    • “Why Workers environment variables contain live objects”
    • “Building Cloudflare on Cloudflare”
    • “Cap’n Web: a new RPC system for browsers and web servers”
    • “Eliminating Cold Starts 2: shard and conquer”
    • “Zero-latency SQLite storage in every Durable Object”
    • “Durable Objects: Easy, Fast, Correct – Choose three”
    • “Dynamic Process Isolation: Research by Cloudflare and TU Graz”
    • “Mitigating Spectre and Other Security Threats: The Cloudflare Workers Security Model”
    • “Introducing lua-capnproto: better serialization in Lua”
  • Cloudflare developer docs:
    • Workers RPC
    • Workers RPC visibility/security model
    • Durable Objects overview
    • Dynamic Workers bindings

Cloudflare and Kenton Varda

The Cap’n Proto FAQ says Cloudflare Workers is led by Kenton Varda and that Workers heavily uses Cap’n Proto. It also says the Cloudflare Workers team is now the primary developer and maintainer of Cap’n Proto’s main C++ implementation.

The Cap’n Proto 0.9 release notes state that Cap’n Proto development had become primarily driven by Cloudflare Workers. At that point Workers had already moved from mostly using KJ to heavily using Cap’n Proto RPC for Durable Objects.

For capOS, Cloudflare is strong evidence that:

  • Cap’n Proto RPC is still a living system, not only Sandstorm-era history.
  • KJ’s async/runtime design matters because it is deployed in Workers.
  • Cap’n Proto’s object-capability RPC model is compatible with large-scale production infrastructure, but only with additional platform hardening and resource controls.

The full author archive also includes posts that are not Cap’n Proto-specific but are relevant to capOS architecture:

  • live object bindings as capability-shaped environment entries
  • Workers security and Spectre mitigation
  • Dynamic Process Isolation research with TU Graz
  • Durable Objects as single-threaded colocated compute/storage actors
  • Dynamic Workers for fast disposable isolate sandboxes
  • Code Mode for having agents write code against typed APIs rather than emit direct tool calls

workerd

workerd is Cloudflare’s open-source JavaScript/Wasm runtime, sharing most of the code that powers Cloudflare Workers. It is designed as a server runtime for Workers-compatible applications, local testing, and programmable proxy use cases.

Cloudflare explicitly warns that workerd alone is not a secure sandbox for untrusted code. The Workers service adds environment-specific hardening, including V8 patch automation, risk-profile separation, kernel features, and resource-limit enforcement. The project is also not an independent governance surface: Cloudflare Workers priorities drive the repository, and internal interfaces may churn.

capOS implications:

  • Borrow the shape, not the whole product. A capOS userspace JavaScript/Wasm runtime can learn from workerd, but must not assume workerd alone provides OS-grade isolation.
  • Treat runtime internals as unstable unless pinned. If capOS embeds or adapts workerd, the trusted-build-input and upgrade policy must account for churn.
  • Keep the capOS kernel/resource model as the isolation and quota authority; runtime-level object capabilities are an additional layer.

Live Object Bindings

Cloudflare Workers environment variables are not only strings. Bindings are live objects scoped to a specific Worker’s env parameter. A Worker cannot reach a protected service by guessing a URL or global name; it must hold the binding object. The post explicitly compares this to capability-based security: the binding designates a resource and confers permission to access it, is not in a global namespace, and must be invoked explicitly.

The post also notes that current Workers bindings are not a complete capability system because ordinary bindings are not generally passed dynamically between Workers yet, though future dynamic bindings are discussed.

capOS implications:

  • This strongly supports capOS’s bootstrap CapSet and broker-issued bundle model: authority should arrive as live objects/caps, not ambient service names plus bearer tokens.
  • It also supports treating the environment/capset as an explicit function parameter rather than global state. This preserves composition and testability.
  • It reinforces the policy that remote protocol fields, URLs, and names should not become authority by themselves.

Durable Objects

Durable Objects are Cloudflare’s single-threaded stateful actor-like compute units colocated with durable storage. Public posts describe them as an approach where code runs where the data is stored, often in the same thread as embedded SQLite, avoiding network storage latency. Earlier Durable Objects posts focus on race avoidance and correctness: single-threaded object execution makes it natural to keep state in memory while serializing operations that touch the same object.

Dynamic Worker Facets extend this idea to Dynamic Workers: generated code can run in a disposable isolate while using a per-app Durable Object facet with its own isolated SQLite-backed storage.

capOS implications:

  • Durable Objects are strong prior art for capOS service objects that combine single-threaded state, colocated storage, and actor-style request handling.
  • For paper-scoped persistence, a minimal service-owned store/object proof is closer to this model than to a global filesystem.
  • For hosted agents, per-task or per-app isolated storage facets are a useful pattern, but the storage capability must remain broker-issued and revocable.

Workers Security, Spectre, and Dynamic Process Isolation

Kenton’s Workers security posts separate API-level capability design from execution isolation. Workers uses V8 isolates for density, but wraps them with cordons, process isolation for higher-risk cases, Linux sandboxing, supervisor processes, local proxy mediation, V8 patch discipline, and timing/Spectre mitigations. Dynamic Process Isolation research with TU Graz addresses the harder Spectre isolation problem when many tenants share isolate infrastructure.

The “Sandboxing AI agents, 100x faster” post reuses this isolate foundation for Dynamic Workers: fast disposable isolates for AI-generated code, rather than heavyweight containers. The post emphasizes speed and density, but the security claim depends on the broader Workers platform, not a bare runtime library.

capOS implications:

  • Capability-shaped APIs are not a substitute for execution isolation. capOS should continue treating page tables/processes, resource ledgers, and future sandboxing as separate security layers.
  • If capOS runs AI-generated code, isolate-style fast startup is attractive, but the capOS trust boundary must include side-channel and resource controls.
  • For browser/Wasm proposals, Workers is evidence that isolates can scale, but also that Spectre/timing mitigations are first-order design constraints.

Cloudflare’s Cap’n Proto Uses

Cloudflare’s public sources describe several Cap’n Proto uses:

  • Workers runtime implementation: Cap’n Proto and KJ are core implementation pieces.
  • Sandbox/supervisor and cross-machine/datacenter communication.
  • Durable Objects: Cap’n Proto RPC is heavily used for communication in the system.
  • Internal Workers: Cloudflare added Cap’n Proto RPC bindings so internal Workers can call services such as Quicksilver, DNS, and DoS-protection systems. Schemas are bundled with the Worker at publication time, and the runtime converts JavaScript data to/from Cap’n Proto.
  • Worker sharding/cold-start reduction: cross-instance communication in the Workers runtime uses Cap’n Proto RPC, including capabilities to lazily loaded local Worker instances.
  • Older Cloudflare infrastructure: Cloudflare wrote lua-capnproto and used Cap’n Proto in logging/analytics pipelines before Kenton joined.

capOS implications:

  • A typed Cap’n Proto RPC bridge is a credible first remote-capability proof: Cloudflare uses schema-bundled service calls from JavaScript into internal Go/Rust services.
  • Lazy capabilities are useful for cold-start and placement problems. A remote cap may represent a lazily created service, but invocation must still be explicit and resource-accounted.
  • The capOS “capability proxy” should be framed as a service with explicit listen/connect authority, schema selection, and resource budgets, not a generic kernel network mode.

Workers JavaScript-Native RPC

Cloudflare Workers RPC lets Workers and Durable Objects communicate by calling methods on JavaScript classes exposed through bindings. It is built on Cap’n Proto but removes schema boilerplate from the JavaScript developer surface.

Key properties:

  • Calls are asynchronous regardless of whether the server method was declared async.
  • Parameters and return values can include structured-clonable data.
  • Functions and objects can be passed by reference; the receiver gets a stub and later calls back to the original location.
  • Calls to service bindings often stay in the same thread, reducing local RPC overhead dramatically.
  • When calls cross the network, promise pipelining lets dependent calls on a returned object travel in one round trip.
  • Security is object-capability based: a side can only invoke objects/functions for which it has received a stub.

capOS implications:

  • It is reasonable for capOS to expose developer-friendly language bindings above typed capability transport. The kernel ABI should stay narrow, but userspace runtime APIs can feel like method calls on local objects.
  • Promise pipelining is not optional polish for object-style APIs over latency. Cloudflare documents it as the mechanism that prevents API designs from collapsing into coarse ad hoc batch methods.
  • A local fast path matters. RPC calls that stay within one scheduler/runtime context should avoid unnecessary network-shaped overhead.

Code Mode and Agents

Kenton’s Code Mode post argues that agents should often write code against a typed API rather than emit raw tool calls directly. The Cloudflare claim is that MCP is useful as an API discovery/connection layer, but complex workflows are better expressed as code that calls a TypeScript API. This reduces token flow through the model when chaining operations and lets normal language tooling carry structure.

capOS implications:

  • This supports the capOS hosted-agent direction: present capability-scoped tools as typed APIs and let agents compose them in code under a sandbox, instead of exposing broad stringly tool surfaces directly to the model.
  • Approval gates should wrap the capability/API boundary, not be hidden inside prompt text.
  • Promise pipelining and object references may reduce tool-call latency, but only after authority and review gates are preserved.

Cap’n Web

Cap’n Web is a 2025 Cloudflare RPC protocol and TypeScript implementation by Kenton Varda and Steve Faulkner. It is explicitly described as a spiritual sibling to Cap’n Proto for the web stack.

Design differences from Cap’n Proto:

  • no Cap’n Proto schemas
  • JSON plus preprocessing for special values instead of Cap’n Proto binary encoding
  • TypeScript-friendly APIs
  • HTTP, WebSocket, and postMessage() transports
  • small dependency-free browser/server package

Shared design lineage:

  • object-capability RPC
  • bidirectional calls
  • functions and objects passed by reference
  • promise pipelining
  • capability-based security patterns
  • import/export tables for pass-by-reference objects

Cap’n Web also introduces a web-specific .map()-style pipelining feature that records a restricted non-Turing-complete instruction set derived from pipelined calls, addressing a GraphQL-like “waterfall” case.

capOS implications:

  • Cap’n Web is useful prior art for browser-hosted capOS experiments or web admin clients, not for the kernel ABI.
  • Schema-free RPC trades away capOS’s current “schema is permission surface” discipline. It may fit JavaScript/web adapters, but core capOS services should remain typed and schema-governed unless a proposal explicitly accepts the runtime-validation burden.
  • HTTP batch mode and broken references after batch completion are useful patterns for paper-scoped network-transparency proofs: short-lived remote caps can have explicit lifetime boundaries.

Security and Resource Warnings

Important warnings from primary sources:

  • Cap’n Proto’s serialization layer is intended to be safe against malicious bytes, but the reference implementation has not had a formal security review.
  • Cap’n Proto RPC is designed for mutually distrusting parties, but the FAQ warns that it is not robust against resource exhaustion attacks.
  • Cap’n Proto does not provide encryption by itself; use an encrypted transport such as TLS.
  • workerd is not a complete sandbox for malicious code without Cloudflare’s surrounding platform hardening.
  • Cap’n Web/Workers TypeScript surfaces do not automatically enforce runtime type checks merely because TypeScript types exist.

capOS implications:

  • Every remote-capability proposal must include resource ledgers for table entries, queued calls, queued bytes, streams, retries, and live objects.
  • The first capOS remote-capability proof should validate failure behavior: disconnect, overload, broken refs, stale refs, and malformed payloads.
  • Treat TypeScript or schema-free web adapters as convenience layers that require runtime validation at the trust boundary.
  • Encryption/authentication is a transport requirement, not something Cap’n Proto RPC gives for free.

Design Consequences for capOS

  1. The first external capability proxy should be typed and schema-bundled, closer to Cloudflare’s internal Worker-to-service Cap’n Proto RPC bindings than to full OCapN/CapTP compatibility.
  2. Developer ergonomics can improve above the transport: object stubs, language-native async calls, and promise pipelining are legitimate runtime APIs.
  3. Keep the kernel/user ABI and core service contracts schema-first. Cap’n Web is compelling for web-facing clients, but its schema-free design does not replace capOS’s typed authority model.
  4. Promise pipelining should be designed as a core performance and authority feature, not as an optional batching trick.
  5. Remote cap lifetimes need explicit scopes. HTTP batch-style broken refs, session-scoped refs, and disconnect-driven broken promises are all useful precedents.
  6. Resource exhaustion must be solved by capOS, not delegated to Cap’n Proto.
  7. Runtime isolation remains an OS responsibility. A language runtime can be capability-oriented while still needing kernel/VM/sandbox containment.

Sources

  • Cap’n Proto FAQ: https://capnproto.org/faq.html
  • Cap’n Proto 0.9 release notes: https://capnproto.org/news/2021-08-14-capnproto-0.9.html
  • Kenton Varda author archive: https://blog.cloudflare.com/author/kenton-varda/
  • Durable Objects in Dynamic Workers: https://blog.cloudflare.com/durable-object-facets-dynamic-workers/
  • Sandboxing AI agents, 100x faster: https://blog.cloudflare.com/dynamic-workers/
  • Code Mode: the better way to use MCP: https://blog.cloudflare.com/code-mode/
  • Introducing workerd: the Open Source Workers runtime: https://blog.cloudflare.com/workerd-open-source-workers-runtime/
  • We’ve added JavaScript-native RPC to Cloudflare Workers: https://blog.cloudflare.com/javascript-native-rpc/
  • Why Workers environment variables contain live objects: https://blog.cloudflare.com/workers-environment-live-object-bindings/
  • Building Cloudflare on Cloudflare: https://blog.cloudflare.com/building-cloudflare-on-cloudflare/
  • Cap’n Web: a new RPC system for browsers and web servers: https://blog.cloudflare.com/capnweb-javascript-rpc-library/
  • Eliminating Cold Starts 2: shard and conquer: https://blog.cloudflare.com/eliminating-cold-starts-2-shard-and-conquer/
  • Zero-latency SQLite storage in every Durable Object: https://blog.cloudflare.com/sqlite-in-durable-objects/
  • Durable Objects: Easy, Fast, Correct – Choose three: https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/
  • Dynamic Process Isolation: Research by Cloudflare and TU Graz: https://blog.cloudflare.com/spectre-research-with-tu-graz/
  • Mitigating Spectre and Other Security Threats: The Cloudflare Workers Security Model: https://blog.cloudflare.com/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/
  • Introducing lua-capnproto: better serialization in Lua: https://blog.cloudflare.com/introducing-lua-capnproto-better-serialization-in-lua/
  • Workers RPC docs: https://developers.cloudflare.com/workers/runtime-apis/rpc/
  • Workers RPC visibility and security model: https://developers.cloudflare.com/workers/runtime-apis/rpc/visibility/
  • Durable Objects overview: https://developers.cloudflare.com/durable-objects/concepts/what-are-durable-objects/
  • Dynamic Workers bindings: https://developers.cloudflare.com/dynamic-workers/usage/bindings/

Research: Spritely, OCapN, and CapTP

Research note last checked 2026-05-16. This file records the related specifications, protocols, and design principles behind Spritely’s OCapN/CapTP work and translates them into capOS design consequences. It intentionally summarizes the specifications rather than copying them; the upstream documents are draft standards and should remain the source of truth.

Executive Summary

Spritely’s most relevant contribution for capOS is not a single library. It is a coherent model for secure distributed object programming:

  • Authority is an unforgeable object reference. If a peer was not handed a reference, it cannot use the object.
  • Object references can cross a network without turning into global names or ACL checks. References remain local session table entries, generally cheap integer positions between the two peers that share a CapTP session.
  • Networking is explicit at the implementation boundary but mostly absent from application object design. A program can pass references and send messages to asynchronous objects without inventing a bespoke protocol for every service.
  • Latency is handled by promise pipelining. Dependent messages can be sent to the eventual result of an earlier message before the earlier message settles.
  • Resource lifetime is part of the protocol. CapTP includes cooperative distributed garbage collection for exported references and answer promises.
  • Third-party handoffs solve the hard case where A gives B a reference to an object hosted by C without making A a permanent proxy.

The design is close enough to capOS’s schema-as-ABI direction to matter: capOS already treats typed Cap’n Proto interfaces as authority boundaries and has reserved ring fields for future promise pipelining. OCapN/CapTP gives a prior-art shape for the next network-transparent capability layer, but the current OCapN documents are still drafts and should not be adopted as frozen wire compatibility commitments.

Source Map

Primary sources read:

  • Spritely Institute:
    • “What is CapTP, and what does it enable?”
    • “Introducing OCapN, interoperable capabilities over the network”
    • “The Heart of Spritely: Distributed Objects and Capability Security”
    • Spritely Goblins 0.18.0 release notes
    • Guile Goblins manual sections for OCapN and CapTP
  • OCapN draft specifications:
    • CapTP Specification.md
    • Model.md
    • Netlayers.md
    • Locators.md
    • Syrup repository and draft specification material
  • Related lineage and implementations:
    • Cap’n Proto RPC protocol documentation
    • Endo @endo/ocapn documentation
    • E / CapTP lineage as summarized by Spritely, OCapN, and Cap’n Proto docs

The OCapN draft repository HEAD observed during this pass was 18400d8508fb67467da6d659412ae19c27b0cd08. The Syrup repository HEAD observed was 931fa528b8ddda976febba577fb09ee0726845d4.

Current Status

The old spritelyproject.org site is historical. Active project material is now under the Spritely Institute site and files.spritely.institute.

OCapN is not yet a final standard. The draft specs explicitly warn that they are likely to change significantly. Spritely’s 2026-04-21 Goblins 0.18.0 release is a useful data point: it changed OCapN protocol details, removed the old op:deliver-only operation, renamed GC operations to plural batched forms, and bumped the protocol version incompatibly with earlier Goblins releases.

For capOS this means:

  • Use OCapN/CapTP as design grounding, not as a frozen ABI.
  • Avoid promising wire-level OCapN compatibility until a concrete version is selected and a test-suite target exists.
  • Keep capOS’s own ring and schema ABI evolution policy independent from OCapN draft churn.

Spritely System Model

Spritely Goblins is a distributed object programming environment. Its core objects are actors. Actors live in vats/actormaps, receive messages, and may evolve by returning replacement behavior rather than mutating global ambient state. The programming model is explicitly object-capability based: references are authority, and authority flows by ordinary reference passing.

Spritely adds several properties that are relevant to OS design:

  • Transactional turns. Local synchronous object updates happen in turns that can roll back on failure. This keeps partial state updates from becoming visible after an exception.
  • Asynchronous references and promises. The same programming model handles local and remote asynchronous objects.
  • Persistence and sleeping actors. Goblins can persist actor state and, in 0.18.0, optionally evict actors from the hot cache while retaining live references that wake them on demand.
  • Distributed debugging and time travel. Spritely treats deterministic turns and persistent state as debugging tools, not only durability features.

capOS should not copy Goblins’ language runtime shape into the kernel. The usable lesson is the boundary: keep kernel capability objects small and typed, while allowing userspace runtimes to build richer object, promise, rollback, and persistence semantics above them.

Object-Capability Principles

The Spritely/OCapN material uses classic object-capability principles:

  • No ambient authority. Code begins without dangerous authority and gains power only through values it is passed.
  • Designation is authorization. The reference both names the object and grants the right to invoke it.
  • Attenuation by wrapping. A narrower object can hold a broader object and expose only a smaller method surface or policy-filtered behavior.
  • Revocation by indirection. A revoker can sit between holder and target and later stop forwarding.
  • Accountability by explicit relationship. Authority flow is visible as graph edges between objects, not hidden inside a global namespace.
  • Mutual suspicion. A remote peer is not trusted just because the transport is authenticated; it is treated as a potentially adversarial object holding only the capabilities it has received.

This matches capOS’s existing direction: typed interfaces define permission surfaces, and narrower capabilities are preferable to broad rights bitmasks attached to generic handles.

OCapN Protocol Suite

OCapN is a suite, not just CapTP. The important layers are:

LayerRolecapOS relevance
OCapN ModelAbstract passable value model shared across languages.Defines which values can cross a capability-network boundary.
SyrupCanonical binary serialization used by current OCapN drafts.Useful for signed certificates and dynamic interop, but not a replacement for capOS’s Cap’n Proto schema ABI.
LocatorsPeer and sturdyref identity syntax.Prior art for durable object references and bootstrap URIs.
NetlayersTransport abstraction for secure ordered channels.Strong precedent for separating object protocol from TCP/TLS/Tor/libp2p/etc.
CapTPSession protocol for messages, promises, GC, and handoffs.Directly informs future network-transparent capability invocation.
Test suiteInteroperability tests for implementations.capOS should not claim OCapN compatibility without passing a selected suite/version.

OCapN Data Model

The model draft defines passable values as atoms, containers, references, and errors.

Atoms:

  • Undefined
  • Null
  • Boolean
  • arbitrary precision signed Integer
  • IEEE 754 Float64
  • Unicode String
  • ByteArray
  • Symbol

Containers:

  • List
  • unordered string-keyed Struct
  • Tagged, a tag string plus one value

References:

  • Target, an object reference that can receive messages
  • Promise, a pending eventual value that can queue messages

Errors are still unsettled in the model draft. The draft preserves only the coarse requirement that an error round trip as an error. This is weaker than capOS’s desired error-layer split, so capOS should keep its local rule: transport status in CQEs, capability infrastructure failure in CapException, and domain outcomes in schema result unions.

The model also defines pass invariants. The important one for capOS is that remote passage should preserve type, and for most values preserve a specified equality relation when values leave and later return. Promises and errors are special: promises preserve type but not identity equality, and error semantics are deliberately not settled yet.

Syrup

Syrup is a canonical binary serialization format used by OCapN drafts. It is inspired by canonical s-expressions and bencode. It supports booleans, integers, floats, byte strings, strings, symbols, lists, dictionaries/structs, records, and sets. Its important property for CapTP is canonicalization: unordered collections are emitted in a deterministic order, so serialized bytes can be signed and verified consistently.

This matters most for OCapN handoff certificates. A signed envelope signs the canonical serialized form of a CapTP object, so implementations need byte-stable encoding.

capOS implications:

  • Keep using Cap’n Proto for typed capOS ABIs and kernel/userspace messages.
  • Treat Syrup as an interop codec for a future OCapN bridge, not as the native kernel ring format.
  • If capOS implements OCapN handoffs, canonical serialization becomes part of the trusted boundary. Fuzzing and cross-implementation test vectors would be mandatory.

Locators and Sturdyrefs

OCapN locators represent peers and durable object entry points.

A peer locator contains:

  • transport: the netlayer name
  • designator: usually a key or other netlayer-defined identity
  • hints: optional routing data

Only transport and designator identify the peer for comparison. Hints can help connection setup but do not define identity.

A sturdyref locator contains:

  • a peer locator
  • a swiss-num, a secret-ish object token used to fetch a specific object from that peer’s bootstrap object

URI forms include:

ocapn://<designator>.<transport>
ocapn://<designator>.<transport>/s/<swiss-num>

The draft states that a sturdyref should be treated as a capability: the locator plus swiss number is enough to try to obtain the object reference.

capOS implications:

  • A future durable capOS network reference must not be confused with a local CapId. Local cap slots, generations, receiver selectors, session ids, and kernel object pointers are not portable authority.
  • If capOS adds sturdyrefs, they belong in a userspace naming/storage authority or broker, not in the kernel cap table.
  • Hints must never become security identity. They are routing metadata only.
  • Swiss-number strength and storage policy are security-critical; weak or enumerable swiss numbers would become bearer-token vulnerabilities.

Netlayers

OCapN netlayers are the transport interface underneath CapTP. A compliant netlayer provides:

  • bidirectional message transmission
  • delivery while the session remains active
  • in-order receipt
  • security against third-party message insertion

Encryption and reachability are desirable and often necessary, but the netlayer draft distinguishes required session integrity from optional transport properties. The Tor Onion netlayer is documented in the draft; Spritely Goblins has historically emphasized Tor, while OCapN discussions also mention TCP/TLS, WebSocket, libp2p, IBC, I2P, Unix sockets, and other transports.

capOS implications:

  • Follow the OCapN split: object protocol above transport authority.
  • Represent listen/connect authority as explicit capabilities, as capOS already does for narrowed TCP listener authority.
  • Bind peer identity to the netlayer’s authenticated designator, not to DNS names, host strings, or untrusted hints.
  • Treat reconnect and disconnect as first-class protocol states. All remote capabilities served by a severed session must fail closed or become broken promises.

CapTP Session Establishment

A CapTP session is pairwise. It runs over a reliable ordered netlayer channel. The draft session setup:

  • establish a secure channel out of band or as part of a handoff
  • create a per-session cryptographic key pair
  • exchange op:start-session
  • verify the remote session start message
  • export a bootstrap object at position 0

The bootstrap object conventionally supports:

  • fetch, to fetch an object by swiss number
  • deposit-gift, for third-party handoffs
  • withdraw-gift, for third-party handoffs

capOS implications:

  • A remote session needs an explicit session object with state, cryptographic identity, import/export tables, answer table, handoff table, and disconnect state.
  • Bootstrap authority should be narrow. A peer’s bootstrap object is the initial remote authority root and should expose only intended fetch/handoff behavior.
  • A future capOS OCapN bridge should make protocol version negotiation and feature gating explicit because upstream OCapN has already changed incompatibly.

CapTP References and Descriptors

CapTP references are represented by descriptors whose integer positions have meaning only within a single session. The key descriptor families are:

  • desc:import-object: the receiver is importing an object at a position
  • desc:import-promise: the receiver is importing a promise at a position
  • desc:export: refer to an object/promise already exported by the receiving side
  • desc:answer: refer to a promise created by a previous answer position
  • desc:sig-envelope: signed wrapper over a canonical serialized CapTP object
  • desc:handoff-give: gift certificate from gifter to receiver
  • desc:handoff-receive: receiver certificate used to redeem a gift

The subtle convention is perspective: descriptors describe references from the receiver’s side of the session. This keeps pairwise table entries small but requires careful implementation.

capOS implications:

  • Do not serialize process-local cap ids across a network.
  • Network references need a separate table keyed by session-local import/export position and generation or epoch.
  • Descriptor direction needs tests. Perspective errors here become authority leaks or denial of service bugs.

CapTP Operations

The current CapTP draft includes operations in these groups:

  • session lifecycle: op:start-session, op:abort
  • delivery: op:deliver
  • promise observation/resolution: op:listen, promise resolver fulfill and break behavior
  • promise pipelining and extraction: op:get, op:index, op:untag
  • cooperative GC: op:gc-exports, op:gc-answers
  • handoff bootstrapping through bootstrap methods: deposit-gift, withdraw-gift

op:deliver-only should be treated as stale for current research because Goblins 0.18.0 and the current draft dropped it in favor of op:deliver.

capOS implications:

  • capOS’s reserved pipeline_dep / answer-id style fields should be evaluated against CapTP’s answer table model.
  • op:get, op:index, and op:untag show that pipelining is not only “call a method on a promised object”; it can also project a reference out of an eventual container without transmitting irrelevant intermediate values.
  • Batched GC operations are an important shape for avoiding per-reference chatter.

Promise Pipelining

Promise pipelining is the latency-critical idea shared by E, Cap’n Proto RPC, Agoric/Endo, and OCapN. If a call returns a promise for an object, the caller can immediately send follow-on messages to the promised result. The receiver queues or forwards those messages when the promise resolves.

This preserves object-shaped interfaces in high-latency networks. Without pipelining, developers tend to collapse clean object graphs into singleton services with path strings or ad hoc batching APIs, weakening both design and authority boundaries.

capOS implications:

  • Promise pipelining is a Tier-1 paper evidence candidate in docs/roadmap.md; this research reinforces that priority.
  • Pipelining should target result-cap/answer namespaces, not caller-selected global ids.
  • Broken promises must propagate failure to dependent calls. Silent drops would violate caller expectations and leak resources.
  • Pipelined calls must remain bounded by resource ledgers: answer table slots, queued message bytes, queued call count, and per-session memory all need caps.

Distributed Garbage Collection

CapTP uses cooperative distributed GC for references exported across a session. At a high level:

  • When a reference is exported, the exporting side keeps it alive on behalf of the importing side.
  • The importer tracks how many times it received the reference.
  • When the importer no longer needs the reference, it sends batched GC deltas.
  • The exporter decrements its per-session reference count and may reclaim once the count reaches zero.
  • Answer promises also have explicit op:gc-answers cleanup so answer positions can be reused.

The Goblins docs call this acyclic distributed GC. Cycles spanning machines are not automatically collected in the deployed Guile Goblins path.

capOS implications:

  • Network reference release must be explicit and idempotent under disconnect and retry conditions.
  • Reference accounting must have one ledger of record per session. Parallel counters in transport, object proxy, and app layers would be unreviewable.
  • A capOS bridge should not rely on distributed cycle collection. Design protocols so remote cycles are either impossible, bounded by lease/session lifetime, or broken by explicit revocation.
  • Disconnect should conservatively release exports owned solely by the session and break unresolved imports/promises.

Third-Party Handoffs

Third-party handoffs solve the case where A has a reference to an object hosted by C and sends that reference to B. A should not need to proxy every future call from B to C, and B should not gain arbitrary authority at C. The OCapN draft uses certificate-style gifts.

Roles:

  • Gifter: the peer sharing a reference it holds
  • Receiver: the peer receiving that reference
  • Exporter: the peer hosting/exporting the referenced object

Protocol shape:

  • The gifter deposits a gift with the exporter’s bootstrap object.
  • The gifter sends the receiver a signed desc:handoff-give.
  • The receiver validates what it can, connects to the exporter if needed, and sends a signed desc:handoff-receive to withdraw the gift.
  • The exporter verifies signatures, session ids, receiver binding, and replay protection, then fulfills the receiver’s promise with the gifted reference.

Security properties to preserve:

  • The gift is designated to a specific receiver session identity.
  • The exporter must reject invalid signatures or replayed handoff counts.
  • The handoff can complete whether deposit or withdrawal arrives first.
  • Unauthorized peers that observe messages should not be able to redeem the object reference.

capOS implications:

  • Handoffs are the correct precedent for cross-session capability transfer. Avoid proxy-only designs as the permanent architecture.
  • A capOS implementation needs persistent in-flight handoff state with bounded memory and expiry.
  • The replay counter/nonce table is security-sensitive. It should be scoped by exporter-receiver session and garbage collected with the session.
  • Handoff certificates should be opaque to ordinary applications unless a debugging authority is explicitly granted.

Error Propagation

OCapN and CapTP allow promises to break with an error value, but the data model has not converged on a rich normative error structure. The CapTP draft warns that transmitting exception details or backtraces can leak sensitive data.

capOS implications:

  • Keep the capOS error-layer split. OCapN errors should map into CapException or schema-level results only through a deliberate adapter.
  • Strip or seal debug details at network boundaries by default.
  • Treat remote error text as untrusted input. It is diagnostic material, not an authority decision input.

Security Risks and Failure Modes

Important risks found in the source material:

  • Spec churn. OCapN is draft/pre-standardization and has changed incompatibly.
  • Resource exhaustion. Goblins docs state that CapTP does not solve memory usage or resource management by itself.
  • Acyclic-only GC. Cycles between servers are not automatically reclaimed in current Goblins’ practical model.
  • Peer-wide trust boundary. Even if CapTP routes to specific objects, a malicious remote peer can collude internally. Treat the peer as a single adversarial object with the authority surface of all references it holds.
  • Signing-oracle bugs. Goblins 0.18.0 fixed a signing oracle vulnerability in a WebSocket netlayer designator-authentication path. This is a concrete reminder that handoff/netlayer signing APIs need strict domain separation.
  • Debug info leakage. Broken promises or exceptions can accidentally expose paths, stack traces, or internal object topology.
  • Replay and stale-reference bugs. Handoff counts, session ids, export positions, and answer positions require generation/reuse discipline.

capOS mitigations:

  • Version every network protocol boundary.
  • Bound every per-session table and queue with resource ledgers.
  • Domain-separate all signatures by protocol label, session id, role, and operation kind.
  • Fuzz canonical codec parsing and descriptor validation.
  • Add negative tests for stale answer positions, stale export positions, replayed handoffs, mismatched receiver keys, malformed locators, and disconnect during handoff.

Relationship to Cap’n Proto

Cap’n Proto RPC is a close relative rather than the same protocol:

  • It is schema-first and statically typed.
  • Interface references are first-class capabilities.
  • Promise pipelining is central.
  • Persistent capabilities and three-way interactions are defined as higher protocol levels.
  • Cap’n Proto RPC deliberately does not make remote calls look like local blocking calls; the API exposes promises and network failure.

capOS already uses Cap’n Proto for schemas and serialization, but not full capnp-rpc. OCapN’s dynamic model is useful for language-agnostic distributed objects; Cap’n Proto remains the better fit for capOS’s typed ABI and generated interface surface.

The practical direction for capOS:

  • Keep local kernel/userspace ABI fixed-layout where needed and Cap’n Proto schema-shaped at service boundaries.
  • Learn from OCapN’s session, handoff, locator, and GC machinery.
  • Do not replace typed schemas with untyped dynamic symbols unless building an explicit OCapN bridge.

Relationship to Agoric and Endo

Agoric and Endo continue the E-language object-capability lineage in hardened JavaScript. Endo’s @endo/ocapn docs describe a tentative OCapN implementation with layers for client/session management, CapTP dispatch and slot management, codecs, and netlayers. The package is explicitly a work in progress and treats OCapN as a moving target.

This independently validates the same architectural split:

  • object/capability semantics
  • session/slot management
  • canonical codec
  • netlayer abstraction
  • higher-level client API for sturdyrefs and handoffs

For capOS, that split is more important than JavaScript-specific APIs.

CapOS Design Consequences

  1. Keep CapId local. Never serialize local cap table ids, endpoint generations, receiver selectors, or kernel session ids as portable network authority.
  2. Treat remote references as session-local imports/exports with explicit generation/reuse rules.
  3. Put sturdyrefs and durable fetch authority in userspace naming/storage services, not in the kernel cap table.
  4. Keep network transport authority separate from object authority. A process may hold permission to listen/connect without holding permission to fetch a particular remote object, and vice versa.
  5. Implement promise pipelining through answer/result-cap namespaces. Avoid path-string singleton APIs created only to hide latency.
  6. Bound all per-session state: exports, imports, answers, queued pipelined deliveries, handoff gifts, handoff replay counters, incoming message bytes, and pending reconnects.
  7. Make disconnect semantics explicit. Remote refs become disconnected/broken, not silently retrying with ambient authority.
  8. Strip or seal diagnostic errors crossing a remote boundary.
  9. Use canonical serialization only where signatures require it. Do not move the kernel ring to Syrup.
  10. Defer OCapN compatibility claims until capOS targets a specific draft, version negotiation, and test suite.

Open Questions for capOS

  • Should capOS expose an OCapN bridge as a userspace service that maps OCapN targets to local typed Cap’n Proto capabilities, or should it first implement a Cap’n Proto RPC bridge for typed external clients?
  • What is the narrowest promise-pipelining proof that advances the paper track: local ring answer pipelining, capnp-rpc-compatible pipelining, or OCapN-like answer descriptors?
  • How should capOS represent durable remote authority: opaque broker-held sturdyrefs, sealed persistent capabilities, or storage-service entries that mint live session refs on demand?
  • Which cryptographic identity should a capOS netlayer use first: TLS certificates, Noise static keys, Tor Onion service ids, or a local test-only key?
  • How much of OCapN’s dynamic value model should be admitted at capOS service boundaries, given the existing schema-first security posture?

For current capOS work, this research should be used as grounding for:

  • promise pipelining design
  • network-transparent capability proxy experiments
  • Cap’n Proto RPC interop work
  • durable naming/sturdyref design
  • remote capability release and disconnect semantics
  • third-party capability handoff designs

It should not yet be used to require OCapN wire compatibility for existing capOS demos or to replace the typed Cap’n Proto service model.

Sources

  • Spritely Institute: https://spritely.institute/
  • What is CapTP, and what does it enable?: https://spritely.institute/news/what-is-captp.html
  • Introducing OCapN, interoperable capabilities over the network: https://spritely.institute/news/introducing-ocapn-interoperable-capabilities-over-the-network.html
  • Spritely Goblins v0.18.0 release notes: https://spritely.institute/news/spritely-goblins-v0-18-0-sleepy-actors.html
  • The Heart of Spritely: Distributed Objects and Capability Security: https://files.spritely.institute/papers/spritely-core.html
  • Guile Goblins CapTP manual: https://files.spritely.institute/docs/guile-goblins/0.17.0/CapTP-The-Capability-Transport-Protocol.html
  • Guile Goblins OCapN manual: https://files.spritely.institute/docs/guile-goblins/0.16.1/OCapN.html
  • OCapN draft specifications: https://github.com/ocapn/ocapn/tree/main/draft-specifications
  • CapTP draft specification: https://github.com/ocapn/ocapn/blob/main/draft-specifications/CapTP%20Specification.md
  • OCapN model draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Model.md
  • OCapN netlayers draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Netlayers.md
  • OCapN locators draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Locators.md
  • Syrup repository: https://github.com/ocapn/syrup
  • Cap’n Proto RPC protocol: https://capnproto.org/rpc.html
  • Endo @endo/ocapn: https://docs.endojs.org/modules/_endo_ocapn.html

Research: Browser Engines, Document Engines, and Agent Browsers

Survey of mainstream browser engines, embedding paths, automation protocols, and Donut Browser-style profile orchestration for Browser Capability and Agent Web Sessions.

Source Snapshot

Checked on 2026-04-30:

Design Consequences For capOS

  • Do not make a browser engine a near-term kernel or GUI prerequisite. Modern browser engines assume a large userspace substrate: processes, threads, shared memory, timers, files, DNS, sockets/TLS, fonts, image codecs, GPU or software compositing, profile storage, crash handling, and a sandbox.
  • Split browser work into three tracks: agent/shell browser sessions first, a cap-native document engine as the middle target, then visual browser after GUI. The first track can start as a capability wrapper around an external or hosted engine. The middle track validates cap-backed web host APIs over provided document data. The visual-browser track needs compositor, input, fonts, storage, networking, and userspace-driver safety.
  • Treat browser profiles as capability objects. Cookies, local storage, cache, permissions, proxy selection, downloads, and automation endpoints should be held by BrowserProfile/BrowserContext caps, not ambient files under a hidden profile directory.
  • Standardize the agent-facing surface above CDP/WebDriver BiDi, not below it. CDP is powerful and Chromium-specific; WebDriver BiDi is standardizing bidirectional browser automation. capOS should expose a typed, narrowed BrowserSession capability and use CDP/BiDi/Playwright only as backends.
  • Borrow Donut Browser’s useful product ideas – profile isolation, local API, persistent sessions, per-profile proxy/VPN selection, MCP integration, and AI-control hooks – without adopting anti-detection as a capOS goal. Fingerprint, geolocation, locale, proxy, and user-agent choices must be explicit, auditable policy, not stealth defaults.
  • Reuse the project rule “the interface is the permission.” A process with BrowserNavigate can navigate; a process with BrowserReadPage can inspect page state; a process with BrowserInput can click/type; a process with BrowserDownload and a granted DownloadSink can receive downloaded bytes. Bundling all of those into one raw DevTools port would recreate ambient authority.
  • Treat a browser as a shell capability, not as the shell. The native shell or agent runner may hold a browser session and use it as a tool, but browser JavaScript must not directly hold the shell’s file, launch, network, or approval capabilities.
  • Add a middle track for a cap-native document engine: JS, DOM/CSS, layout, rendering, and perhaps WebAssembly over caller-provided document/resource data, with web host APIs backed by explicit capOS capabilities. This is not full internet browsing, but it could power local HTML/CSS/JS apps and test the browser authority model earlier.

Engine Portability Surface

Chromium has the broadest web compatibility and the strongest automation ecosystem. Ozone is the relevant porting layer: it centralizes low-level input and graphics behind platform interfaces, supports runtime platform binding, and expects new platforms to implement an Ozone backend. CEF is the production embedding path for many native applications: it wraps Chromium/Blink behind stable APIs, binary distributions, and release branches tracking Chromium. WebView2 is Microsoft’s Windows embedding product around Edge/Chromium, with evergreen and fixed-version runtime choices.

capOS implications:

  • Best near-term backend for agent/shell usage is an external Chromium family process controlled through CDP, WebDriver BiDi, or Playwright, with capOS wrapping the endpoint as typed caps.
  • A native capOS Chromium port is a very large post-GUI project. The likely port boundary is Ozone plus a capOS sandbox/profile/network/storage backend, not direct Blink surgery.
  • CDP must not be directly handed to ordinary capOS workloads. It exposes navigation, DOM, network, runtime, storage, input, tracing, and debugging authority in one endpoint and has no stable backward-compatibility guarantee for tip-of-tree protocol use.

WebKit / WPE

WebKit’s upstream port model makes ports first-class maintainable units. WebKitGTK and WPE are maintained by Igalia; WPE is specifically designed as a small-footprint embedded WebKit port with a backend architecture, hardware acceleration, GStreamer media, and periodic releases.

capOS implications:

  • WPE is the most plausible visual-browser candidate once capOS has a GUI substrate because it is meant for embedded systems without a full desktop toolkit.
  • WPE still needs a platform backend, graphics/EGL or software fallback, input, fonts, networking/TLS, storage, media dependencies, and an update story. It is not an early shell feature.
  • WebKit’s port/release discipline is useful precedent for a capOS browser backend: keep platform-specific code narrow and upstreamable where possible.

Gecko / GeckoView

Gecko is Firefox’s full web platform: JavaScript, layout, graphics, media, networking, profiles, preferences, principals, and more. GeckoView is Mozilla’s Android embedding library and powers active Mozilla Android browsers. Its API separates GeckoRuntime, GeckoSession, and GeckoView, delegates storage and UI behavior to embedders, and hides internal principals from the public API.

capOS implications:

  • Gecko is credible as an external backend, especially for browser diversity and WebDriver BiDi, but GeckoView itself is Android-specific and not a desktop/no-OS embedding path for capOS.
  • Gecko’s principal model is important precedent: origin/security context is a first-class internal object. capOS should make origin/session policy explicit in its browser capability layer rather than flattening it to URLs.
  • The runtime/session/view split maps cleanly to capOS capabilities: engine/service supervision, per-profile context, and visual surface should be separate authorities.

Servo

Servo is a Rust browser engine with WebView embedding ambitions, WebGL/WebGPU support, modular architecture, parallel layout, and active cross-platform work. It is not yet a mainstream compatibility replacement for Chromium/WebKit/Gecko, but it is closer to capOS’s implementation culture than the large C++ engines.

capOS implications:

  • Servo is the best research-aligned engine to track for a future native capOS engine experiment because Rust and modular embedding fit capOS better than direct Chromium/Gecko ports.
  • It is not the first user-facing browser choice if the goal is broad web compatibility for operators or agents.
  • Servo’s WebView API and crate decomposition are worth watching for a possible BrowserView/BrowserSession backend once capOS has GUI and ordinary userspace dependencies.

Ladybird / LibWeb

Ladybird is building an independent browser engine from scratch, with an alpha target for Linux and macOS in 2026. It uses a multi-process architecture and is focused on standards rather than embedding today. It is valuable prior art for independent engine architecture and process separation, not a near-term capOS dependency.

capOS implications:

  • Track Ladybird for architecture ideas: isolated renderer processes, separate network and image-decoder processes, and specification-driven development.
  • Do not depend on Ladybird for capOS’s browser plan until its API, platform support, and compatibility stabilize.
  • Its “no inherited engine” posture is inspirational but not pragmatic for capOS near-term. capOS should expose capability-native browser APIs while reusing maintained engines underneath.

Cap-Native Document Engine Substrate

A cap-native document engine is a smaller target than a full browser. It executes a document graph supplied by capOS – for example a boot package, Store object, generated UI bundle, or test fixture – and returns a rendered surface, screenshot, event stream, and bounded DOM/accessibility snapshot. Networking, storage, permissions, clipboard, downloads, and device access are not internal browser privileges; they are host bindings backed by separate capabilities.

This track changes the portability question. Instead of asking “which browser can capOS port?”, it asks “which engine pieces can run with capOS as the host environment?”

Servo As A Document Engine

Servo is the closest architectural fit for this middle track. It is Rust, embeddable, modular, parallel, and already presents itself as a WebView-capable engine. The value for capOS is not only memory safety. It is the possibility of treating the embedding API as the boundary where fetch, storage, permission prompts, surfaces, and resource loading are backed by typed caps.

Risks:

  • Servo still brings a large standards surface.
  • API stability and completeness must be checked at implementation time.
  • A WebView embedding API is not the same as a small deterministic document-rendering library; capOS may still need substantial host glue.

Ladybird / LibWeb As A Document Engine

Ladybird’s LibWeb/LibJS stack is attractive as readable independent-engine prior art. Its multi-process browser architecture also maps well to capOS service decomposition. However, Ladybird is focused on building a full browser, not on providing a stable embeddable document engine for external hosts.

capOS should track it for design ideas and perhaps future experiments, but should not treat it as the near-term substrate for local HTML/CSS/JS apps.

SpiderMonkey

SpiderMonkey is Mozilla’s JavaScript and WebAssembly engine, used by Firefox and Servo, and can be embedded in C++ and Rust projects. It is useful if capOS wants a serious JS/Wasm runtime while building DOM/layout/rendering and host bindings separately or while experimenting with Servo components.

The tradeoff is that SpiderMonkey is only the JS/Wasm engine. DOM, CSS, layout, rendering, networking, storage, event loops, Web APIs, and browser security objects remain host responsibilities unless capOS embeds a larger engine.

JavaScriptCore

JavaScriptCore is WebKit’s ECMAScript engine and an optimizing VM with interpreter and JIT tiers. It is a mature engine, but its natural home is inside WebKit. For capOS, JavaScriptCore is most relevant if the visual-browser track chooses WPE/WebKit; it is less obviously attractive as a standalone cap-native document-engine substrate than Servo or a Rust-native JS engine.

Boa

Boa is an embeddable JavaScript engine written in Rust, with actively maintained crates and a focus on ECMAScript conformance. It is attractive for capOS experiments because it is Rust, smaller than the mainstream browser JS engines, and easier to embed in native services.

The tradeoff is compatibility and performance. Boa is a plausible substrate for trusted/local UI scripting or early host-binding proofs, not a replacement for the JS engine in a general web browser.

QuickJS

QuickJS is a small embeddable JavaScript engine. It is useful as a reference for tiny host-controlled JS runtimes and deterministic local scripting. It is not a DOM/layout/rendering engine and should not be mistaken for browser compatibility.

Consequences

  • A cap-native document engine should start with local/trusted bundles, not arbitrary internet pages.
  • The host API contract matters more than the JS engine choice. fetch, storage, clipboard, downloads, timers, workers, and Wasm imports must all be explicit cap-backed facets.
  • The first proof can be intentionally small: render a packaged HTML/CSS/JS dashboard or demo UI, capture a screenshot and accessibility/DOM snapshot, and prove that missing network/storage/download caps fail closed.
  • Full browser compatibility remains a later engine-port problem. This track buys capOS-native web UI and authority-model validation, not Chrome parity.

Automation And Agent Protocols

CDP

Chrome DevTools Protocol can instrument, inspect, debug, profile, capture screenshots, manipulate DOM/runtime/network state, and control browser targets. It is excellent as a backend and dangerous as a user-facing authority surface. The tip-of-tree protocol changes frequently and is not compatibility-stable.

capOS implication: a CDP endpoint is equivalent to a broad browser-admin cap. Only a trusted browser service should hold it. Ordinary agents receive narrowed typed operations.

WebDriver BiDi

WebDriver BiDi is a W3C Working Draft for bidirectional remote control of user agents. It introduces event streaming over WebSocket and includes modules for browser contexts, browsing contexts, emulation, network, script, and input.

capOS implication: BiDi is a better standards-shaped backend contract than raw CDP for cross-engine automation, but it still exposes more authority than most capOS workloads should receive directly.

Playwright

Playwright operates across Chromium, WebKit, and Firefox and manages specific browser versions for each Playwright release. It is practical as an early host-side harness or browser-service backend while capOS lacks native browser engine support.

capOS implication: use Playwright for development and host-side proof harnesses, but keep it out of the capOS ABI. The capOS ABI should be the typed BrowserSession/BrowserProfile capability surface.

MCP Browser Tools

MCP standardizes how LLM applications connect to external tools, resources, and prompts, with explicit consent and tool-safety guidance. Browser tools are already becoming a common MCP shape: navigate, snapshot, click, type, screenshot, download, and inspect network state.

capOS implication: the browser capability can export an MCP adapter for external agents, but MCP is only an adapter. It must not smuggle raw browser, network, file, or shell authority around the capOS broker.

Donut Browser Lessons

Donut Browser is an open-source anti-detect browser application with a Tauri Rust/TypeScript codebase, AGPL app licensing, per-profile isolation, local REST API, MCP server, proxy/VPN controls, persistent sessions, sync, and engine choice through Wayfern (Chromium-based) and Camoufox (Firefox-based). Its own mission page states that the app is open source while the browser-engine anti-detection components have a mixed proprietary/open-source model.

Useful to adapt:

  • Profile manager as the primary product object.
  • Per-profile cookies, storage, extensions, fingerprint settings, proxy/VPN, and persistent session state.
  • Local API and MCP server as automation surfaces.
  • Ability to launch a profile and attach Playwright/Puppeteer/Selenium through a backend automation endpoint.
  • Default-browser routing where each link chooses a profile/context.

Not adopted:

  • Anti-detection as a default product promise.
  • Closed fingerprint-spoofing logic as a security dependency.
  • Treating “looks like a real device” as a capOS correctness goal.
  • Exposing a broad local browser-control API without capability-scoped grants.

capOS replacement framing:

  • BrowserPersona is explicit policy: user agent, viewport, locale, timezone, geolocation, WebRTC exposure, proxy, and storage partition.
  • BrowserProfile holds state and can be cloned, snapshotted, exported, or destroyed through typed caps.
  • BrowserAutomation is split by operation class, not by one admin token.
  • Audits record profile, persona, network route, downloads, uploads, and whether a human or agent initiated each action.

Open Research Gaps

  • Which backend should be the first in-capOS visual engine candidate: WPE or Servo?
  • Which substrate should be tried first for a cap-native document engine: Servo WebView components, Ladybird/LibWeb experimentation, SpiderMonkey with a custom DOM, Boa for trusted local UI scripting, or QuickJS for tiny proofs?
  • How much of a browser profile should be persistent Store state versus revocable in-memory session state?
  • What is the smallest useful DOM/screenshot/accessibility snapshot for an LLM tool that avoids dumping excessive page data into model context?
  • How should downloads and uploads preserve provenance and consent across browser, shell, and storage caps?
  • Can WebDriver BiDi become the only external automation backend, or is CDP unavoidable for practical Chromium compatibility?

OS Error Handling in Capability Systems: Research Notes

Research on error handling patterns in capability-based and microkernel operating systems. Used as input for the capOS error handling proposal.


1. seL4

Error Codes

seL4 defines 11 kernel error codes in errors.h:

typedef enum {
    seL4_NoError            = 0,
    seL4_InvalidArgument    = 1,
    seL4_InvalidCapability  = 2,
    seL4_IllegalOperation   = 3,
    seL4_RangeError         = 4,
    seL4_AlignmentError     = 5,
    seL4_FailedLookup       = 6,
    seL4_TruncatedMessage   = 7,
    seL4_DeleteFirst        = 8,
    seL4_RevokeFirst        = 9,
    seL4_NotEnoughMemory    = 10,
} seL4_Error;

Error Return Mechanism

  • Capability invocations (kernel object operations) return seL4_Error directly.
  • IPC messages use seL4_MessageInfo_t with label, length, extraCaps, capsUnwrapped. The label is copied unmodified – kernel doesn’t interpret it.
  • MR0 (Message Register 0) carries return codes for kernel object invocations via seL4_Call.

Error Propagation

Fault handler mechanism: each TCB has a fault endpoint capability. On fault (capability fault, VM fault, etc.):

  1. Kernel blocks the faulting thread.
  2. Kernel sends an IPC to the fault endpoint with fault-type-specific fields.
  3. Fault handler (separate process) receives, fixes, and replies.
  4. Kernel resumes the faulting thread.

Design Choices

  • seL4_NBSend on invalid capability: silently fails (prevents covert channels).
  • seL4_Send/seL4_Call on invalid capability: returns seL4_FailedLookup.
  • No application-level error convention – user servers choose their own protocol.
  • Partial capability transfer: if some caps in a multi-cap transfer fail, already-transferred caps succeed; extraCaps reflects the successful count.

Sources

  • seL4 errors.h: https://github.com/seL4/seL4/blob/master/libsel4/include/sel4/errors.h
  • seL4 IPC tutorial: https://docs.sel4.systems/Tutorials/ipc.html
  • seL4 fault handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
  • seL4 API reference: https://docs.sel4.systems/projects/sel4/api-doc.html

2. Fuchsia / Zircon

zx_status_t

Signed 32-bit integer. Negative = error, ZX_OK (0) = success.

Categories:

CategoryExamples
GeneralZX_ERR_INTERNAL, ZX_ERR_NOT_SUPPORTED, ZX_ERR_NO_RESOURCES, ZX_ERR_NO_MEMORY
ParameterZX_ERR_INVALID_ARGS, ZX_ERR_WRONG_TYPE, ZX_ERR_BAD_HANDLE, ZX_ERR_BUFFER_TOO_SMALL
StateZX_ERR_BAD_STATE, ZX_ERR_NOT_FOUND, ZX_ERR_TIMED_OUT, ZX_ERR_ALREADY_EXISTS, ZX_ERR_PEER_CLOSED
PermissionZX_ERR_ACCESS_DENIED
I/OZX_ERR_IO, ZX_ERR_IO_REFUSED, ZX_ERR_IO_DATA_INTEGRITY, ZX_ERR_IO_DATA_LOSS

FIDL Error Handling (Three Layers)

Layer 1: Transport errors. Channel broke. Currently all transport-level FIDL errors close the channel. Client observes ZX_ERR_PEER_CLOSED.

Layer 2: Epitaphs (RFC-0053). Server sends a special final message before closing a channel, explaining why. Wire format: ordinal 0xFFFFFFFF, error status in the reserved uint32 of the FIDL message header. After sending, server closes the channel.

Layer 3: Application errors (RFC-0060). Methods declare error types:

Method() -> (string result) error int32;

Serialized as:

union MethodReturn {
    MethodResult result;
    int32 err;
};

Error types constrained to int32, uint32, or an enum thereof. Deliberately no standard error enum – each service defines its own error domain. Rationale: standard error enums “try to capture more detail than we think is appropriate.”

C++ binding: zx::result<T> (specialization of fit::result<zx_status_t, T>).

Sources

  • Zircon errors: https://fuchsia.dev/fuchsia-src/concepts/kernel/errors
  • RFC-0060 error handling: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0060_error_handling
  • RFC-0053 epitaphs: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0053_epitaphs

3. EROS / KeyKOS / Coyotos

KeyKOS Invocation Message Format

KC (Key, Order_code)
   STRUCTFROM(arg_structure)
   KEYSFROM(arg_key_slots)
   STRUCTTO(reply_structure)
   KEYSTO(reply_key_slots)
   RCTO(return_code_variable)
  • Order code: small integer selecting the operation (method selector).
  • Return code: integer returned by the invoked object via RCTO.
  • Data string: bulk data parameter (up to ~4KB).
  • Keys: up to 4 capability parameters in each direction.

Invocation Primitives

  • CALL: send + block for reply. Kernel synthesizes a resume key (capability to resume caller) as 4th key parameter to callee.
  • RETURN: reply using a resume key + go back to waiting.
  • FORK: send and continue (fire-and-forget).

Keeper Error Handling

Every domain has a domain keeper slot. On hardware trap (illegal instruction, divide-by-zero, protection fault):

  1. Kernel invokes the keeper as if the domain had issued a CALL.
  2. Keeper receives fault information in the message.
  3. Keeper can fix and resume (via resume key) or terminate.
  4. A non-zero return code from a key invocation triggers the keeper mechanism.

Coyotos (EROS Successor) – Formalized Error Model

Cleanly separates invocation-level vs application-level exceptions:

Invocation-level (before the target processes the message): MalformedSyscall, InvalidAddress, AccessViolation, DataAccessTypeError, CapAccessTypeError, MalformedSpace, MisalignedReference

Application-level: signaled via OPR0.ex flag bit in the reply control word. If set, remaining parameter words contain a 64-bit exception code plus optional info.

Sources

  • KeyKOS architecture: https://dl.acm.org/doi/pdf/10.1145/858336.858337
  • Coyotos spec: https://hydra-www.ietfng.org/capbib/cache/shapiro:coyotosspec.html
  • EROS (SOSP 1999): https://sites.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp99.pdf

4. Plan 9 / 9P

9P2000 Rerror Format

size[4] Rerror tag[2] ename[s]
  • ename[s]: variable-length UTF-8 string describing the error.
  • No Terror message – only servers send errors.
  • String-based, not numeric. Conventional strings (“permission denied”, “file not found”) but no fixed taxonomy.

9P2000.u Extension (Unix compatibility)

size[4] Rerror tag[2] ename[s] errno[4]

Adds a 4-byte Unix errno as a hint. Clients should prefer the string. ERRUNDEF sentinel when Unix errno doesn’t apply.

Design Rationale

Avoids “errno fragmentation” where different Unix variants assign different numbers to the same condition. The string is authoritative; the number is an optimization for Unix-compatibility clients.

Sources

  • 9P2000 RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.html
  • 9P2000.u RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.u.html

5. Genode

RPC Exception Propagation

GENODE_RPC_THROW(func_type, ret_type, func_name,
                 GENODE_TYPE_LIST(Exception1, Exception2, ...),
                 arg_type...)

Only the exception type crosses the boundary – exception objects (fields, messages) are not transferred. Server encodes a numeric Rpc_exception_code, client reconstructs a default-constructed exception of the matching type.

Undeclared exceptions: undefined behavior (server crash or hung RPC).

Infrastructure-Level Errors

  • RPC_INVALID_OPCODE: dispatched operation code doesn’t match.
  • Rpc_exception_code: integral type, computed as RPC_EXCEPTION_BASE - index_in_exception_list.
  • Ipc_error: kernel IPC failure (server unreachable).
  • Server death: capabilities become invalid, subsequent invocations produce Ipc_error.

Sources

  • Genode RPC: https://genode.org/documentation/genode-foundations/20.05/functional_specification/Remote_procedure_calls.html
  • Genode IPC: https://genode.org/documentation/genode-foundations/23.05/architecture/Inter-component_communication.html

6. Cross-System Comparison: Transport vs Application Errors

Every capability/microkernel IPC system separates two failure modes:

  1. Transport errors – the invocation mechanism failed before the target processed the request (bad handle, insufficient rights, target dead, malformed message, timeout).

  2. Application errors – the service processed the request and returned a meaningful error (not found, resource exhausted, invalid operation).

SystemTransport errorsApplication errors
seL4seL4_Error (11 values) from syscallIPC message payload (user-defined)
Zirconzx_status_t (~30 values) from syscallFIDL per-method error type
EROS/CoyotosInvocation exceptions (kernel)OPR0.ex flag + code in reply
Plan 9Connection lossRerror with string
GenodeIpc_error + RPC_INVALID_OPCODEC++ exceptions via GENODE_RPC_THROW
Cap’n Proto RPCdisconnected/unimplementedfailed/overloaded or schema types

Common pattern: small kernel error code set for transport + typed service-specific errors for application.


7. POSIX errno: Strengths and Weaknesses for Capability Systems

Strengths

  • Simple (single integer, zero overhead on success).
  • Universal (every Unix developer knows it).
  • Low overhead (no allocation on error path).

Weaknesses for Capability Systems

  • Ambient authority assumption: EACCES/EPERM assume ACL-style access control. In capability systems, having the capability IS the permission.
  • Global flat namespace: all errors share one integer space. Capability systems have typed interfaces; errors should be scoped per-interface.
  • No structured information: just an integer, no “which argument” or “how much memory needed.”
  • Thread-local state: clobbered by intermediate calls, breaks down with async IPC or promise pipelining.
  • No transport/application distinction: EBADF (transport) and ENOENT (application) in the same space.
  • Not composable across trust boundaries: callee’s errno meaningless in caller’s address space without explicit serialization.

No capability system uses a POSIX-style global errno namespace.

Crash Recovery and Supervision: Prior-Art Survey

Survey of crash recovery, supervision, and failure propagation patterns across production systems. Used as input for the capOS Crash Recovery proposal.


1. Erlang/OTP Supervision Trees

Erlang/OTP is the canonical prior art for declarative crash recovery in a capability-shaped process model.

Supervision strategies

A supervisor declares one of four restart strategies:

  • one_for_one: only the crashed child is restarted; siblings are unaffected.
  • one_for_all: when any child crashes, every child is terminated and then every child is restarted. Used when children have shared state.
  • rest_for_one: the crashed child and all children started after it (in declaration order) are terminated and restarted. Used when later children depend on earlier ones.
  • simple_one_for_one: a simplified one_for_one for dynamically added homogeneous workers.

Restart intensity

Supervisors carry an intensity (max restart count) and period (seconds window). If more than intensity restarts occur in any rolling period-second window, the supervisor terminates all children and then itself, escalating the failure to its own parent supervisor. The defaults are intensity = 1 and period = 5; that is, one restart per five seconds before the supervisor gives up.

Each child spec declares a restart type:

  • permanent — always restarted.
  • transient — restarted only on abnormal exit (exit reason other than normal, shutdown, or {shutdown, Term}).
  • temporary — never restarted.

“Let it crash”

The design philosophy is to avoid defensive error-handling at the crash site. A process that encounters an unexpected condition should exit cleanly, relying on its supervisor to restart it in a known-good state. Error recovery code introduces its own bugs; a clean restart from a known-good init is safer.

Linked processes propagate EXIT signals bidirectionally. A supervisor traps exits (process_flag(trap_exit, true)) and converts them to ordinary messages {'EXIT', Pid, Reason}, allowing it to react rather than crash itself. Monitors (erlang:monitor/2) give a unidirectional {'DOWN', Ref, process, Pid, Reason} without the bidirectional link risk.

Lesson for capOS

  • Restart budgets (intensity + period) translate directly: the kernel service supervisor should maintain a crash-loop budget — max N restarts per T seconds — and escalate to a parent authority or enter degraded boot if exceeded.
  • The three child restart types (permanent/transient/temporary) match the restart policy field a capOS service manifest would declare.
  • “Let it crash” applies: a capability server that encounters an unexpected decode error or illegal state should exit rather than continue with corrupted internal state. The supervisor restarts it; stale client caps observe a Disconnected CQE before the server is live again.

2. systemd Service Recovery

systemd is the dominant Linux service supervisor. Its restart model is policy-driven, external to the service.

Restart= modes

The Restart= directive accepts: no (default), on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always.

  • on-failure covers non-zero exit codes, signals (including core dump), and watchdog timeout — the common production choice.
  • on-abnormal covers signals, operation timeouts, and watchdog, but not non-zero exit codes.
  • always restarts unconditionally.

Timing

RestartSec (default 100 ms) is the delay before a restart attempt. It is not a backoff — it is a flat delay between each attempt.

Crash-loop budget

StartLimitIntervalSec (default 10 s) and StartLimitBurst (default 5) form the crash-loop budget: more than StartLimitBurst starts within StartLimitIntervalSec puts the unit in a permanently failed state until manually reset or the system reboots. This is the systemd analogue of OTP intensity/period.

Dependency cascades

OnFailure= lists units to activate when a service enters the failed state; it is typically used to run a notification or diagnostic unit.

Watchdog

WatchdogSec enables a software watchdog: the service must call sd_notify(0, "WATCHDOG=1") at intervals shorter than WatchdogSec. If the heartbeat is absent for the full interval, systemd kills and (if Restart= includes watchdog triggers) restarts the service. This catches live-lock and hang states that do not produce a crash signal.

Lesson for capOS

  • A capability service watchdog translates to a periodic sd_notify-style ping to a watchdog capability. If the server does not renew within a budget, the supervisor sends SIGKILL (or the kernel analogue) and restarts.
  • The crash-loop budget (StartLimitIntervalSec/StartLimitBurst) is the second time this pattern appears, reinforcing that a fixed restart budget per time window is the correct primitive.
  • RestartSec (flat delay, not exponential) is simpler than Kubernetes backoff and appropriate for always-available system services.

3. Kubernetes: Probes and CrashLoopBackOff

Kubernetes separates health probes (liveness, readiness, startup) from the container restart policy, giving operators fine-grained control.

Probes

  • Liveness probe: if it fails, kubelet kills the container and subjects it to the restart policy. Used to detect live-lock (process alive, making no progress).
  • Readiness probe: if it fails, the pod’s IP is removed from all matching Service EndpointSlices. No restart is triggered; the pod stays up but receives no traffic.
  • Startup probe: disables liveness and readiness probes until it succeeds, giving slow-starting containers time to initialize without being killed prematurely.

RestartPolicy

Always, OnFailure, or Never. With Always or OnFailure, a failed container is restarted with exponential backoff: 10 s, 20 s, 40 s, … capped at 5 minutes. If the container runs successfully for 10 minutes, the backoff counter resets.

CrashLoopBackOff

When the restart backoff delay is active and the pod is waiting before the next attempt, the pod status shows CrashLoopBackOff. It is not a terminal state — the pod will still be restarted — but it indicates the container is stuck in a restart loop and kubelet is applying backoff.

Lesson for capOS

  • The readiness/liveness split maps cleanly: a capOS service can expose two status indicators — “alive” (process is running and heartbeating) and “ready” (service is accepting new capability requests). Supervisors and routing layers can use them independently.
  • Exponential backoff with a cap (10 s → 5 min) and a reset window (10 min healthy) is appropriate for user-facing services that should self-heal but not spin continuously.
  • The startup probe concept is relevant for services whose init phase takes longer than the steady-state heartbeat budget.

4. Fuchsia Component Framework

Fuchsia’s Component Framework manages component lifecycles and capability routing between components.

Lifecycle states

A component instance progresses through: Created → Resolved → Started → Stopped → (Shutdown) → Destroyed. Stopping preserves persistent state; Destroyed removes it entirely.

Client observation of a crashed component

When a Fuchsia component crashes, the kernel pauses the faulting thread and delivers a message to registered exception channels. The component’s process is killed (as if via zx_task_kill()), which closes all Zircon channels held by that process. Clients observing those channels receive ZX_CHANNEL_PEER_CLOSED. Component manager receives ZX_CHANNEL_PEER_CLOSED on the runner channel for the component, allowing it to detect and log the crash.

Clients that were bound to a crashed component’s exposed protocol channels also observe ZX_CHANNEL_PEER_CLOSED. Component manager then handles restarting the component (if configured). A new binding request after restart provides a fresh channel — there is no automatic reconnection of the pre-crash channel.

Lesson for capOS

  • The Fuchsia model confirms that the clean contract for server death in a capability system is channel close / peer-closed on all outstanding client channels. capOS should emit a Disconnected CQE to every caller that has a pending request or open session to a server that dies.
  • There is no implicit re-connect: the client must explicitly re-acquire a new capability to the restarted service. Stale caps acquired before the crash must not be silently re-animated after restart.

5. Microkernel Precedent: seL4 and Genode

seL4

seL4 provides no built-in mechanism to notify a client when the process that holds an endpoint dies. A thread fault (capability fault, VM fault, etc.) triggers the thread’s configured fault endpoint, which notifies a designated fault-handler process. The fault handler can fix and resume, or terminate the faulting thread. However, this is per-thread fault delivery — not a general “server died, notify clients” mechanism.

If a server process is killed (all its capabilities revoked, its CNode destroyed), outstanding seL4_Call callers remain blocked on the endpoint permanently unless the endpoint object itself is also destroyed or a reply capability is used. seL4 has no automatic dead-server notification for waiting callers. Building supervision requires explicit userspace monitors (e.g., a watchdog thread with a notification capability polled by the supervisor).

Genode

Genode’s component model gives the parent ultimate control over its children. When a component is destroyed (whether intentionally by the parent or due to a crash), the kernel invalidates all capabilities whose associated RPC object is destroyed, as a direct side effect of object destruction. Subsequent invocations of those capabilities by other components produce an Ipc_error exception at the call site.

The parent observes a graceful exit via the exit() RPC on the parent interface; it receives no explicit crash notification from the kernel. Detecting unexpected death requires the parent to poll state reports or use the heartbeat mechanism in Genode’s init component, which tracks skipped_heartbeats per monitored child.

Lesson for capOS

  • seL4’s silence-on-server-death confirms the gap: callers must not be silently blocked forever when a server dies. capOS must deliver a Disconnected CQE (or equivalent transport-level error) to every pending caller when the server capability is revoked or the process exits.
  • Genode’s implicit capability invalidation on object destruction is the right kernel primitive: the kernel, not userspace, ensures no stale cap can reach a destroyed object. capOS already has this via CapTable revocation.
  • Active death notification to a supervisor capability (rather than polling) is the correct extension — analogous to OTP process monitors.

6. Coredump and Minidump: Capture and Redaction

Core dumps contain a complete snapshot of a process’s address space at the time of the crash. The Linux kernel writes them via core_pattern; systemd routes them through systemd-coredump running as a socket-activated service to enforce access controls and journaling.

The primary security concern is that capability keys, cryptographic material, and user credentials present in process memory at crash time are written verbatim to the dump file. systemd-coredump stores dumps in a mode readable only by root and the process owner, but it provides no built-in redaction of sensitive memory regions. Disabling core dumps (ulimit -c 0) for security-sensitive services is the common mitigation.

Two recent vulnerabilities (CVE-2025-4598 in systemd-coredump and CVE-2025-5054 in Apport) demonstrate that race conditions in coredump handlers can allow local privilege escalation via sensitive memory access.

Lesson for capOS

  • A capability OS dump is structurally more dangerous than a POSIX dump: the crashed process’s CapTable may contain live capabilities to kernel resources that the dump reader does not possess. Dumping capability indices without revocation could allow replay.
  • The correct policy on process crash is to revoke all capabilities of the crashed process before writing any dump — the kernel holds the only authoritative revocation path. A dump tool operating post-revocation sees only dead cap indices, not live authority.
  • Memory regions tagged as containing key material (capability ring buffers, decrypted secrets) should be excluded from dumps; a MADV_DONTDUMP analogue applied to sensitive pages at allocation time is the mechanism.

Applicability to capOS

Across all surveyed systems, four design invariants recur:

  1. Crash-loop budget. Every production supervisor limits restarts per time window (OTP intensity/period; systemd StartLimitBurst/ StartLimitIntervalSec; Kubernetes CrashLoopBackOff backoff). capOS service manifests should carry a maxRestarts + restartWindowSecs budget; on exhaustion the supervisor enters a degraded-boot state rather than spinning.

  2. Dead-server notification is the kernel’s job. seL4 and Genode both demonstrate what happens when the kernel is silent: callers block forever or receive opaque errors. capOS must emit a Disconnected CQE to pending callers when a server’s capability is revoked, and must revoke server capabilities atomically on process exit.

  3. No stale authority after restart. A restarted service gets new capabilities — it does not inherit the pre-crash CapTable. Clients must re-acquire capabilities to the new instance. The Fuchsia model (fresh channel on new binding) and OTP model (new process Pid, old monitors fire DOWN) both enforce this.

  4. Watchdog caps complement passive monitoring. systemd’s WatchdogSec and Genode’s heartbeat mechanism both address live-lock states that produce no crash signal. A watchdog capability that the service must renew periodically is the capOS translation: if the service fails to renew, the supervisor kills and restarts it.


Sources

  • Erlang OTP Supervisor Behaviour: https://www.erlang.org/doc/system/sup_princ.html
  • Erlang stdlib supervisor module: https://www.erlang.org/doc/apps/stdlib/supervisor.html
  • systemd.service(5) man page (Debian): https://manpages.debian.org/jessie/systemd/systemd.service.5.en.html
  • Kubernetes Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
  • Kubernetes Probes: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
  • Fuchsia Component Lifecycle: https://fuchsia.dev/fuchsia-src/concepts/components/v2/lifecycle
  • Fuchsia Exception Handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
  • Fuchsia Component Runner FIDL: https://fuchsia.dev/reference/fidl/fuchsia.component.runner
  • seL4 Fault Handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
  • seL4 IPC Tutorial: https://docs.sel4.systems/Tutorials/ipc.html
  • Genode Recursive System Structure: https://genode.org/documentation/genode-foundations/20.05/architecture/Recursive_system_structure.html
  • Genode Init Component: https://genode.org/documentation/genode-foundations/21.05/system_configuration/The_init_component.html
  • systemd-coredump documentation: https://systemd.io/COREDUMP/
  • CVE-2025-4598 systemd-coredump analysis: https://blogs.oracle.com/linux/analysis-of-cve-2025-4598
  • Core dump security (Kicksecure): https://www.kicksecure.com/wiki/Core_Dumps

Debug, Trace, and Profiling Authority: Prior-Art Survey

Survey of how existing systems scope and gate debug, trace, and profiling access. Each section states the verified fact and the lesson it carries for a capability OS.


1. GDB Remote Serial Protocol (gdbstub)

The GDB Remote Serial Protocol (RSP) is the wire protocol between a GDB client and a gdbstub running on or alongside the target. A stub exposes the target’s entire register file and address space to the connected client via a small set of packet types:

  • g/G — read and write all general-purpose registers.
  • p/P — read and write individual registers.
  • m/M/X — read and write arbitrary memory ranges.
  • Z/z — set and clear software breakpoints, hardware breakpoints, and hardware watchpoints (read, write, or access).
  • s/c — single-step and continue execution.

Feature negotiation (qSupported) lets client and stub advertise extensions, but the baseline packet set already provides full read/write authority over the target’s memory and execution state.

Lesson for capOS. A DebugSession capability is not a read-only observer — it is a read/write authority over the target’s registers, memory, and control flow. Attaching the stub to a process is itself the high-privilege act; the session object must be issued by an explicit grant (e.g., a ProcessSpawner debug grant or a ThreadControl-derived debug capability) rather than derived from any lesser handle. The gdbstub pattern shows that the session boundary is the right chokepoint: once a client holds the session capability the protocol can proceed without further kernel checks.


2. Linux ptrace and the Yama LSM

Ambient-authority problem

Linux ptrace originally allowed any process to PTRACE_ATTACH to any other process running under the same UID that was marked as dumpable. The kernel docs summarize the risk: “a single user is able to examine the memory and running state of any of their processes. For example, if one application (e.g. Pidgin) was compromised, it would be possible for an attacker to attach to other running processes (e.g. Firefox, SSH sessions, GPG agent, etc) to extract additional credentials and continue to expand the scope of their attack.”

Yama ptrace_scope levels

The Yama Linux Security Module adds a sysctl kernel.yama.ptrace_scope with four levels to progressively restrict this ambient authority:

LevelBehaviour
0Classic: PTRACE_ATTACH to any same-UID dumpable process.
1Restricted: only descendants (or processes that have called prctl(PR_SET_TRACER, ...)) may be attached.
2Admin-only: only processes holding CAP_SYS_PTRACE may attach.
3No attach: PTRACE_ATTACH and PTRACE_TRACEME are blocked system-wide; the setting is irreversible once applied.

Most Linux distributions now ship with level 1 as the default, but level 0 remains the kernel default if Yama is not loaded.

Lesson for capOS. Yama exists solely because ambient-authority ptrace is a privilege-escalation footgun. The correct model is the inverse: no process should be able to attach to another without an explicit, pre-granted capability. In capOS terms, DebugSession attach must require a pre-issued debug capability (analogous to level 3 everywhere, not level 0 with an opt-out). The parent process or init can hold a ThreadControl-derived debug grant; a RestrictedLauncher can be configured to never issue one. There is no ambient fallback.


3. Linux perf_events and eBPF gating

perf_event_paranoid

/proc/sys/kernel/perf_event_paranoid is the sysctl controlling what unprivileged processes may sample:

ValueEffect
-1No scope or access restrictions; most permissive.
≥ 0Raw tracepoints blocked for unprivileged users.
≥ 1CPU-level (system-wide) profiling blocked; per-process only.
≥ 2Kernel profiling blocked; user-space events only.

Debian-based distributions additionally define 4 (block all perf for unprivileged users) and use it as the distro default.

CAP_PERFMON and CAP_BPF

Linux 5.8 introduced CAP_PERFMON to separate performance-monitoring authority from the broad CAP_SYS_ADMIN. Holding CAP_PERFMON lets a process bypass perf_event_paranoid scope checks. Similarly, CAP_BPF gates loading BPF programs that have performance or tracing implications (e.g., kprobes, uprobes, perf maps); attaching BPF to a kprobe tracepoint requires CAP_PERFMON or CAP_SYS_ADMIN.

The split reflects the principle of least privilege: a profiling daemon should not require CAP_SYS_ADMIN merely to sample hardware counters.

Lesson for capOS. Read-only sampling (hardware counters, ring buffer) is a distinct authority from read/write debugging. capOS should issue a Sampler capability (read-only, non-interrupting, no memory write) separately from a DebugSession (register/memory read-write, breakpoints). The sampler does not stop the target and transfers no writable authority; the perf/CAP_PERFMON split is the prior-art justification for keeping these two surfaces apart.


4. Fuchsia / Zircon: handle-scoped debug authority

debug_agent and zxdb

On Fuchsia the debugger is split into two components: debug_agent, a component running on the target that holds process handles and communicates with the kernel, and zxdb, the developer-facing client that connects to debug_agent over a socket. The fuchsia.debugger FIDL library defines the boundary:

  • DebugAgent — core protocol; AttachTo accepts a name pattern and FilterType to select which processes to attach to.
  • ProcessInfoIterator, AttachedProcessIterator — read access to thread and process state.
  • Launcher — creates new DebugAgent instances.

The debug_agent acquires process handles from the kernel by being granted them through the Zircon job/process handle tree. Zircon’s handle model means that process operations (reading memory, setting breakpoints, receiving exceptions) all require the caller to hold a process handle with the appropriate ZX_RIGHT_* bits. A process that does not hold a handle to another process cannot inspect or modify it, regardless of UID. The zxdb UI can inspect handle tables of attached processes and displays their ZX_RIGHT_READ/ZX_RIGHT_WRITE/ZX_RIGHT_INSPECT/ZX_RIGHT_SIGNAL rights to the developer.

Exception delivery in Zircon is also capability-scoped: zx_task_create_exception_channel creates a channel on a task (thread, process, or job) object; the caller must hold that task handle. The resulting channel is read-only and can only receive exception messages, not issue commands, which means observing crashes requires the task handle but does not by itself grant write authority.

Lesson for capOS. Fuchsia demonstrates that a production debugger can be built entirely on object-handle authority without ambient attach. The debug_agent component acts like a bounded debug authority domain: it holds process handles for the processes it is authorized to debug, and zxdb interacts only through the FIDL protocol that debug_agent exposes. The capOS equivalent is a DebugSession capability issued per target process, scoped to a running session, with a separate read-only ExceptionChannel cap for crash observation.


5. seL4: TCB capability and debug-build gate

Hardware debug API

seL4 exposes hardware breakpoints, watchpoints, and single-stepping to userspace via TCB object methods, but only when the kernel is built with KernelDebugBuild (equivalently, HardwareDebugAPI=1 in the CMake config). The available TCB invocations are:

  • seL4_TCB_SetBreakpoint — configure a breakpoint or watchpoint (virtual address, access type: read/write/exec, size).
  • seL4_TCB_GetBreakpoint — read the current configuration.
  • seL4_TCB_UnsetBreakpoint — disable a slot.
  • seL4_TCB_ConfigureSingleStepping — break on every N-th instruction.

Each invocation takes a capability to the target TCB. Only a holder of that TCB capability can manipulate the thread’s debug registers.

Debug-only kernel syscalls

KernelDebugBuild also enables:

  • seL4_DebugSnapshot — outputs a CapDL dump of the current kernel capability state to the serial console.
  • seL4_DebugDumpScheduler — dumps TCB addresses, thread names, instruction pointers, priorities, and scheduler states.

These syscalls expose global kernel state and are intentionally excluded from verified (proof) builds where the information flow would violate the formal security model.

Lesson for capOS. seL4 gates per-thread debug authority on possession of the TCB capability, which is the right model. capOS’s DebugSession should similarly be derived from ThreadControl so that only the process or entity that holds ThreadControl for a thread can open a debug session on it. The seL4_DebugSnapshot pattern also shows that a system-wide cap-table snapshot is a separate, higher-privilege operation from per-thread debug access; in capOS a read-only CapTableSnapshot authority can be issued for audit purposes without granting register/memory write access.


6. Genode: capability-session GDB monitor

Architecture

Genode implements user-level debugging via a GDB monitor component that interposes between a target application and its parent. The GDB monitor:

  1. Intercepts session requests from the target before they reach the parent.
  2. Provides local virtual implementations of the CPU service, RM (region-map/address-space) service, and ROM service, wrapping the real core implementations.
  3. Exposes a gdbserver protocol endpoint over a terminal session (TCP or UART).

This gives the GDB monitor “full control over all threads and memory objects (dataspace) and the address space of the target.” The monitor holds real capabilities to the target’s CPU and address-space sessions; the target’s own session handles are virtualized stubs that forward to the monitor.

Capability session scoping

Genode’s Cpu_session interface allows retrieving and modifying thread register and execution state. The API comment in the framework explicitly notes that these operations are “primarily designated for realizing user-level debuggers.” Because the monitor interposes the CPU session, it holds the same authority the parent would hold, but the target holds only stubs — the target cannot see or touch its own debug registers directly.

Lesson for capOS. The Genode monitor pattern reinforces that debugging authority flows from capability delegation, not from process identity. The interposition model also clarifies the ring-trace design decision: debug_tap in capOS captures SQE/CQE ring records passively and does not require interposing a CPU session, which keeps ring-trace authority weaker and non-interrupting by construction. A full DebugSession (register read/write, breakpoints) requires explicit session acquisition from the parent or init, matching the Genode monitor’s explicit CPU-session grant.


Applicability to capOS

The cross-system survey points to a consistent set of design invariants:

  1. DebugSession attach is an explicit, audited capability grant, not ambient. The anti-pattern is Linux ptrace at level 0; Yama level 3 is the correct default posture. In capOS no process inherits the ability to debug another: a DebugSession is derived from ThreadControl, issued by the process’s parent or init, and recorded in the audit log.

  2. Read-only cap-table snapshots transfer no authority. seL4’s seL4_DebugSnapshot is a separate, opt-in, debug-build-only facility. In capOS a CapTableSnapshot cap can be issued for audit visibility without granting any write access to the observed process.

  3. Ring-trace builds on debug_tap and does not stop the target. perf/CAP_PERFMON shows that sampling is a distinct authority class from full debugging. capOS debug_tap ring records are append-only, non-interrupting, and do not feed back into the target’s execution — matching the sampler authority class, not the DebugSession class.

  4. Sampler does not stop the target. Hardware performance counter sampling (CAP_PERFMON semantics) and ring-record sampling (debug_tap) are passive read surfaces. A DebugSession that can set breakpoints, modify registers, or write memory is a distinct, higher-privilege capability and must not be conflated with passive tracing.

  5. Exception observation is weaker than debug write authority. Zircon’s zx_task_create_exception_channel returns a read-only channel. capOS should provide a similar ExceptionObserver capability (receive crash notifications, no write access) independent of DebugSession.


Sources

IX-on-capOS Hosting Research

Research note on using IX as a package corpus and content-addressed build model for a more mature capOS system. It explains what IX provides, why it is useful for capOS, and how to extract the most value from it without importing CPython/POSIX assumptions as an architectural dependency.

capOS alignment note (2026-05-16): None of the stages described here (Stages A–F) are implemented. The capability-native services sketched in this note (BuildCoordinator, Store, Namespace, Fetcher, Archive, BuildSandbox) do not yet exist. Cloud usable-instance work, which IX hosting depends on, remains blocked on DMAPool/DeviceMmio/Interrupt authority and a production NIC/storage driver path. The POSIX adapter track (Phase P1.4) is proceeding independently. IX hosting is future work contingent on a credible userspace-compatibility and storage foundation.

What IX Is

IX is a source-based package/build system. It describes packages as templates, expands those templates into build descriptors and shell scripts, fetches and verifies source inputs, executes dependency-ordered builds, stores outputs in a content-addressed store, and publishes usable package environments through realm mappings.

For capOS, IX should be treated as three separable assets:

  • a package corpus with thousands of package definitions and accumulated build knowledge;
  • a content-addressed build/store model that already fits reproducible artifact management;
  • a compact Python control plane that can be adapted once authority-bearing operations move behind capOS services.

IX should not be treated as a requirement to reproduce Unix inside capOS. Its current implementation uses CPython, Jinja2, subprocesses, shell tools, filesystem paths, symlinks, hardlinks, signals, and process groups because it runs on Unix-like hosts today. Those are implementation assumptions, not the part worth preserving unchanged.

Why IX Is Useful for capOS

capOS needs a credible path from isolated demos to a useful userspace closure. IX is useful because it supplies a package/build corpus and model that can exercise the exact system boundaries capOS needs to grow:

  • process spawning with explicit argv, env, cwd, stdio, and exit status;
  • fetch, archive extraction, and content verification as auditable services;
  • Store and Namespace capabilities instead of ambient global filesystem authority;
  • build sandboxing with explicit input, scratch, output, network, and resource policies;
  • static-tool bootstrapping before a full dynamic POSIX environment exists;
  • differential testing against the existing host IX implementation.

The main value is leverage. IX can give capOS real package metadata, real build scripts, and real toolchain pressure without making CPython or a broad POSIX personality the first required userspace milestone.

Best Way to Get the Most from IX

The optimal strategy is to preserve IX’s package corpus and build semantics while replacing the Unix-shaped execution boundary with capability-native services.

The high-value path is:

  1. Run upstream IX on the host first to build and validate early capOS artifacts.
  2. Use CPython/Jinja2 on the host as a reference oracle, not as the in-system foundation.
  3. Render IX templates through a Rust ix-template component that implements the subset IX actually uses.
  4. Run the adapted IX planner/control plane on native MicroPython once capOS has enough runtime support.
  5. Move fetch, extract, build, Store commit, Namespace publish, and process lifecycle into typed capOS services.

This gets most of IX’s value: package knowledge, reproducible build structure, and a practical self-hosting path. It avoids the lowest-value part: spending early capOS effort on a large CPython/POSIX compatibility layer just to preserve upstream implementation details.

Position

CPython is not an architectural prerequisite for IX-on-capOS.

It is a compatibility shortcut for running upstream IX with minimal changes. For a clean capOS-native integration, the better design is:

  • keep IX’s package corpus and content-addressed build model;
  • adapt IX’s Python control-plane code instead of preserving every CPython and POSIX assumption;
  • run the adapted control plane on a native MicroPython port;
  • move build execution, fetching, archive extraction, store mutation, and sandboxing into typed capOS services;
  • render IX templates through a Rust template service or tightly scoped IX template engine, not full Jinja2 on MicroPython;
  • keep CPython on the host as a differential test oracle and bootstrap tool, not as a required foundation layer for capOS.

MicroPython is a credible sweet spot only with that boundary. It is not a credible sweet spot if the requirement is “make upstream Jinja2, subprocess, fcntl, process groups, and Unix filesystem behavior all work inside MicroPython.”

Sources Inspected

  • Upstream IX repository: https://github.com/pg83/ix
  • IX package guide: PKGS.md
  • IX core: core/
  • IX templates: pkgs/die/
  • Bundled IX template deps: deps/jinja-3.1.6/, deps/markupsafe-3.0.3/
  • MicroPython library docs: https://docs.micropython.org/en/latest/library/index.html
  • MicroPython CPython-difference docs: https://docs.micropython.org/en/latest/genrst/
  • MicroPython porting docs: https://docs.micropython.org/en/latest/develop/index.html
  • Jinja docs: https://jinja.palletsprojects.com/en/latest/intro/
  • MiniJinja docs: https://docs.rs/minijinja/latest/minijinja/

Upstream IX Shape

IX is a source-based, content-addressed package/build system. Package definitions are Jinja templates under pkgs/, mostly named ix.sh, and the template hierarchy under pkgs/die/ expands those package descriptions into JSON descriptors and shell build scripts.

The inspected clone has:

  • 3788 package ix.sh files;
  • 66 files under pkgs/die;
  • a template chain centered on base.json, ix.json, script.json, sh0.sh, sh1.sh, sh2.sh, sh.sh, base.sh, std/ix.sh, and language/build-system templates for C, Rust, Go, Python, CMake, Meson, Ninja, WAF, GN, Kconfig, and shell-only generated packages.

The IX template surface is broad but not arbitrary Jinja. In the package tree surveyed, the Jinja tags used were:

TagCount
block14358
endblock14360
extends3808
if / endif451 / 451
include344
else123
set / endset52 / 52
for / endfor49 / 49
elif23

No macro, import, from, with, filter, raw, or call tags were found in the inspected tree. That matters: IX’s template needs are probably a finite subset around inheritance, blocks, self.block(), super(), includes, conditionals, loops, assignments, expressions, and custom filters.

IX’s own Jinja wrapper is small. core/j2.py defines:

  • custom loader with // root handling;
  • include inlining;
  • filters such as b64e, b64d, jd, jl, group_by, basename, dirname, ser, des, lines, eval, defined, field, pad, add, preproc, parse_urls, parse_list, list_to_json, and fjoin.

That makes the template layer replaceable. The risk is not “Jinja is impossible.” The risk is “full upstream Jinja2 drags in a CPython-shaped runtime just to implement a template subset IX mostly uses in a disciplined way.”

Current IX Runtime Surface

The IX Python core uses ordinary host-scripting features:

  • os, os.path, json, hashlib, base64, random, string, functools, itertools, platform, getpass;
  • shutil.which, shutil.rmtree, shutil.move;
  • subprocess.run, check_call, check_output;
  • os.execvpe, os.kill, os.setpgrp, signal.signal;
  • fcntl.fcntl to reset stdout flags;
  • asyncio for graph scheduling;
  • multiprocessing.cpu_count;
  • contextvars fallback support for asyncio.to_thread;
  • tarfile, zipfile;
  • ssl, urllib3, usually only to suppress certificate warnings while fetchers are shell-driven;
  • os.symlink, os.link, os.rename, os.makedirs, open, and file tests.

core/execute.py is the important boundary. It schedules a DAG, prepares output directories, calls shell commands with environment variables and stdin, checks output touch files, and kills the process group on failure.

core/cmd_misc.py and core/shell_cmd.py cover fetch, extraction, hash checking, archive unpacking, and hardlinking fetched inputs.

core/realm.py maps build outputs into realm names using symlinks and metadata under /ix/realm.

core/ops.py selects an execution mode. Today the modes are local, system, fake, and molot. A capOS executor mode is the correct integration point.

CPython Path

CPython is the obvious route for upstream compatibility:

  • upstream Jinja2 is designed for modern Python and uses normal CPython-style standard library facilities;
  • IX’s current Python code assumes subprocess, asyncio, fcntl, shutil, archive modules, and process semantics;
  • CPython plus libcapos-posix would let a large fraction of that code run with limited changes.

That does not make CPython the right product dependency for IX-on-capOS. CPython pulls in a large libc/POSIX surface and encourages preserving Unix process and filesystem assumptions that capOS should make explicit through capabilities.

CPython should be used in two places:

  1. Host-side bootstrap and reference evaluation.
  2. Optional compatibility mode once libcapos-posix is mature.

It should not be the required path for a clean IX-capOS integration.

If CPython is needed later, capOS has two routes:

  1. Native CPython through musl plus libcapos-posix.
  2. CPython compiled to WASI and run through a native WASI runtime.

The native POSIX route is the only route that makes sense for IX-style build workloads. It needs fd tables, path lookup, read/write/close/lseek, directory iteration, rename/unlink/mkdir, time, memory mapping, posix_spawn, pipes, exit status, and eventually sockets. That is the same compatibility work needed for shell tools and build systems, so it should arrive as part of the general userspace-compatibility track, not as an IX-specific dependency.

The WASI route is useful for sandboxed or compute-heavy Python, but it is a poor fit for IX package builds because IX fundamentally drives external tools, filesystem trees, fetchers, and process lifecycles. WASI CPython can be useful as a script sandbox, not as the main IX appliance runtime.

MicroPython Path

MicroPython is attractive because capOS needs an embeddable system scripting runtime before it needs a full desktop Python environment.

The upstream docs frame MicroPython as a Python implementation with a smaller, configurable library set. The latest library docs list micro versions of modules relevant to IX, including asyncio, gzip, hashlib, json, os, platform, random, re, select, socket, ssl, struct, sys, time, zlib, and _thread, while warning that most standard modules are subsets and that port builds may include only part of the documented surface.

That is a good fit for capOS. It means a capOS port can expose a deliberately chosen OS surface instead of pretending to be Linux.

MicroPython should host:

  • package graph traversal;
  • package metadata parsing;
  • target/config normalization;
  • dependency expansion;
  • high-level policy;
  • command graph generation;
  • calls into capOS-native services.

MicroPython should not own:

  • generic subprocess emulation;
  • shell execution internals;
  • process groups or Unix signals;
  • TLS/network fetching;
  • archive formats beyond small helper cases;
  • hardlink/symlink implementation;
  • content store mutation;
  • build sandboxing;
  • parallel job scheduling if that wants kernel-visible resource control.

Those belong in capOS services.

Native MicroPython Port Shape

A capOS MicroPython port should be a new MicroPython platform port, not the Unix port with a large compatibility shim underneath.

The port should provide:

  • VM startup through capos-rt;
  • heap allocation from a fixed initial heap first, then VirtualMemory when growth is available;
  • stdin/stdout/stderr backed by granted stream or Console capabilities;
  • module import from a read-only Namespace plus frozen modules;
  • a small VFS adapter over Store/Namespace for scripts and package metadata;
  • native C/Rust extension modules for capOS capabilities;
  • deterministic error mapping from capability exceptions to Python exceptions.

The initial built-in surface should be deliberately small:

  • sys with argv/path/modules;
  • os path and file operations backed by a granted namespace;
  • time backed by a clock capability;
  • hashlib, json, binascii/base64, random, struct;
  • optional asyncio if the planner keeps Python-level concurrency;
  • no general-purpose subprocess until the service boundary proves it is necessary.

For IX, the MicroPython port should ship frozen planner modules and native bindings to ix-template, BuildCoordinator, Store, Namespace, Fetcher, and Archive. That keeps the trusted scripting surface small and avoids import-time dependency drift.

Jinja2 and MicroPython

Full Jinja2 compatibility on MicroPython remains unproven and is probably not the optimal target.

Current Jinja docs say Jinja supports Python 3.10 and newer, depends on MarkupSafe, and compiles templates to optimized Python code. The bundled IX Jinja tree imports modules such as typing, weakref, importlib, contextlib, inspect, ast, types, collections, itertools, io, and MarkupSafe. Some of these can be ported or stubbed, but that is a CPython compatibility project, not a small MicroPython extension.

The better path is to treat IX’s template language as an input format and render it with a capOS-native component.

Recommended template strategy:

  1. Build an ix-template Rust component using MiniJinja or a smaller IX-specific template subset.
  2. Register IX’s custom filters from core/j2.py.
  3. Implement IX’s loader semantics: // package-root paths, relative includes, and cached sources.
  4. Reject unsupported Jinja constructs with deterministic errors.
  5. Keep CPython/Jinja2 as a host-side oracle for differential testing until the capOS renderer matches the package corpus.

MiniJinja is a practical candidate because it is Rust-native, based on Jinja2 syntax/behavior, supports custom filters and dynamic objects, and has feature flags for trimming unused template features. IX needs multi-template support because it uses extends, include, and block.

If MiniJinja compatibility is insufficient, the fallback is not CPython by default. The fallback is an IX-template subset evaluator that implements the constructs actually used by pkgs/.

Optimal Architecture

The clean design is an IX-capOS build appliance, not a Unix personality layer that happens to run IX.

flowchart TD
    CLI[ix CLI or build request] --> Planner[ix planner on MicroPython]
    Planner --> Template[ix-template renderer]
    Planner --> Graph[normalized build graph]
    Template --> Graph

    Graph --> Coordinator[capOS BuildCoordinator service]
    Coordinator --> Fetcher[Fetcher service]
    Coordinator --> Extractor[Archive service]
    Coordinator --> Store[Store service]
    Coordinator --> Sandbox[BuildSandbox service]

    Fetcher --> Store
    Extractor --> Store
    Sandbox --> Proc[ProcessSpawner]
    Sandbox --> Scratch[writable scratch namespace]
    Sandbox --> Inputs[read-only input namespaces]
    Proc --> Tools[sh, make, cc, cargo, go, coreutils]
    Sandbox --> Output[write-once output namespace]
    Output --> Store
    Store --> Realm[Namespace snapshot / realm publish]

The planner remains small and scriptable. The authority-bearing work happens in services:

  • BuildCoordinator: owns graph execution and job state.
  • Store: content-addressed objects and output commits.
  • Namespace: names, realms, snapshots, and package environments.
  • Fetcher: network-capable source acquisition with explicit TLS and cache policy.
  • Archive: deterministic extraction and path-safety checks.
  • BuildSandbox: constructs per-build capability sets.
  • ProcessSpawner: starts shell/tools with controlled argv, env, cwd, stdio, and granted capabilities.
  • Toolchain packages: statically linked tools built externally first, then eventually by IX itself.

The adapted IX planner should call service APIs instead of shelling out for operations that are native capOS concepts.

Control-Plane Boundary

MicroPython should see a narrow, high-level API. It should not synthesize Unix from first principles.

Example shape:

import ixcapos
import ixtemplate

pkg = ixcapos.load_package("bin/minised")
desc = ixtemplate.render_package(pkg.name, pkg.context)
graph = ixcapos.plan(desc, target="x86_64-unknown-capos")
result = ixcapos.build(graph)
ixcapos.publish_realm("dev", result.outputs)

The Python layer can still look like IX. The implementation behind it should be capability-native.

Service API Sketch

The exact schema should follow the project schema style, but this is the shape of the boundary:

interface BuildCoordinator {
  plan @0 (package :Text, target :Text, options :BuildOptions)
      -> (graph :BuildGraph);
  build @1 (graph :BuildGraph) -> (result :BuildResult);
  publish @2 (realm :Text, outputs :List(OutputRef))
      -> (namespace :Namespace);
}

interface BuildSandbox {
  run @0 (command :Command, inputs :List(Namespace),
          scratch :Namespace, output :Namespace, policy :SandboxPolicy)
      -> (status :ExitStatus, log :BlobRef);
}

interface Fetcher {
  fetch @0 (url :Text, sha256 :Data, policy :FetchPolicy)
      -> (blob :BlobRef);
}

interface Archive {
  extract @0 (archive :BlobRef, policy :ExtractPolicy)
      -> (tree :Namespace);
}

Important policy fields:

  • network allowed or denied;
  • wall-clock and CPU budgets;
  • maximum output bytes;
  • allowed executable namespaces;
  • allowed output path policy;
  • whether timestamps are normalized;
  • whether symlinks are preserved, rejected, or translated;
  • whether hardlinks become store references or copied files.

Store and Realm Mapping

IX’s /ix/store maps well to capOS Store.

IX’s realms should not be literal symlink trees in capOS. They should be named Namespace snapshots:

IX conceptcapOS mapping
/ix/store/<uid>-nameStore object/tree with stable content hash and metadata
build output dirwrite-once output namespace
build temp dirscratch namespace with cleanup policy
realmnamed Namespace snapshot
symlink from realm to outputNamespace binding or bind manifest
hardlinked source cacheStore reference or copy-on-write blob binding
touch output sentinelbuild-result metadata, optionally synthetic file for compatibility

This preserves IX’s reproducibility model without importing global Unix authority.

Process and Filesystem Requirements

A mature capOS needs these primitives before IX builds can run natively:

  • ProcessSpawner and ProcessHandle;
  • argv/env/cwd/stdin/stdout/stderr passing;
  • exit status;
  • pipes or stream capabilities;
  • fd-table support in the POSIX layer for ported tools;
  • read-only input namespaces;
  • writable scratch namespaces;
  • write-once output namespaces;
  • directory listing, create, rename, unlink, and metadata;
  • symlink translation or explicit rejection policy;
  • hardlink translation or store-reference fallback;
  • monotonic time;
  • resource limits;
  • cancellation.

For package builds, the tool surface is larger than IX’s Python surface:

  • sh;
  • find, sed, grep, awk, sort, xargs, install, cp, mv, rm, ln, chmod, touch, cat;
  • tar, gzip, xz, zstd, zip, unzip;
  • make, cmake, ninja, meson, pkg-config;
  • C compiler/linker/archive tools;
  • cargo and Rust toolchains;
  • Go toolchain;
  • Python only for packages that build with Python.

IX’s static-linking bias helps because the early tool closure can be imported as statically linked binaries.

What to Patch Out of IX

For a clean capOS fit, patch or replace these upstream assumptions:

Upstream assumptioncapOS replacement
subprocess.run everywhereBuildSandbox.run() or ProcessSpawner
process groups and SIGKILLProcessHandle.killTree() or sandbox cancellation
fcntl stdout flag resetremove or make no-op
chrt, nicescheduler/resource policy on sandbox
sudo, su, chownno permission-bit authority; use capability grants
unshare, tmpfs, jailBuildSandbox with explicit caps
/ix/store global pathStore capability plus namespace mount view
/ix/realm symlink treeNamespace snapshot/publish
hardlinks for fetched filesStore refs or copy fallback
curl/wget subprocess fetchFetcher service
Python tarfile/zipfileArchive service
asyncio executorBuildCoordinator scheduler

This is more invasive than a “light patch”, but it is cleaner. The IX package corpus and target/build knowledge are preserved; Unix process plumbing is not.

MicroPython Port Scope

The MicroPython port should be sized around IX planner needs plus general system scripting:

Native modules:

  • capos: bootstrap capabilities, typed capability calls, errors.
  • ixcapos: package graph and build-service client bindings.
  • ixtemplate: template render calls if the renderer is an embedded Rust/C component.
  • ixstore: Store and Namespace helpers.

Python/micro-library requirements:

  • json;
  • hashlib;
  • base64 or binascii;
  • os.path subset;
  • random;
  • time;
  • small shutil subset for path operations if old IX code remains;
  • small asyncio only if planner concurrency remains in Python.

Avoid implementing:

  • general subprocess;
  • general fcntl;
  • full signal;
  • full multiprocessing;
  • full tarfile;
  • full zipfile;
  • full ssl/urllib3;
  • full Jinja2.

Those are symptoms of preserving the wrong boundary.

CPython Still Has a Role

CPython remains useful even if it is not a capOS prerequisite:

  • run upstream IX on the development host;
  • compare rendered descriptors from CPython/Jinja2 against ix-template;
  • generate fixtures for the capOS renderer;
  • bootstrap the first static tool closure;
  • serve as a later optional POSIX compatibility demo.

Differential testing should be explicit:

flowchart LR
    Pkg[IX package] --> Cpy[Host CPython + Jinja2]
    Pkg --> Cap[capOS ix-template]
    Cpy --> A[descriptor A]
    Cap --> B[descriptor B]
    A --> Diff[normalized diff]
    B --> Diff
    Diff --> Corpus[compatibility corpus]

This makes CPython a test oracle, not a trusted runtime dependency inside capOS.

Staged Plan

Stage A: Host IX builds capOS artifacts

Run IX on Linux host first. Add a capos target and recipes for static capOS ELFs. This validates package metadata, target triples, linker flags, and static closure assumptions before capOS hosts any of it.

Outputs:

  • x86_64-unknown-capos target model in IX;
  • recipes for libcapos, capos-rt, shell/coreutils candidates, MicroPython, and archive/fetch helpers;
  • static artifacts imported into the boot image or Store.

Stage B: Template compatibility harness

Build ix-template on the host. Render a package corpus through CPython/Jinja2 and through ix-template. Normalize JSON/script output and record divergences.

Outputs:

  • supported IX template subset;
  • custom filter implementation;
  • fixture corpus;
  • list of unsupported packages or constructs.

Stage C: Native MicroPython port

Port MicroPython to capOS as a normal native userspace program using capos-rt and a small libc/POSIX subset only where needed.

Outputs:

  • REPL or script runner;
  • frozen IX planner modules;
  • native capos, ixcapos, and ixtemplate modules;
  • no promise of full CPython compatibility.

Stage D: BuildCoordinator and sandboxed execution

Implement capOS-native build services and run simple package builds using externally supplied static tools.

Outputs:

  • build graph execution;
  • per-build scratch/output namespaces;
  • deterministic logs and output commits;
  • cancellation and resource policies.

Stage E: IX package corpus migration

Patch IX templates for capOS target semantics. Start with simple C/static packages, then Rust, then Go.

Outputs:

  • C/static package subset;
  • regular Rust package support once regular Rust runtime/toolchain work is ready;
  • Go package support when GOOS=capos or imported Go toolchain support is credible;
  • WASI packages as a separate target family where useful.

Stage F: Self-hosting

Run the IX-capOS appliance inside capOS to rebuild a meaningful part of its own userspace closure.

Outputs:

  • build the MicroPython IX planner inside capOS;
  • build core shell/coreutils/archive tools inside capOS;
  • build libcapos and selected static service binaries;
  • eventually build Rust and Go runtime/toolchain pieces.

Why This Is Better Than “CPython First”

The CPython-first route optimizes for running upstream IX quickly. The MicroPython-plus-services route optimizes for capOS’s actual design:

  • capability authority stays typed and explicit;
  • build isolation is native instead of Linux namespace emulation;
  • Store/Namespace are first-class rather than hidden behind /ix;
  • fetch/archive/build operations are auditable services;
  • the scripting runtime remains small;
  • the system does not need full CPython before it can have a package manager;
  • CPython can still be added later through the POSIX layer without blocking IX-capOS.

The tradeoff is that IX-capOS becomes a real port/fork at the control-plane boundary. That is acceptable for a clean capability-native fit.

Risks

Template compatibility is the main technical risk. IX uses a restricted-looking Jinja subset, but exact self.block(), super(), whitespace, expression, and undefined-value behavior must match closely enough for package hashes to remain stable. This needs corpus testing, not confidence.

Build-script compatibility is the largest scope risk. Even if IX planning is native, the package corpus still executes conventional build systems. capOS must provide enough shell, coreutils, archive, compiler, and filesystem behavior for those tools.

Toolchain bootstrapping is a long dependency chain. The first useful IX-capOS system will import statically linked tools from a host. Native self-hosting is late-stage work.

Store semantics need care around directories, symlinks, hardlinks, mtimes, and executable bits. These details affect build reproducibility and package compatibility.

MicroPython must not grow into a bad CPython clone. If many missing modules are implemented only to satisfy upstream IX assumptions, the design boundary has failed.

Recommendation

Adopt IX as a package corpus and build model, not as a CPython/POSIX program to preserve unchanged.

The optimal capOS-native solution is:

  1. Host-side upstream IX remains available for bootstrap and oracle tests.
  2. ix-template in Rust renders the actual IX template subset.
  3. Native MicroPython runs the adapted IX planner/control plane.
  4. capOS services execute all authority-bearing operations: fetch, extract, build sandbox, Store commit, Namespace publish, and process lifecycle.
  5. CPython is deferred to general POSIX compatibility and optional tooling.

This makes MicroPython the sweet spot for the in-system IX control plane while avoiding the trap of turning MicroPython into CPython.

Pingora Architecture and Philosophy: Research Report for capOS

Research on Cloudflare’s Pingora framework and whether capOS high-level interfaces should borrow its shape.

Status 2026-06-10 13:25 UTC: the kernel Phase B socket path described below is since retired — the kernel socket owner, TcpSocket.intoTerminalSession, and the telnet-gateway demo are removed, and the production socket path is the Phase C userspace network stack. The directional guidance still applies, read against the userspace stack.

Status 2026-05-23 00:06 UTC: the directional guidance in this report remains current. Since the report was first written, Phase B of the networking proposal has landed: NetworkManager, TcpListener, TcpSocket, and TcpSocket.intoTerminalSession are implemented in-kernel; the telnet-gateway userspace service runs on a manifest-forwarded TcpListenAuthority and RestrictedShellLauncher, exercising the accept/negotiate/session-mint/shell-launch/cleanup lifecycle described in the “Concrete capOS Direction” section. Bounded SSH gateway prerequisites (SshHostKey, AuthorizedKeyStore, public-key session minting, restricted shell launch) are implemented as kernel stubs and fixture proofs; encrypted SSH transport and an OpenSSH-compatible handshake are not yet implemented. capos-service slice 1 has landed as a standalone no_std crate: the plaintext Telnet gateway now uses ServiceMain/ServiceRuntime for initialize, dependency-wait, readiness, and run-loop structure. The TerminalSessionFromByteStream / byte-stream terminal host, endpoint-loop helpers, metrics, budgeting, and graceful handoff pieces remain open work in docs/proposals/capos-service-proposal.md.

Bottom Line

capOS should build some high-level userspace interfaces inspired by Pingora’s architecture, but should not make Pingora’s HTTP proxy model, callback set, or runtime structure part of the kernel ABI.

The useful idea is not “copy Pingora.” The useful idea is an opinionated library layer that owns repetitive service mechanics and exposes a typed, phase-oriented customization surface to application code. For capOS, that belongs above the capability ring, in userspace libraries and domain services:

  • capos-rt remains the raw transport owner: bootstrap, CapSet, ring client, typed handles, completion matching, release flushing, exception decoding.
  • capos-service should own service lifecycle mechanics: endpoint receive/return loops, readiness, dependency waiting, shutdown, background tasks, metrics hooks, and graceful handoff.
  • Domain libraries such as libcapos-http, terminal hosts, network services, storage services, and agent services can expose Pingora-style phase hooks for their specific request lifecycle.
  • Kernel capability interfaces should stay narrow, typed, and stable. Do not add a generic Service capability, callback registry, plugin API, or Pingora-like phase machine to the kernel.

This is a “yes, but only at the userspace framework layer” recommendation.

Sources

Primary external sources:

capOS grounding read for this comparison:

What Pingora Is

Pingora is a Rust framework for building programmable network services, especially HTTP proxies. Cloudflare built it after concluding that NGINX’s process/worker architecture and extension model were limiting performance, connection reuse, safety, and feature velocity at Cloudflare scale.

The original design pressure matters:

  • NGINX’s per-worker connection pools harmed reuse as worker count increased. Pingora’s shared multithreaded architecture improved origin connection reuse and reduced new TCP/TLS handshakes.
  • Cloudflare wanted a statically typed, memory-safe implementation language rather than a C core plus Lua extension layer.
  • Cloudflare chose to implement its own HTTP handling rather than rely on an off-the-shelf library because it needed control over non-standard Internet traffic and product-specific behavior.
  • Pingora is a library and toolset, not a finished proxy binary. Users build their own executable around Pingora’s server, service, and proxy APIs.

That last point is the main architectural lesson for capOS: the framework is valuable because it packages the hard reusable mechanics while leaving product logic in typed extension points.

Architecture

Server, Services, and Applications

Pingora’s top-level Server represents one process. It owns configuration, CLI handling, daemonization, signal handling, service startup, graceful shutdown, and zero-downtime upgrade mechanics. A Server hosts multiple services.

A Service is the long-running unit of work. Listening services own one or more endpoints and an application object. Background services run supporting tasks such as discovery, health checks, metrics, or bootstrap logic. Recent Pingora versions also include service dependency metadata, readiness watches, and topological startup ordering.

The layering is deliberately split:

  • Server owns process-level operation.
  • Service owns listener setup, endpoint accept loops, runtime choice, and shutdown propagation.
  • ServerApp handles an established transport stream.
  • HttpServerApp adds HTTP session negotiation and H1/H2 handling.
  • HttpProxy implements the HTTP proxy workflow.
  • User code implements ProxyHttp to customize the proxy phases.

This means the server has no special concept of “proxy” at the root. Proxying is one application shape hosted by the generic service container.

Per-Service Runtime

Each service gets its own runtime/threadpool. Pingora can use Tokio’s normal multi-threaded work-stealing runtime or a “no steal” runtime built from multiple single-threaded Tokio runtimes. The no-steal option exists because work stealing has overhead, while isolated current-thread runtimes can still use multiple cores.

The important design lesson is not the exact runtime. capOS cannot inherit a Tokio process model directly. The lesson is that runtime policy is a service container concern, not application business logic.

Phase-Oriented Proxy Logic

Pingora’s ProxyHttp trait exposes an ordered lifecycle for a proxied request:

  • initialize per-request context,
  • run early and normal request filters,
  • decide whether to serve from cache or go upstream,
  • select an upstream peer,
  • handle connect success or failure,
  • modify the upstream request,
  • process request body chunks,
  • process upstream response headers, body chunks, and trailers,
  • process downstream response headers, body chunks, and trailers,
  • decide retry/failover behavior,
  • report final logging and summaries.

Most filters are optional. A per-request CTX object is created for each request and is passed mutably through the phases. Shared state across requests is ordinary Rust shared state such as Arc, atomics, or locks.

The ergonomics are strong because the framework gives engineers a lifecycle map. Application code overrides the phase where it has policy, while the framework owns parsing, connection setup, pooling, retries, duplex body forwarding, common error response handling, and resource cleanup.

Connection Pooling and Peer Identity

Pingora pools upstream connections automatically after successful requests, but only reuses a connection for the exact same Peer. Its peer identity includes address, scheme, SNI, client certificate, certificate verification behavior, hostname verification, alternate common name, and proxy settings.

The security lesson is broad: resource reuse must be keyed by all attributes that affect authority, identity, confidentiality, and protocol semantics. A connection pool keyed only by address is wrong for a multi-tenant service.

Failure and Retry

Pingora separates connect failure from post-connect proxy failure. It lets application code mark errors retryable, and it documents the idempotency boundary: retrying after the request was sent is not generally safe for non-idempotent methods.

Its common error type carries an error type, source, retry status, optional cause, and context. That mirrors capOS’s existing split between transport errors and typed application exceptions, but Pingora puts more emphasis on whether a high-level operation can be retried.

Operations Are Part of the Framework

Pingora treats startup, daemonization, graceful termination, graceful upgrade, configuration, error logging, Prometheus metrics, readiness, and service dependencies as framework-level concerns.

The zero-downtime upgrade path transfers listening sockets from an old process to a new one and lets existing requests drain during a grace period. That is a specific Linux mechanism, but the higher-level idea maps to capOS live upgrade: stable acceptor or endpoint authority should be retargetable without dropping new work, and old in-flight calls should be allowed to drain when policy says they can.

Philosophy

Pingora’s philosophy is pragmatic, not minimalist:

  • Build a framework, not a monolithic product.
  • Own the hot-path mechanics so users do not reimplement them incorrectly.
  • Expose typed hooks at lifecycle points where policy naturally belongs.
  • Keep common operational behavior in the container rather than every service.
  • Prefer static typing and memory safety for extensibility.
  • Share reusable resources across workers when the safety boundary allows it.
  • Give application code enough control to handle product-specific edge cases.

There is tension in that philosophy. Pingora’s permissive, Internet-facing HTTP goals require supporting odd traffic and complex reuse rules. That flexibility can create security hazards if defaults are too generous or protocol state is not exhausted before reuse.

The 2025 and 2026 Pingora request-smuggling advisories are directly relevant to capOS design. The lesson is not that Pingora is unsafe. The lesson is that high-level frameworks become security-critical because they decide defaults, message framing, cache keys, retry rules, and reuse conditions for their users. capOS libraries should treat those defaults as part of the trusted interface.

Mapping to capOS

What capOS Should Adopt

1. A userspace service framework layer.

capOS already has a low-level transport owner in capos-rt. The next layer should be an opinionated service framework that runs on top of typed capability clients and endpoint server helpers. It should not replace capos-rt; it should use it.

Candidate shape:

  • service lifecycle: init, ready, run, shutdown, drain;
  • dependency waiting: typed readiness handles, not global service names;
  • endpoint serving: generated or handwritten RECV/RETURN loops;
  • background tasks: timers, discovery, health checks, metrics export;
  • graceful handoff: transfer or retarget listener/endpoint authority;
  • structured observability: request summaries, metrics, error suppression policy, and panic boundaries;
  • resource accounting: explicit budgets or donated resources for sessions.

2. Phase-oriented domain libraries.

Pingora-style phases fit domains with real lifecycles:

  • HTTP proxy and fetch service: request filter, route, connect, upstream request, response, body chunks, logging, failover.
  • Terminal host: accept transport, negotiate transport options, authenticate session, spawn shell, proxy terminal I/O, log, cleanup.
  • Storage service: authorize operation, resolve object, choose cache path, perform read/write, commit, audit.
  • Agent service: authenticate caller, bind tool authority, plan invocation, stream outputs, log decision context.

The phase names should be domain-specific. A generic OS-wide phase machine would become vague and hard to secure.

3. Per-request context objects.

Pingora’s CTX model is a good fit for capOS service libraries. Each request or session should have an owned context object dropped at the end of the lifecycle. That context should carry derived policy decisions, peer identity, timing, resource reservations, and transfer state.

This is cleaner than hidden globals and safer than asking later phases to reparse the original request.

4. Resource reuse keyed by authority identity.

Future capOS HTTP/TLS/TCP services should reuse expensive resources, but the pool key must include all security-relevant identity:

  • target address and protocol;
  • TLS SNI, ALPN, certificate policy, and client certificate;
  • authority cap identity or object epoch;
  • caller/session identity if it affects policy;
  • cache namespace or tenant;
  • request transformation policy when it changes what upstream sees.

This is the capOS analogue of Pingora’s strict Peer equality.

5. Operational lifecycle as an API.

The service framework should make readiness, graceful shutdown, and upgrade handoff explicit. That connects to capOS’s future live-upgrade proposal and avoids baking operational behavior into ad hoc service code.

6. Retry semantics as typed policy.

High-level clients should surface retry decisions only where the domain can state idempotency and replay safety. For example, HttpEndpoint.get() can have different retry policy than HttpEndpoint.post(), and a storage write should not be retried unless the interface defines idempotent operation IDs.

What capOS Should Reject

1. Do not make Pingora’s phases kernel concepts.

The kernel should continue to dispatch narrow CapObject methods over the ring. It should not know about request filters, upstream peers, retries, logging phases, or protocol-specific context. Those belong in userspace.

2. Do not add a generic service/plugin capability.

A generic Service.call(phase_id, bytes) or callback registry would weaken capOS’s central design bet: the typed interface is the permission. Use a domain-specific Cap’n Proto interface for authority and a domain-specific library for ergonomics.

3. Do not inherit Pingora’s process model.

Pingora is one unprivileged Linux process hosting multiple services with per-service runtimes. capOS’s isolation model is many processes with explicit capability grants. Service libraries may internally multiplex tasks, but authority boundaries should remain process and capability boundaries.

4. Do not use globals as authority.

Pingora’s ordinary Rust shared-state model is reasonable inside one trusted process. In capOS, cross-service authority must flow through capabilities, not statics, process-wide registries, or global service discovery.

5. Do not ship permissive defaults where explicit policy is needed.

Pingora 0.8.0 removed an insecure cache-key default and hardened HTTP framing. capOS should take this as a rule: cache keys, tenant identity, message framing, body drain behavior, reuse policy, and transfer semantics must be explicit or fail closed.

Concrete capOS Direction

The right decomposition is:

schema/capos.capnp
    Stable authority-bearing interfaces.
    Keep small and domain-specific.

capos-rt
    Raw runtime and transport:
    CapSet, ring, typed handles, release, result caps, exceptions.

capos-service
    Generic userspace service container:
    lifecycle, endpoint loops, readiness, shutdown, background tasks,
    metrics, request context, resource budgeting.

domain libraries
    Pingora-like phase APIs where they make sense:
    HTTP/fetch, terminal host, storage, supervisor, agent tools.

init/supervisors
    Compose services by passing capabilities, not by global names.

The first useful application is not the current runtime/Go milestone. The nearest capOS milestone where this should shape implementation is networking Phase B and the Telnet Shell Demo:

  • Keep NetworkManager, TcpListener, TcpSocket, and TerminalSession as narrow capability interfaces.
  • Build the Telnet gateway as a userspace service that uses a lifecycle helper: accept connection, negotiate Telnet, create a socket-backed TerminalSession, spawn shell with exact grants, proxy until exit, log, cleanup.
  • Later, build Fetch and HttpEndpoint services with a Pingora-inspired HTTP lifecycle library rather than exposing raw socket authority to apps.

The first concrete proposal should therefore target terminal/networking lifecycle, not HTTP. This is now tracked in capos-service. A useful slice is:

  1. TerminalSessionFromByteStream / byte-stream terminal host;
  2. lifecycle wrapper around accept, session minting, proxying, and cleanup;
  3. metrics plus request/session context hooks;
  4. network service container;
  5. HTTP/fetch services only after the terminal/networking lifecycle proves the authority and cleanup model.

For generated clients, the Pingora lesson argues for generated or handwritten thin wrappers, not raw Cap’n Proto calls everywhere. The wrapper owns:

  • parameter encoding and result decoding;
  • typed application exceptions;
  • retry classification if the interface defines it;
  • result-cap adoption;
  • request summary and metrics hooks.

Risks and Review Rules

Any Pingora-inspired capOS framework should be reviewed against these rules:

  • Extension hooks must receive the narrowest capabilities needed for that phase. Do not hand a broad service object to every hook by convenience.
  • Request context must be lifecycle-owned and dropped deterministically.
  • Pool keys must include all authority and identity fields that affect reuse.
  • Retry policy must be explicit about whether upstream side effects may have happened.
  • Cache-key construction must have no insecure default for multi-tenant data.
  • Protocol parsers must drain or close before reusing a stream.
  • Background tasks must be budgeted and cancellable during service shutdown.
  • Readiness must mean the exported capability is actually ready to serve, not merely that the process started.
  • Generated high-level wrappers must preserve the transport/application error split already documented in the capability ring and userspace runtime docs.

Recommendation

Use Pingora as precedent for a capability-native service framework: library-first, typed, phase-oriented, operationally aware, and opinionated about common mechanics.

Do not use Pingora as precedent for broad kernel interfaces, ambient service discovery, global registries, generic plugin phases, or permissive defaults. The capOS version should make authority narrower than Pingora does, because capOS has a stronger capability model available at every boundary.

Research: Game Mechanics Prior Art

This note records the external game-mechanics grounding used for Aurelian Frontier planning. It exists because the original planning commit 79a9afc translated external mechanics references into capability-shaped Aurelian tasks, but did not leave a standalone research note. The recorded planning rationale for that commit used Stardew Valley, EVE Online, Evil Islands, PixiJS, and Tiled references, with an explicit instruction not to clone those games. This note covers the game systems used by Aurelian: Stardew Valley, EVE Online, and Evil Islands.

Source Snapshot

Checked on 2026-04-29:

The Stardew Valley Wiki and EVE Online support/academy pages are treated as the primary grounding for their systems. For Evil Islands, the official FAQ mirror and Nival page ground construction and broad tactical identity; Ars Technica and PlayItHardcore are secondary sources for combat details.

Planning Audit

Commit 79a9afc records the durable planning outcome: external mechanics are inputs to capOS-shaped tasks, not clone targets. The planning context named these stable mechanics:

  • Stardew Valley: seasons, festivals, schedules, gifts, and affection.
  • EVE Online: brokered markets and blueprint/material/facility industry.
  • Evil Islands: material/level/gold/equipment construction and limited enchantment.
  • PixiJS/Tiled: later browser-client rendering, outside this note’s mechanics scope.

The commit body for 79a9afc says the patch translates external mechanics into capOS-shaped tasks for seasonal calendars, regional settlements/outposts, service-mediated order books, blueprint/artifact construction, token-budgeted agent NPCs, and a 2D tilemap browser client. This research note keeps that translation explicit and auditable.

Stardew Valley

Stable mechanics to borrow:

  • Seasons create calendar pressure. Stardew Valley uses four seasons of 28 days; routines, festivals, visuals, and available resources can vary by season.
  • Resource availability is table-driven. Crops, forage, fish, and shop selections are season-sensitive.
  • Season boundaries matter. Ordinary crops wither at season change unless they are explicitly multi-season.
  • Festivals are scheduled events that can alter access, activities, prizes, shops, dialogue, and social opportunities.
  • Relationships are explicit profile facts. Talking, gifts, missed interaction, and events affect a visible relationship meter rather than being pure flavor.

Aurelian translation:

  • Model AdventureCalendar as explicit service-owned state, with fixed-smoke calendar values for deterministic QEMU proof and separate production seeds later.
  • Keep seasonal resources bounded and generated from content: crops, forage, fish, shop stock, route hazards, and outpost production.
  • Make multi-season resources explicit in content validation.
  • Treat festivals and military events as scheduled overlays that affect actor presence, witness availability, shop stock, route risk, quests, and debrief choices.
  • Store gifts, favors, affection, faction standing, and event participation as profile or ledger facts owned by game services, not client-local counters.

Do not borrow:

  • Unbounded daily chores as the core loop. Aurelian is an expedition and authority game; calendar pressure should sharpen mission choices, not become farm maintenance.
  • Client-owned social counters. Social state must remain authoritative and auditable.

EVE Online

Stable mechanics to borrow:

  • Markets are brokered. Buy and sell requests go through an order-matching system rather than choosing a specific counterparty directly.
  • Market eligibility matters. Some assets can be traded through market orders; others require contracts, custody, corporation roles, or special flows.
  • Matching is deterministic and rule-driven. Orders match immediately when compatible prices and ranges cross; otherwise they remain listed.
  • Industry is blueprint and job based. Manufacturing uses blueprints, materials, job types, time, and facility or slot constraints.
  • Production decisions have location and logistics consequences.

Aurelian translation:

  • Keep the current actor-local market verbs as the proof slice, then evolve toward a MarketService or equivalent service-owned order book.
  • Define market-eligible item classes. Ordinary stackable supplies can use buy/sell orders; relics, writs, witness-certified custody, and dangerous artifacts move through explicit custody or contract-style protocols.
  • Implement order books with side, item, quantity, price, location/range, expiry, fees, idempotency keys, and ordered ledger receipts.
  • Route multi-owner exchange through reserve/escrow, commit/release, stale-version rejection, cancellation, retry, and crash recovery.
  • Use blueprint jobs for construction: inputs, facility, duration, authority gates, output bounds, and service-owned job state.

Do not borrow:

  • A fully player-driven MMO economy as the first target. Aurelian needs a small authoritative regional economy that proves capability boundaries before it needs market depth.
  • Market transfer for every object. Authority-bearing objects should stay outside generic order books unless a later design proves the custody model.

Evil Islands

Stable mechanics to borrow for construction:

  • Equipment construction combines a design/blueprint with material choices and money; unavailable material can be bought as part of assembly cost.
  • Material class and quality affect item properties. Materials carry distinct weight, durability, energy/complexity, damage, armor, and vulnerability characteristics.
  • Constructed or repaired items can be inspected before committing the job.
  • Enchantment is constrained by object capacity and spell complexity; equipment can carry limited spell effects instead of arbitrary modifiers.

Stable mechanics to borrow for combat:

  • Damage type matters. Slashing, piercing, and crushing style differences make weapon choice meaningful against different enemy defenses.
  • Body-part targeting adds tactical texture. Head, hands/arms, and legs have distinct consequences such as critical risk, attack/casting slowdown, and reduced pursuit.
  • Sight and scouting shape fight selection. Long vision and stealth let the player choose engagements instead of charging every hostile.
  • Pulling and alert behavior are tactical risks. Enemies can notify related enemies, so careless engagement can turn one fight into many.
  • Cast time is a combat risk. Offensive magic can be punished if the player casts while exposed.

Aurelian translation:

  • Put construction inputs in generated content: blueprint, required materials, facility class, cost, duration, rank/star/circle gates, output bounds, artifact authority, and enchantment slots.
  • Derive bounded item properties from blueprint, material, facility quality, paid cost, and player competence. Avoid unbounded loot rolling.
  • Use a small deterministic target-zone set for combat: head, hands, legs, and core.
  • Add damage/mitigation metadata for weapon type, spell type, zone armor, ward state, and inspected knowledge.
  • Make scouting and inspection upgrade enemy information from rough threat to zone armor, ward state, intent, and likely counters.
  • Support stealth openings and pull/alert behavior as explicit service-owned state transitions with readable causality in transcript output.
  • Add fatigue and cast interruption as explicit costs. Retreat should be hard when a mob blocks it, but failures must be legible and deterministic.

Do not borrow:

  • Hidden real-time randomness or reaction-speed demands. Aurelian’s proof path remains command-level and deterministic.
  • Punitive infinite-monster-fatigue behavior. Monsters can pressure retreat, but the service should name the reason and keep rules fair enough to test.
  • Gore or locational damage presentation as spectacle. Body zones exist for tactical outcomes and readable state.

Cross-Game Design Rules For Aurelian

  • External mechanics are planning inputs, not clone targets.
  • Every durable fact that matters to public world state belongs in an authoritative service: calendar, market, construction jobs, profile progression, social standing, custody, and receipts.
  • Every user-visible refusal should name the missing gate: authority, location, rank, resource, stale version, custody, fatigue, target state, or policy.
  • Use pure Rust tests for deterministic rules: calendar rollover, seasonal availability, market matching, construction validation, property derivation, target-zone damage, fatigue, and alert propagation.
  • Use a real capOS userspace test process for cross-service scenarios: expedition flow, custody, market reserve/commit/release, construction jobs, and party transfer.
  • Keep the shell transcript as a low-dependency smoke proof and command-parser proof. It should not be the only test for complex gameplay state machines.

Small Open-Weights LLM Survey for the capOS Agent-Shell

Research notes on current (early 2026) open-weights language models in the 2-4 B active-parameter range, their suitability for the capability-served planner described in docs/proposals/llm-and-agent-proposal.md, and a rough compute-cost estimate for training a comparable model from scratch.

Primary sources: OpenRouter model catalog (https://openrouter.ai/api/v1/models, 353 models listed at survey time); empirical probe against OpenRouter’s hosted endpoints using an agent-planner prompt; published training reports (Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT blog posts); Chinchilla scaling law (Hoffmann et al., 2022).


1. Candidate Landscape

Two families of candidates match “2-4 B active parameters”:

  • Dense 2-4 B: inference FLOPs and memory footprint both scale with total parameters. Friendly to low-RAM hosts.
  • MoE with 2-4 B active: inference FLOPs scale with active params, but total weights must be resident. Only viable on hosts with enough RAM to page-cache the full expert stack.

Dense contenders observed as of 2026-04-24:

ModelParamsLicenseContextNotes
Qwen3-4B-Instruct4 BApache-2.032 KStrong tool-use post-training
Qwen3-1.7B-Instruct1.7 BApache-2.032 KSame family, smaller floor
Gemma 3 4B IT4 BGemma license128 KMultilingual; verbose outputs
Llama 3.2 3B Instruct3 BLlama 3.2 Community128 KPermissive but not OSI
Ministral 3B (2512)3 BMistral Research License128 KNon-commercial; blocks ISO redistribution
Phi-4-mini3.8 BMIT16 KReasoning-leaning training
IBM Granite 4.0 H Micro~3 BApache-2.0128 KNew architecture, less battle-tested
SmolLM3-3B (HuggingFace)3 BApache-2.064 KFully open data + training code

MoE contenders with ~3 B active:

ModelActiveTotalLicenseContextq4 weight size
Qwen3-30B-A3B-Instruct-2507~3 B30 BApache-2.0262 K~18 GiB
Qwen3-Coder-30B-A3B-Instruct~3 B30 BApache-2.0160 K~18 GiB
Qwen3-Next-80B-A3B-Instruct~3 B80 BApache-2.0262 K~48 GiB
Qwen3.5-35B-A3B~3 B35 BApache-2.0262 K~21 GiB
IBM Granite 4.0 Tiny (7B-A1B)~1 B7 BApache-2.0128 K~4 GiB

2. Empirical Probe

Prompt

Agent-planner system prompt: “You are a capOS shell planner. Given a goal and typed tool descriptors (name + param schema), emit a single JSON ActionPlan: {"steps":[{"tool":..,"args":..,"rationale":..}]}. Never invoke tools. Only reference tools from the descriptor list. Output JSON only, no prose.”

User prompt: three typed tool descriptors (ServiceSupervisor.restart, NetworkStack.info, LogReader.tail) and the goal “Restart the network stack, but first confirm it’s in a failed state by checking status and last 20 log lines.”

The test exercises three properties a capOS planner needs:

  1. Correct step ordering (info + tail before restart).
  2. Correct arg packing for methods with and without arguments.
  3. Pure JSON output without Markdown fences, which the dispatcher must otherwise strip.

Results

ModelJSON validOrder correctFencesArg shape
Qwen3-30B-A3B-Instruct-2507yesyesnonecompact, correct
Qwen3-Next-80B-A3B-Instructyesyesnonecorrect, verbose
Qwen3.5-35B-A3Byesyesnonecorrect
Qwen3-8B (proxy for Qwen3-4B)yesyesnonecorrect
Gemma 3 4B ITyesyes```json fencefabricated empty status:"" arg on zero-arg call
Ministral 3B (2512)yesyes```json fencecorrect
Llama 3.2 3B Instructyesno (restart before log check)``` fencecorrect
IBM Granite 4.0 H Microno (three duplicate steps keys in one object)none

Qwen3-8B was used as a stand-in for Qwen3-4B because Qwen3-4B is not served on OpenRouter; Qwen3 family models below 8 B share the same post-training recipe, so output quality for structured agent tasks should be comparable with minor degradation at 4 B and more noticeable degradation at 1.7 B.

Interpretation

  • Qwen3-A3B family produces the tightest, correctly-ordered plans with no markdown fencing. Best quality-per-active-parameter in the sample.
  • Dense 3-4 B Qwen / Gemma / Ministral produce correct plans but add Markdown fences or small schema drift that the dispatcher must tolerate.
  • Llama 3.2 3B violated the ordering constraint – planner-unsafe without additional prompt discipline or rejection sampling.
  • Granite 4.0 H Micro emitted invalid JSON (duplicate object keys). Retest before adopting; may be endpoint-specific rather than the model.

3. Size Thresholds for capOS Use Cases

Mapping observed behaviour to the proposal’s workloads:

WorkloadMinimum credible sizeNotes
NPC dialogue, canned-reply replacement1.7 B denseTemplated plans only; refusal fragile
Short-list planner (≤5 typed tools)3 B denseFloor for credible multi-step ordering
Long-list planner, plan refine, step-up reasoning4 B dense or 30B-A3BRefusal, self-critique, schema-strict JSON
Log / audit summarisation, NPC with context4 B dense or 30B-A3BNeeds retrieval grounding regardless
Embedding / vector retrieval (TextEmbedder)separate small encoderNot a generator workload

Proposal §“Built-in Local Model” sketches a 0.7-2.0 GiB weight budget (q4 class). Qwen3-4B at q4_k_m is ~2.4 GiB, narrowly over that budget. Resolutions:

  1. Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
  2. Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at q5_k_m, ~2.0 GiB), acknowledging weaker planner quality.
  3. Ship Qwen3-1.7B as default and allow ModelAdmin.loadWeights to install Qwen3-4B or a 30B-A3B model post-install.

4. Recommendation for the Proposal

  1. Default built-in (ISO): Qwen3-4B-Instruct at q4_k_m, Apache-2.0. Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB. Fallback to SmolLM3-3B if a fully-open training-data provenance is required for the trusted-build-inputs chain.

  2. Optional installed upgrade: Qwen3-30B-A3B-Instruct-2507 for hosts with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense, materially better planning quality.

  3. Reject for default ship:

    • Ministral 3B (Mistral Research License – cannot redistribute on ISO).
    • Llama 3.2 3B (failed ordering discipline in the probe; Llama 3.2 Community License also restricts downstream use).
    • IBM Granite 4.0 H Micro until the JSON-output issue is confirmed or refuted on a local run.
  4. Update Open Question 3 of the proposal (“smallest credible local model”) with the threshold: 3 B dense is the floor for a planner that can be trusted with ordering constraints; 1.7 B is restricted to NPC / canned-reply territory.

5. Training Compute Cost for a Custom 2-B-Active Model

Rough order-of-magnitude estimate, on the chance that the project considers a purpose-trained capOS planner model rather than a fine-tune.

5.1 FLOPs Budget

Forward+backward training compute approximates 6 x N_active x D_tokens. Modern open models have drifted far past Chinchilla’s 20-tokens-per-param ratio; 5k-15k tokens per param is typical.

TargetActiveTokensFLOPs
Chinchilla-minimum 2 B dense2 B40 B4.8e20
Llama-3-ish 2 B dense2 B15 T1.8e23
Qwen3-4B-ish 2 B dense2 B36 T4.3e23
30B-A3B MoE (3 B active, 15 T tok)3 B15 T~4e23 (+ ~1.5x router/aux overhead)

5.2 Hardware -> Dollars

Reference: H100 SXM at ~40% MFU ~= 1.4e18 FLOPs / hour; cloud price $2-3 / hr (spot) to $3-4 (on-demand).

ScaleH100-hoursUSD (raw compute)Wall-clock on 1024 H100
Chinchilla 2 B (toy)~350~$1 k<1 hr
2 B @ 15 T tok~130 k~$400 k~5 days
2 B @ 36 T tok (SotA match)~310 k~$900 k~12 days
30B-A3B @ 15 T tok~290 k~$870 k~12 days

5.3 Public Calibration

  • Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
  • Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
  • Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
  • MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.

The 6ND estimate agrees with these published runs within a factor of ~2, which is appropriate for an order-of-magnitude planning number.

5.4 Full-Project Multiplier

Final training run is typically 20-30% of total project compute. Realistic end-to-end budget:

  • Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
  • Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
  • Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed compute cost; tokenizer corpus curation is non-trivial.
  • Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.

Realistic end-to-end to ship a capOS-class 2 B model from scratch: $3-10 M plus a team. A 30B-A3B MoE adds ~50%.

6. Practical Alternative

Training from scratch is almost certainly not worth it for the agent-shell use case. Two much cheaper paths that achieve the same capOS-specific behaviour:

  1. SFT / LoRA on Qwen3-4B or SmolLM3-3B for the capOS ActionPlan JSON schema, tool descriptors, and refusal patterns. ~10 k-100 k curated examples, 8 x H100 for 1-10 days ~= $500-$10 k. Reproducible on commodity cloud.

  2. Continued pretraining on a capOS-specific corpus (manifests, schemas, logs, proposals) if the base lacks domain coverage. Single digits of B tokens, $10 k-$100 k.

The only strong reason to train from scratch would be a fully verifiable weight provenance chain tied to docs/trusted-build-inputs.md. Even then, a reproducible fine-tune of a known base with a signed recipe captures most of the benefit at ~1% of the cost.

6a. nanoGPT / nanochat Scale Reference

Karpathy’s nanoGPT repo reproduces GPT-2 small (124 M params: 12 layers, 768 hidden, 12 heads) as its headline config. Karpathy’s follow-up nanochat (github.com/karpathy/nanochat) ships a full pretrain + SFT pipeline and uses model depth (d) as the size dial rather than parameter count. The README is the only authoritative source; the numbers below are quoted from it, not extrapolated.

  • d12 – “GPT-1 sized”, ~5 min pretraining for quick experiments.
  • d20 – documented speedrun tier: “$48 (~2 hours of 8xH100 GPU node)”, ~$15 on spot instance, “well below $100”. This is the headline reproducibility tier.
  • d24 – appears on the leaderboard as a “slightly overtrained baseline.”
  • d26 – “GPT-2 capability happens to be approximately depth 26”; latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for comparison.

The README does not publish explicit parameter counts per depth; the mapping from depth to params requires inspecting the config code.

Capability mapping to the capOS planner task (empirical, based on same-size published models rather than nanochat runs themselves):

nanochat scaleRough param bandPlanner capability
d12GPT-1-class, ~50-100 MToy completion only, no planner
d20likely ~100-200 M bandTemplated NPC lines; not a planner
d26GPT-2-class, ~100-400 M bandSimple JSON under strict priming; schema drift common
Hypothetical d30+unclear (not in README)Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2

Training a nanochat-class model from scratch fits a research-OS budget in a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is single-digit hours on 8xH100. That is the only scale at which “capOS ships a weight-provenance-complete default planner” is financially plausible without multi-million-dollar compute.

7. Open Follow-Ups

  • Verify Granite 4.0 H Micro JSON behaviour on a local llama.cpp run rather than the OpenRouter endpoint; the probe may have hit a streaming / formatting quirk specific to the provider.
  • Measure q4_k_m tokens-per-second for Qwen3-4B and Qwen3-1.7B on the CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC). No numbers are captured here; required before committing to a default.
  • Evaluate an embedding model separately (bge-m3, nomic-embed, gte-modernbert) for the TextEmbedder capability. Out of scope for this survey.
  • Revisit in 6 months: the 2-4 B frontier is moving monthly as of early 2026, and “best open weight” today may be superseded before the proposal’s Phase 2 begins.
  • nanochat d30+ quality and pricing. The README documents tiers up to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published numbers exist for d30 or beyond. Open questions, before committing to an in-tree from-scratch provenance model:
    • What is the parameter count for d30 (and d28, d32)? Derive from the nanochat config code, not inferred.
    • What training time and cost does d30 require to reach a non-trivial SFT-able checkpoint on the same 8xH100 setup? Expected band is roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but this needs measurement – depth scaling of wall-clock is not linear once the model stops fitting comfortably in per-GPU memory.
    • Does a d30-scale nanochat + capOS-specific SFT approach the Qwen3-1.7B planner floor on the section-2 probe? If yes, a provenance-complete default planner becomes realistic for ~$500-$5 k per full run (pretrain + SFT + ablations). If no, provenance has to be bought by fine-tuning a larger external base (Qwen3-1.7B or SmolLM3-3B) and accepting the weaker provenance story.
  • Tokenizer choice for any capOS from-scratch or continue-pretrain path. Independent of model scale or architecture. A capOS-specific tokenizer with reserved tokens for ActionPlan JSON structure, Cap’n Proto type IDs, capability interface names, and common schema keywords is plausible at the nanochat-class budget and may materially reduce tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the tokenizer is fixed by the base and this question collapses to “what special tokens can be added without retraining embeddings.”

Research: Hosted Agent Harnesses

Survey of current agent harness, swarm, memory, and interoperability patterns for capOS-Hosted Agent Swarms. The design question is how capOS should host OpenClaw-like personal agents without copying the ambient-host authority model common in desktop tools.

Source Snapshot

Checked on 2026-04-28:

DeepWiki was accessible for the related projects above. It is useful as a code-linked summary layer, but this note treats it as secondary to primary project docs and papers.

Design Consequences For capOS

  • Treat the harness as the product surface: workspace, memory, tool descriptors, approval, cancellation, audit, and task state matter as much as model choice.
  • Do not treat an agent workspace as a sandbox. In capOS, workspace boundaries should be enforced by capabilities, not by cwd conventions.
  • Keep the model out of the authority path. The model proposes structured tool calls; a trusted runner validates and invokes caps.
  • Use a persistent artifact model for agent knowledge. Raw sources, wiki pages, indexes, logs, and search indexes should be explicit, versioned, label-aware data, not hidden prompt history.
  • Borrow swarm patterns cautiously. Roles, review gates, and durable tasks are useful; anthropomorphic role names and unconstrained peer delegation are not.
  • Treat MCP and A2A as adapter protocols. They can carry descriptors, messages, and artifacts, not raw capOS authority.
  • Prefer deterministic harness proofs first: fake model, fake browser, fake mutating tool, explicit approval, and auditable transcript.

OpenAI Harness Engineering

OpenAI’s Harness engineering article frames the key operational lesson: agents can only reason over context they can inspect. Repo-local files, schemas, tests, executable plans, and mechanically enforced architecture are therefore stronger harness material than knowledge left in chat, external docs, or tacit human convention.

The 2026 Agents SDK update moves in the same direction: a model-native harness, controlled workspaces, sandbox execution, filesystem tools, MCP, skills, custom instructions, shell execution, and patch tools. The important point for capOS is not the Python API. It is the shape: agents need a runtime that makes inspection, action, state, and safety explicit.

capOS implication: proposals, research notes, schemas, CUE manifests, QEMU proofs, and workplan files are not just documentation. They are harness inputs. They should be versioned, concise, indexed, and mechanically checked where possible.

Applying Harness Engineering To This Repository

The capOS repository is already partially harness-engineered: it has AGENTS.md, CLAUDE.md, docs/tasks/README.md, review-finding task records, proposal and research indexes, CUE manifests, named Make targets, QEMU harnesses, and generated-code checks. The missing work is making those artifacts more agent-legible, mechanically navigable, and resistant to stale planning state. The normative repository plan is capOS Repository Harness Engineering. The checklist below is retained as research-derived input, not as a separate planning baseline.

Concrete work needed:

  1. Create a repo harness map. Add a concise docs/agent-harness.md that tells future agents where to find current state, design authority, task selection rules, QEMU proofs, generated-code rules, security review rules, and known stale/superseded documents. It should link, not duplicate, CLAUDE.md, docs/tasks/README.md, REVIEW.md, roadmap, backlog, proposals, research, and review-finding task records.

  2. Make task selection queryable. docs/tasks/README.md is human-readable but not easy to query mechanically. Add stable anchors or a small structured sidecar for selected milestone, immediate gates, active branches/worktrees, paused branches, and blocked findings. The sidecar can be generated from docs/tasks/README.md later; the first step is stable headings and consistent checkbox syntax.

  3. Add a design-status linter. Check that proposal status, proposal index, topics, docs/SUMMARY.md, docs/tasks/README.md pointers, and superseded markers agree. The repo already has mdBook metadata tooling; extend it so stale status drift becomes a failed check.

  4. Add a harness inventory for run targets. Generate or maintain a table of make run-* and make qemu-* targets with purpose, manifest, expected proof output, and owning proposal/backlog. Agents should not infer which QEMU proof applies by grepping Makefile fragments.

  5. Standardize research notes. Require every new external-design proposal to cite a docs/research/*.md note with source snapshot date, primary sources, secondary sources, design consequences, and open research gaps. This prevents proposals from becoming opaque summaries with no reusable research artifact.

  6. Add decision records for major pivots. The project currently records pivots in docs/tasks/README.md, proposals, and changelog. Add short ADR-style records for high-impact direction changes such as endpoint badges to service-object capabilities to session-bound invocation context. Agents need a stable “why this changed” artifact that is not buried in a long proposal.

  7. Expose schema and interface intent. For each important Cap’n Proto interface, add or generate a short doc page with authority semantics, granted-by paths, threat model, QEMU proofs, and known gaps. This maps the core capOS rule “interface is permission” into agent-readable harness context.

  8. Make stale document detection mechanical. Add front matter fields for status, supersedes, superseded_by, implemented_by, and last_reviewed where missing. Then check links both ways. An agent should be warned when it opens a superseded proposal without the replacement.

  9. Record proof transcripts as artifacts. QEMU harnesses validate behavior, but future agents often need the exact expected proof shape. Store bounded transcript excerpts or generated proof summaries under docs/proofs/ or a similar directory, with links from proposals and run-target inventory.

  10. Add eval tasks for agents. Create deterministic “agent can safely edit capOS” evals: find selected milestone, choose the right backlog, identify affected docs, avoid main-worktree edits, run the right check, and update status. These evals should be runnable without model calls by using scripted fixtures where possible.

  11. Create a local knowledge compilation path. Use the LLM Wiki pattern for capOS itself: raw sources are proposals/research/changelog/review notes; compiled pages summarize current capability model, shell path, session model, networking status, and QEMU proofs; lint finds contradictions and stale status. This should be generated into a clearly marked docs/agent-wiki/ tree or kept out of published docs until reviewed.

  12. Keep checks close to docs. Every process rule that matters to agents should have either a check, a generated index, or a fixture. Free-form instructions are useful but insufficient; the harness should fail when architecture or workflow invariants drift.

Near-term implementation order:

  1. Add docs/agent-harness.md.
  2. Add run-target inventory.
  3. Extend mdBook metadata checks for proposal status/index drift.
  4. Add front matter fields for superseded/replacement relationships.
  5. Add the first reviewed docs/agent-wiki/ compilation for the selected milestone only.

OpenClaw Harness Controls

OpenClaw is the closest current personal-agent analogue:

  • channel ingress through chat apps, webhooks, cron, and a gateway;
  • a local-first gateway security boundary;
  • an agent runtime with a workspace as the default tool cwd;
  • bootstrap instruction/memory/persona files injected into context;
  • built-in tools for read/write/edit, exec/process, browser, web, memory, and skills;
  • per-agent workspace, sandbox, and tool policy;
  • managed browser profiles and optional real-browser/remote-CDP routing;
  • markdown memory plus search/index plugins.

Important controls:

  • Exec exposes host selection (sandbox, gateway, node), security mode (deny, allowlist, full), approval prompts, timeouts, background sessions, PTY support, and restrictions on PATH/loader environment overrides.
  • Browser automation uses a managed Chromium profile, snapshots, screenshots, action refs, profile routing, and CDP. Arbitrary JavaScript evaluation is explicitly risky.
  • Memory stores markdown as source of truth. Search returns bounded snippets, file paths, and line ranges rather than entire memory files by default.
  • Multi-agent routing can assign different workspaces, sandboxes, and tool allow/deny lists to different agents.

DeepWiki adds code-linked observations: OpenClaw treats tools as functional capabilities and skills as SKILL.md extensions; includes a security audit surface; supports Docker/seccomp sandboxing; and uses a personal-assistant trust model. The OpenClaw skills summary also records the failure mode of a single growing MEMORY.md: context overflow, compaction loss, and poor retrieval.

capOS implication: copy the harness knobs, not the host authority model. Workspace, exec, browser, memory, and skills should be separate caps with auditable grants.

Memory and Wiki Systems

Karpathy’s LLM Wiki pattern shifts memory from query-time retrieval over raw chunks to a maintained artifact: immutable raw sources, an LLM-maintained markdown wiki, and a schema/instruction file that defines conventions. The key operations are ingest, query, and lint. The useful artifacts are index.md, log.md, page cross-links, source citations, and health checks for stale or contradictory pages.

OpenClaw memory and DeepWiki’s OpenClaw skills summary point to similar requirements:

  • daily append-only logs versus curated long-term memory;
  • markdown as human-inspectable source of truth;
  • local indexes using SQLite, FTS/vector search, or hybrid search;
  • snippets and line ranges for bounded recall;
  • background distillation, pruning, and health checks;
  • optional encryption or OS keychain integration for secret-adjacent memory.

capOS implication: AgentMemory should expose source, wiki, index, log, lint, and search subcaps. Wiki pages should carry provenance and labels. Remote embedding should be denied for high-label data.

Schema-Guided Reasoning

Abdullin’s Schema-Guided Reasoning describes using structured output schemas to force reasoning through predefined steps, produce auditable intermediate state, and validate outputs. It is especially relevant for local or weaker models.

capOS should use SGR for:

  • task intake and risk classification;
  • plan decomposition;
  • tool-call approval summaries;
  • source ingest and citation extraction;
  • code/design/security review;
  • final handoff and memory updates.

This is harness structure, not authority. A schema can make reasoning more testable, but the runner still enforces capabilities.

Swarm and Multi-Agent Frameworks

MetaGPT encodes standard operating procedures into multi-agent workflows. Its useful lesson is artifact gating: requirements, design, implementation, test, and review phases should produce intermediate outputs that later phases can inspect.

Generative Agents / Smallville contributes the memory-stream, reflection, and planning pattern for long-lived simulated agents. It is useful for NPCs, companion agents, and social simulations, but it is not an authority model. Believable behavior is not safe behavior.

Gas Town focuses on durable multi-agent engineering work: roles, workers, worktrees, convoys, merge queues, attribution, and handoff. Its strongest lesson is that work must survive chat-window loss and worker recycling.

AutoGen emphasizes actor-style asynchronous agent communication, distributed runtimes, tools, memory, observability, and group/team patterns. Microsoft Agent Framework adds a production framing: graph workflows, checkpointing, human-in-the-loop, durable execution, telemetry, and MCP/A2A integrations. LangGraph’s durable execution docs add a specific replay rule: side effects and non-determinism need task wrappers or idempotence so resumed workflows do not repeat external writes.

CrewAI and CAMEL-AI show the common high-level framework shape: agents, crews or societies, flows/workflows, memory/knowledge, toolkits, RAG, structured outputs, observability, and human-in-the-loop triggers.

OpenManus, summarized by DeepWiki, is a useful “general agent” reference: a think-act loop, multi-provider LLM support, MCP integration, sandboxed code and browser automation, and multiple entry points for general tasks, MCP, and data analysis.

capOS implication: implement durable AgentTask and SwarmScheduler first. Do not start with free-form inter-agent chat as the substrate.

Interoperability Protocols

MCP provides standardized tool/resource/prompt discovery and execution over JSON-RPC, with stdio and HTTP transports. It is useful for external tool ecosystems, but it is not a capability security model by itself.

capOS should translate MCP descriptors into capOS tool descriptors and execute through the trusted runner. Local stdio MCP servers should run with no ambient filesystem or network authority. Remote MCP should require explicit HttpEndpoint and credential caps.

A2A is a primary reference for peer-agent interoperability. Its project describes agent cards for discovery, negotiation of text/forms/media modalities, collaboration on long-running tasks, and operation without exposing internal state, memory, or tools. Its documented feature set includes JSON-RPC 2.0 over HTTP(S), synchronous request/response, streaming, push notifications, and exchange of text, file, and structured JSON data.

capOS should translate that into a stricter local bridge. Remote agents are untrusted peers. Agent cards map to reviewed descriptors, not authority. Incoming A2A messages become AgentMessage events delivered through an AgentInbox; task ids, causal parents, size limits, expiry, and sender identity are mandatory. Artifact references require separate caps before content is read. Requested actions become proposed tool calls. Requested authority becomes an approval request. Raw capOS caps should not cross an A2A bridge.

For local swarms, the same rule applies without the network protocol: agents coordinate through task records, inbox messages, resource leases, resource watches, and merge/release queues, not through free-form chat that tries to remember who is editing a repo, todo item, wiki page, or browser profile.

Research Still Missing

  • Primary security advisories for OpenClaw and comparable personal-agent runtimes, especially gateway exposure, node hosts, skills, browser profiles, exec approvals, memory, and provider credentials.
  • MCP security beyond the happy-path spec: tool poisoning, stdio command spawning, remote auth, marketplace signing, prompt injection, and lookalike tools.
  • A2A security and identity: authentication, authorization, task provenance, artifact integrity, and non-transfer of authority.
  • Browser automation containment: CDP risk, extension relays, logged-in profiles, downloads/uploads, arbitrary JS evaluation, clipboard, screenshots, SSRF/private-network policy, and deterministic testing.
  • Memory correctness: citation fidelity, contradiction detection, stale summaries, label propagation, hallucinated links, human review, and rollback.
  • Retrieval tradeoffs: index-first wiki navigation versus vector RAG, hybrid BM25/vector search, reranking, local embeddings, snippet budgets, and remote embedding denial.
  • Swarm evaluation: when parallel agents improve throughput, when they create coordination debt, how to assign work, and how to prevent review capture.
  • Local model viability for schema-following, tool calls, memory summarization, and offline embeddings.
  • Provider policy: data retention, regional routing, ephemeral credentials, revocation, spend controls, and audit of remote inference.
  • Formal authority model: prove that model text, memory text, remote agent messages, and MCP descriptors cannot mint capOS authority.

Research: Scientific Agent-Lab Software Stack

This note surveys existing scientific software that capOS should treat as adaptable service backends for a future agent-facing research lab. The central lesson is that capOS should not invent a new computer algebra system, solver, proof assistant, notebook system, or package manager. It should give agents typed, audited, resource-bounded capabilities over mature tools and preserve the exact environment, inputs, outputs, and proof artifacts needed for review.

Design Consequences For capOS

  • Provide a scientific-standard package as a service graph, not as ambient binaries on a global filesystem.
  • Start by adapting existing command-line and library tools behind narrow typed facades. Native rewrites are unjustified until a backend needs a smaller trusted core or a direct capability ABI.
  • Treat heavyweight systems such as SageMath, OSCAR, JupyterLab, Lean/mathlib, and Spack as environment subjects: they need package-store, workspace, process, network, cache, and quota policy, not just a binary launch API.
  • Expose solver and proof tools as deterministic request/response services whenever possible. A model should ask SmtSolver.check, ProofSession.build, or OptimizationSolver.solve, not run arbitrary shell text.
  • Keep formal proof assistants separate from automatic solvers. SMT results are useful evidence, but durable mathematical claims need proof artifacts checked by Lean, Rocq, Isabelle, Agda, or another trusted kernel.
  • Make provenance a first-class output. Every notebook cell, solver run, proof build, CAS session, package environment, model prompt, and data input should produce replayable metadata and an audit record.
  • Prefer open-source backends in the default package. Proprietary engines such as Wolfram Engine can be optional connector services with explicit license, network, and production-use metadata.

Source Baseline

External sources used for this survey:

Local grounding:

Tool Families

Computer Algebra And Exact Mathematics

PARI/GP is the obvious default for number-theory work. The upstream project is a cross-platform open-source computer algebra system designed for fast number theory computations, and it exposes both the gp shell and the PARI C library. For capOS, the C library is the better long-term service backend; the shell is useful for early compatibility and transcript capture.

SageMath is the best broad open-source mathematical umbrella. It is GPL software built on NumPy, SciPy, matplotlib, SymPy, Maxima, GAP, FLINT, R, and other packages, with the explicit mission of being a free alternative to Magma, Maple, Mathematica, and Matlab. Sage is too large to make into a small TCB component, but it is ideal as an agent lab kernel when the package store and Python compatibility layer exist.

GAP is the standard open system for computational discrete algebra, especially computational group theory. Singular is the specialized polynomial, commutative-algebra, algebraic-geometry, and singularity-theory system. OSCAR is the newer Julia-based research system that unifies GAP, Singular, Polymake, ANTIC/Hecke/Nemo/AbstractAlgebra-style capabilities across algebra, geometry, number theory, and polyhedral geometry. For capOS this suggests two levels: small typed services for common requests, and full language kernels for research workflows that need the native ecosystem.

SymPy is lightweight and embeddable because it is Python-based and has few dependencies. It is a good first symbolic backend for agents that need exact manipulation, code generation, and checkable expressions without launching Sage. SymPy should not replace PARI, GAP, Singular, or OSCAR for their specialist domains.

Numerical, Statistical, And Notebook Workflows

SciPy provides fundamental algorithms for scientific computing in Python: optimization, integration, interpolation, eigenvalue problems, algebraic and differential equations, statistics, sparse matrices, k-dimensional trees, and more. It sits on NumPy and compiled Fortran/C/C++ kernels, so capOS support depends on Python, native extension loading, BLAS/LAPACK packaging, and controlled native-code execution.

R remains the standard open statistical environment. GNU Octave remains the open MATLAB-like numerical environment for linear and nonlinear numerical work. Julia is strategically important because OSCAR, JuMP, SciML, and many modern research packages depend on it. The first capOS lab should host these as isolated language kernels rather than trying to normalize them into one universal API.

JupyterLab is the standard interactive computing front end because notebooks combine code, prose, equations, visualizations, outputs, and controls. capOS should adapt the notebook model but not grant notebook kernels ambient shell or filesystem authority. A future NotebookSession should start kernels with explicit workspace, package environment, data, network, and compute caps, and record every execution result as a reproducible artifact.

Satisfiability, Optimization, And Operations Research

Z3 and cvc5 are the primary open SMT backends to expose through a capOS SmtSolver capability. Both support stand-alone and library use. SMT-LIB should be a supported import/export format, but the service API should expose typed assumptions, objectives, timeouts, model requests, unsat cores, and proof/certificate availability explicitly.

For mathematical optimization, capOS should separate modeling layers from solver engines:

  • JuMP and CVXPY are high-level modeling interfaces that let researchers state optimization problems in Julia or Python.
  • HiGHS is a strong open backend for large sparse LP, MIP, and QP models.
  • SCIP is a broad optimization suite around constraint integer programming, with current Apache-licensed releases.
  • OR-Tools is a practical operations-research toolkit, especially for constraint programming, routing, scheduling, and combinatorial optimization.

The capOS API should accept common model formats and provide bounded solve jobs with time/memory limits, deterministic seeds when supported, solution certificates when available, and reproducibility metadata. It should not hide which backend solved the problem.

Formal Proof Systems

Lean 4 is both a general-purpose functional language and an interactive theorem prover, with mathlib as its main community mathematics library. It is the best default for agent-assisted formal mathematics because current LLM tooling and library momentum are strongest there.

Rocq, formerly Coq, remains an industrial-strength dependently typed prover with a long verification history and program extraction story. Isabelle is a generic proof assistant, with Isabelle/HOL and mature automation important for systems proofs. Agda is valuable for constructive type theory and dependently typed programming. A capOS lab should support all of them as separate proof kernel families instead of pretending they are interchangeable.

Agent integration should be conservative:

  • Agents may propose proof edits, search lemmas, call tactics, and run builds.
  • The proof checker decides whether a theorem is accepted.
  • Accepted proof artifacts must include toolchain version, library revision, package closure, command line, and full build log.
  • CAS or SMT evidence can guide a proof but is not the proof unless the proof assistant checks an imported certificate or independently reconstructs it.

Reproducible Environments And Package Stores

The scientific stack is too large and language-diverse for a hand-written capOS package format to be the first step. Existing systems offer useful pieces:

  • Nix provides isolated, declarative package builds and large package coverage.
  • Guix-HPC focuses on reproducible scientific deployment, per-user environments, and bit-for-bit repeatability from a specific Guix commit.
  • Spack is the HPC-oriented answer for many compiler, MPI, CPU-target, and library variant combinations.
  • Apptainer is common in HPC because it packages software into portable images while integrating with GPUs, high-speed networks, and shared filesystems.

capOS should not import any of these as the kernel package manager. Instead, it should adapt their recipe and closure ideas into capability-native PackageCatalog, PackageClosure, Environment, and BuildService interfaces. Early implementations can execute Nix/Guix/Spack/Apptainer on a Linux host or sidecar; later capOS can consume signed closures as Store objects.

What An LLM Research Lab Needs

A credible LLM agent research lab on capOS needs more than model inference:

  • Workspace service. Branchable project workspaces with exact input, output, patch, and artifact history.
  • Package environments. Content-addressed software closures for Python, Julia, R, C/C++/Fortran, Lean, Rocq, Isabelle, GAP, PARI, Sage, and solver stacks.
  • Notebook service. Jupyter-compatible documents and kernels, but kernels receive explicit caps instead of ambient filesystem, process, or network access.
  • Experiment registry. Runs have immutable parameters, model versions, prompts, tool descriptors, seeds, package closures, datasets, results, and reviewer decisions.
  • Solver/proof services. CAS, SMT, optimization, and formal proof systems are high-level tool capabilities with structured inputs and bounded resources.
  • Literature and retrieval. Paper, code, dataset, citation, and note stores are ordinary namespaces; retrieval does not imply authority to fetch or publish.
  • Job graph orchestration. Long calculations, training/evaluation jobs, proof builds, benchmark sweeps, and multi-agent tasks need resumable job graphs with cancellation and status.
  • Compute authority. CPU, memory, storage, network, GPU, realtime, and external-provider quotas must be explicit and visible in the audit log.
  • Human review surfaces. Agents can generate results, but publication, credential use, external API calls, irreversible filesystem changes, and proof-of-result claims need review gates.

Adaptation Strategy

The near-term path is a staged compatibility bridge:

  1. Hosted Linux sidecars. Run existing stacks on Linux while capOS exposes them as remote capability services. Use namespace/cgroup/seccomp/Landlock sandboxes for trusted batch tools; use hardware-backed Linux guests (QEMU/KVM first, Firecracker/Kata-style microVMs later) for untrusted notebooks, model-generated code, package builds, and multi-tenant agent jobs. Treat User-Mode Linux as a developer/debug fallback, not the primary strong-isolation boundary. This proves interfaces and audit before native package support exists.
  2. Command-wrapper services. Wrap tools such as gp, gap, Singular, lean, lake, rocq, isabelle, octave, Rscript, and solver CLIs with explicit input/output directories, timeout, memory, and network policy.
  3. Library-backed services. Replace wrappers with direct C/C++/Rust/Julia FFI or process-local RPC for small stable APIs such as PARI, Z3, cvc5, and HiGHS.
  4. Notebook and language kernels. Add Python, Julia, R, Sage, and Lean kernels with capOS-authored kernel launchers and artifact capture.
  5. Package-closure ingestion. Import Nix/Guix/Spack closures as signed Store objects, then build a capOS-native catalog around content hashes, licenses, vulnerability metadata, CPU/GPU compatibility, and provenance.
  6. Native capOS services. Only after the interface stabilizes, port the most useful small engines or linkable libraries into native userspace.

Risks And Open Questions

  • Supply-chain size. Sage, Julia, Python scientific stacks, and proof libraries bring huge dependency closures. capOS must record and constrain them rather than pretend they are small trusted components.
  • Nondeterminism. Floating-point math, randomized solvers, parallel BLAS, GPU kernels, and package resolution can make replay differ. Results need deterministic seeds and variance metadata, not only final answers.
  • License boundaries. GPL, LGPL, Apache, MIT, BSD, proprietary, academic, and optional commercial solvers need explicit metadata before packaging.
  • Proof trust. A CAS result, SMT model, or solver objective value can be false because of bugs, numeric tolerance, or bad modeling. Formal proof claims must be checked by the named proof kernel or labeled as empirical.
  • Agent overreach. The default scientific package must not grant arbitrary shell, network, credential, package-install, or publishing authority to a model. Agents receive tools through runner policy, not direct backend caps.
  • Notebook security. Notebooks are executable documents. Opening one is not consent to run it with the reader’s caps.
  • Linux sidecar boundary drift. Namespaces, seccomp, Landlock, gVisor, User-Mode Linux, KVM guests, and microVMs are different security and compatibility claims. capOS must record the backend, host kernel, policy, image hashes, guest tickless/nohz state, and capOS outer NoHzEligibility/NoHzActivation state rather than labeling all of them “sandboxed Linux”.

Recommendation

Define a future scientific-standard package as a curated service graph with three profiles:

  • Base. PARI/GP, SymPy, Z3, cvc5, HiGHS, Lean, and artifact/provenance services through tight command or library wrappers.
  • Research. SageMath, GAP, Singular, OSCAR/Julia, R, Octave, JuMP, CVXPY, SCIP, OR-Tools, Jupyter-compatible notebooks, and package-closure support.
  • Lab. Hosted-agent workspaces, experiment registry, browser/web research tools, GPU-backed model/scientific kernels, distributed job graphs, and publication/review workflows.

The Base profile is the first useful target for agents: exact number theory, symbolic manipulation, SMT checking, linear/integer optimization, and Lean proof checking are enough to make an agent substantially more reliable without granting it a general-purpose scientific workstation.

Research: Multimedia Pipeline Latency

Survey of PipeWire and JACK design lessons for a capOS multimedia graph whose explicit goal is the minimal possible guaranteed-stable stack latency.

Goal

The capOS multimedia pipeline should optimize for the lowest end-to-end latency that capOS can guarantee stable under the selected workload, device, and routing graph. “Guaranteed stable” means the graph is admitted only when the kernel/services can reserve enough CPU, memory, device, and wakeup budget for every realtime cycle, and the graph fails closed when those guarantees can no longer be met. A graph that reports a smaller nominal buffer but produces xruns, underruns, clock drift, or large tail latency is worse than a graph with a slightly larger fixed quantum and a schedulability guarantee.

The target is not one universal latency number. The target is a measurable operating point with an explicit contract:

  • fixed sample rate and quantum for the realtime island;
  • bounded callback/process time per node;
  • bounded graph traversal time per cycle;
  • admitted worst-case execution budget for every node and bridge;
  • reserved memory and pre-registered buffers for the whole graph;
  • no allocation, blocking IPC, paging, logging, or credential checks on the realtime data path;
  • visible latency contribution per node, link, bridge, device, and provider;
  • admission rejection when the graph cannot fit the selected quantum;
  • fail-closed handling through bypass, silence, stream stop, or quantum renegotiation rather than unbounded queue growth;
  • policy that can choose “lowest stable” for pro audio and “efficient stable” for ordinary desktop/media playback.

This guarantee applies to local capOS-controlled realtime islands. It does not extend to browser scheduling, networks, or remote model/provider inference. Those parts can be measured, bounded by policy where possible, and isolated from the local graph, but not honestly guaranteed by capOS.

PipeWire Lessons

PipeWire separates graph configuration and IPC from realtime data processing. Its graph scheduling documentation describes a main thread for IPC and graph configuration and data processing threads that run with realtime priority. Node resources, buffers, I/O areas, and metadata are prepared in shared memory before realtime processing begins.

PipeWire also treats graph quantum and rate as first-class timing controls. Synchronous links can process in the same cycle, while asynchronous links add one cycle of latency. Its latency model propagates min/max latency through ports and adds latency when links or nodes introduce buffering.

Consequences for capOS:

  • Media graph control and media graph processing should be separate execution domains.
  • Buffers and metadata must be preallocated before the realtime cycle starts.
  • A link that crosses an isolation, clock, process, network, or wakeup boundary must declare its additional latency instead of hiding it.
  • Latency should be graph metadata, not an after-the-fact measurement only.
  • Quantum and rate are policy inputs, not incidental driver details.

JACK Lessons

JACK was designed for professional low-latency audio. Its API centers on a process callback invoked by the JACK server at the correct time, graph-order callbacks, xrun notification, and port latency ranges. JACK’s latency API asks clients to report min/max latency so applications can detect routing that has become anomalous or needs compensation.

Consequences for capOS:

  • A capOS native audio graph needs a cycle callback model for realtime nodes, even if the public API is capability-oriented rather than JACK-compatible C.
  • The realtime callback contract must be restrictive: no blocking endpoint calls, no dynamic allocation, no filesystem/name lookups, and no waiting for policy decisions.
  • Xruns and deadline misses are not debug trivia. They are first-class graph events that policy can use to increase quantum, disable expensive nodes, or move work to a different scheduling context.
  • Per-port latency ranges are more useful than a single optimistic value.

Guarantee Model

capOS should use a guarantee ladder rather than a single vague “low latency” mode:

LevelMeaningAllowed uses
Best effortNo reserved budget; telemetry onlyordinary media, background capture
Bounded soft realtimeDeadlines and drops, but no formal admission proofweb shell voice, remote model paths
Guaranteed realtime islandFixed quantum, admitted CPU/memory/device budgets, fail-closed overrunsnative audio, local voice, pro-audio paths
Hard device deadlineDriver/device deadline is reserved and violation is treated as a system fault for that islandfuture dedicated hardware paths

The first serious multimedia milestone should target guaranteed realtime islands for local audio. Web shell and remote model voice should remain bounded soft realtime because the browser/provider/network portions are outside local control.

Admission should require:

  • every node declares worst-case execution time or a conservative budget;
  • every bridge declares buffering and wakeup latency;
  • every buffer pool is allocated and pinned/registered before start;
  • every realtime thread has a scheduling context with period, budget, and priority;
  • graph topology is frozen for the active cycle plan;
  • overrun policy is configured before start.

If admission fails, the graph does not start at that quantum. If a running graph misses its guarantee, the system records a violation and applies the configured fail-closed policy instead of preserving continuity by accumulating hidden latency.

Stack Latency Model

For capOS, “stack latency” should be modeled as a composed budget:

flowchart LR
    DeviceIn[ADC / capture device] --> DriverIn[driver period]
    DriverIn --> CaptureRing[capture ring]
    CaptureRing --> Graph[media graph quantum cycles]
    Graph --> Bridge[process / isolation / network bridges]
    Bridge --> Codec[codec / resampler / model adapter]
    Codec --> PlaybackRing[playback ring]
    PlaybackRing --> DriverOut[driver period]
    DriverOut --> DeviceOut[DAC / playback device]

Each edge should carry:

  • latency min/max in frames or nanoseconds;
  • clock domain;
  • quantum/rate;
  • buffering depth;
  • deadline;
  • drift estimate;
  • xrun/drop counters.

The useful metric is not just nominal round-trip latency. For guaranteed islands it is the admitted latency bound plus violation count. For softer paths it is nominal latency, p95/p99 process-cycle latency, worst observed cycle over a window, xrun rate, and drift between capture and playback clocks.

capOS Media Graph Shape

The multimedia graph should be a userspace service family:

flowchart LR
    Control[control plane endpoint] --> GraphManager[MediaGraphManager]
    GraphManager --> Policy[latency / route / permission policy]
    GraphManager --> Nodes[node services]
    Nodes --> Rings[MemoryObject media rings]
    Rings --> Driver[audio/video driver services]
    Rings --> Apps[application nodes]
    Rings --> Provider[realtime model provider nodes]

The control plane may use normal capability endpoints. The data plane should use shared MemoryObject rings plus futex/notification wakeups. Cap’n Proto messages remain appropriate for graph setup, route changes, permission checks, and telemetry, but not for per-frame audio payload copying.

Node classes:

  • driver node: owns device-facing caps such as DeviceMmio, DMAPool, and Interrupt;
  • graph driver node: provides the cycle clock for a realtime island;
  • transform node: resampler, mixer, echo canceller, VAD, format converter;
  • app node: user application capture/playback endpoint;
  • bridge node: crosses process, clock, network, provider, or web boundary;
  • realtime model node: provider/local model adapter that consumes and emits media plus control events.

Guaranteed Realtime Islands

capOS should not try to make the whole desktop one realtime graph. It should support small realtime islands with explicit rate/quantum policy:

  • pro-audio island: low quantum, strict admission, few nodes, no remote model hop in the realtime loop;
  • voice-agent island: low enough latency for conversation, with VAD/barge-in priority and bounded buffering;
  • ordinary media island: efficient quantum and power policy;
  • screen/video island: frame-deadline oriented rather than audio-period oriented.

Bridges between islands are allowed, but each bridge declares the latency it adds. A bridge from a guaranteed island to a non-guaranteed island must not backpressure the guaranteed island. It may drop, resample, replace with silence, or move to a larger negotiated quantum, but it must not create an unbounded queue. This is the PipeWire/JACK lesson in capOS terms: do not hide async links.

Scheduling Implications

Per-SQE deadlines are useful for stale work handling, but they are not enough for guaranteed multimedia latency. The graph needs future scheduling contexts:

  • period: graph quantum duration;
  • budget: maximum CPU time per period for a node or node group;
  • priority: realtime island priority relative to other interactive work;
  • affinity: optional CPU isolation for device and graph threads;
  • overrun policy: drop, silence, bypass node, increase quantum, or stop graph.

Until scheduling contexts exist, capOS can only prototype bounded soft realtime. The design should still attach monotonic deadlines to media buffers and SQEs so late work is discarded deterministically instead of accumulating hidden latency, but documentation should not claim a local guarantee before admission and budget reservation exist.

Web Shell And Remote Models

Web shell voice and remote realtime models cannot provide guaranteed local stack latency across the full path. Browser scheduling, WebRTC/WebSocket transport, provider inference, and network jitter all sit outside capOS control.

The capOS goal still applies: guarantee the part of the stack capOS controls when it is inside an admitted realtime island, then expose the rest as measured latency and jitter:

  • browser capture/playback buffer estimates;
  • gateway queue depth;
  • provider adapter send/receive jitter;
  • model first-audio latency;
  • tool-call pause duration;
  • barge-in cancellation latency;
  • playback underrun/drop counters.

This argues for a local media graph even when the model session is provider native. The local graph is where capOS can enforce bounded buffers, drops, deadlines, and audit.

Design Rules

  1. Prefer fixed quantum inside a realtime island.
  2. Reject graph activation or graph changes that cannot be admitted at the selected quantum unless policy explicitly relaxes the guarantee.
  3. Treat every async boundary as one or more declared latency cycles.
  4. Keep realtime callbacks pure data processing.
  5. Move permission checks, tool execution, logging, graph mutation, and model policy to non-realtime threads.
  6. Preallocate buffers and register memory before starting the graph.
  7. Use latency ranges and measured telemetry, not a single optimistic latency.
  8. Provide fail-closed policy that stops, bypasses, silences, or renegotiates quantum when a guarantee is violated, rather than letting queues grow.
  9. Preserve capability isolation even when it costs a cycle; make the cost explicit and measurable.
  10. Keep pro-audio/local paths independent from remote-provider voice paths.

Open Questions

  • What is the first capOS-visible latency target: voice shell, local playback, or pro-audio loopback?
  • Should graph-driver threads live in a privileged media service, or can an application own a realtime island under broker policy?
  • How should admission control estimate whether a new node can fit a quantum before activating it?
  • Should bridge latency be specified by policy, measured dynamically, or both?
  • Which telemetry window should determine when a bounded-soft-realtime path should switch to a larger quantum?
  • How should future CPU donation interact with graph scheduling contexts?

References

Research: Realtime Multimodal Agent APIs

Survey of provider APIs for realtime native-audio, multimodal, tool-using agents, and the consequences for capOS voice agent-shell, web shell, media graph, scheduling, and capability boundaries.

Scope

This report focuses on APIs where a model can consume realtime audio and emit both audio output and structured tool calls in one session. That is distinct from a chained pipeline where the application separately runs ASR, a text model, and TTS.

The immediate capOS question is whether the earlier agent-shell design should remain text-first with optional ASR/TTS wrappers, or whether it needs a first-class realtime multimodal model session.

Source Snapshot

All source observations below were checked against official provider documentation on 2026-04-25.

  • The companion multimedia pipeline latency note covers PipeWire and JACK lessons for low-latency graph scheduling, latency reporting, realtime callbacks, and stable quantum selection.
  • OpenAI Realtime API docs describe speech-to-speech sessions, WebRTC and WebSocket transports, realtime function calling, interruption/truncation, and the gpt-realtime model family.
  • OpenAI Voice Agents docs explicitly frame the architecture choice as direct live audio sessions versus chained speech-to-text, text-agent, and text-to-speech pipelines.
  • Google AI Gemini Live API docs describe realtime audio/image/text input, audio output, WebSocket transport, VAD, barge-in, tool use, and ephemeral tokens for client-to-server browser use.
  • Vertex AI Gemini Live API docs describe the enterprise/cloud variant with realtime voice/video, native audio, transcriptions, function calling, Google Search grounding, and provisioned-throughput-oriented deployment considerations.

Provider Findings

OpenAI Realtime API

OpenAI’s Realtime API is a stateful session API for low-latency interactions with realtime models. The docs describe calling models such as gpt-realtime for speech-to-speech conversations over WebRTC or WebSocket, with the session carrying model, voice, conversation items, and generated responses.

Important details for capOS:

  • Browser clients are steered toward WebRTC for more consistent media performance; server-to-server integrations are steered toward WebSocket.
  • WebRTC media and control are split: audio is handled by the peer connection, while other events travel over a data channel.
  • WebSocket integrations carry JSON events and require the application to manage input and output audio buffers directly.
  • Realtime function calling is session/response configured. The model emits a function_call item with a name, JSON arguments, and a generated call id. The application executes the function and sends back a function_call_output conversation item keyed by that call id.
  • Realtime interruption is a first-class path. With VAD, user speech can cancel an ongoing model response. WebRTC/SIP paths have server-side knowledge of played audio; WebSocket paths require the client to stop playback and send truncation metadata for unplayed audio.
  • gpt-realtime-1.5 is documented as a realtime audio-in/audio-out model with text, audio, and image input; text and audio output; and function calling. The current model page marks video as unsupported.

OpenAI’s Voice Agents docs expose the architectural tradeoff directly: live speech-to-speech sessions are the natural low-latency path, while chained ASR plus text-agent plus TTS gives stronger intermediate control and is often more appropriate for approval-heavy workflows.

Google AI Gemini Live API

Google AI’s Gemini Live API is a realtime stateful WebSocket API. The developer docs describe audio, image, and text input; audio output; VAD; barge-in; transcriptions; proactive audio; affective dialog; and tool use.

Important details for capOS:

  • The Google AI developer API lists input audio as raw 16-bit PCM at 16 kHz little-endian, image input as JPEG at up to 1 FPS, and output audio as raw 16-bit PCM at 24 kHz little-endian.
  • The public developer API supports server-to-server and client-to-server approaches. Client-to-server avoids backend media proxy latency but requires ephemeral tokens rather than long-lived API keys in client code.
  • Ephemeral tokens are Live-API-only, short-lived credentials. Google documents default timing behavior of roughly one minute to start a new session and thirty minutes for sending messages over a connection, with the ability to restrict tokens to Live API model/config constraints.
  • Tool use supports function calling and Google Search. Function declarations are installed in session configuration, and the client must manually send tool responses. Google AI docs distinguish synchronous function calls from non-blocking function declarations on models that support them, with response scheduling options such as interrupting current model output, waiting until idle, or staying silent.
  • Tool support differs by model family and revision. The Google AI docs list Gemini 3.1 Flash Live Preview and Gemini 2.5 Flash Live Preview with function calling, but not all asynchronous behavior is supported by every model.

Vertex AI Gemini Live API

Vertex AI’s Live API docs describe the Google Cloud deployment path. The docs currently present gemini-live-2.5-flash-native-audio as generally available and recommended for low-latency voice agents, with native audio, transcriptions, VAD, affective dialog, proactive audio, and tool use. They also document a preview native-audio model and state a deprecation date for the older preview native-audio release.

The Vertex AI page is relevant to capOS for enterprise deployment:

  • It documents raw PCM input/output rates and a stateful WSS protocol.
  • It describes realtime voice/video agents, tool use through function calling and Google Search, audio transcriptions, barge-in, and proactive audio.
  • It points at partner WebRTC integrations, while the core Vertex API remains WebSocket-oriented in the referenced docs.
  • It exposes cloud operational concerns not present in the simple developer API view: access management, request logging, provisioned throughput, PayGo variants, quotas, and regional/cloud deployment policy.

Comparison

AxisOpenAI RealtimeGemini Live APIVertex AI Live API
Primary low-latency model shapeRealtime model sessionLive model sessionCloud Live model session
Browser media pathWebRTC recommendedWebSocket with ephemeral token; partner WebRTC integrations existPartner WebRTC integrations; core docs emphasize WSS
Server pathWebSocketWebSocket via Gen AI SDK/raw protocolWebSocket via Gen AI SDK/raw protocol
InputText/audio/image on current realtime modelsAudio/image/textAudio/video/text
OutputText/audioAudio in Google AI overviewAudio/text in Vertex overview
Tool callsFunction calling, client executes and returns outputFunction calling, client sends FunctionResponseFunction calling and Google Search grounding
InterruptionVAD, cancellation, output truncationVAD/barge-inVAD/barge-in
Client credential patternOpenAI ephemeral client secrets for browser realtimeLive-API ephemeral tokensCloud auth/service identity; client direct path depends on deployment

The practical conclusion is that a capOS abstraction should not bake in a single provider transport. OpenAI’s best browser path is WebRTC; Gemini’s core developer path is WebSocket with ephemeral tokens; Vertex AI adds enterprise auth and throughput controls. The common semantic layer is not “WebRTC” or “WebSocket.” It is a realtime model session carrying media frames, transcripts, model audio output, structured tool calls, tool results, cancellation, and session policy.

Consequences For capOS

A First-Class RealtimeModelSession

The existing language-model proposal is text-centric:

  • LanguageModel.complete
  • LanguageModel.stream
  • tool calls emitted in assistant messages
  • runner executes tools

That remains useful. It should not be stretched to pretend realtime audio is just a token stream. Native realtime voice models need a sibling interface:

interface RealtimeModel {
    info @0 () -> (info :RealtimeModelInfo);
    open @1 (config :RealtimeSessionConfig) -> (session :RealtimeModelSession);
}

interface RealtimeModelSession {
    sendInput @0 (event :RealtimeInputEvent) -> ();
    next @1 () -> (event :RealtimeOutputEvent, done :Bool);
    sendToolResult @2 (result :RealtimeToolResult) -> ();
    cancel @3 (reason :CancelReason) -> ();
    close @4 () -> ();
}

This interface lets a provider adapter hide whether it is OpenAI WebRTC, OpenAI WebSocket, Gemini WebSocket, Vertex AI, a local model, or a future GPU pipeline. It also keeps the existing capOS rule: the model never receives session authority. It emits structured tool calls, and the trusted runner executes or refuses them.

Direct Native Audio Versus Chained Pipeline

capOS should support both.

Use a direct native-audio session when:

  • the user expects conversational voice with low latency;
  • barge-in and expressive speech matter;
  • the provider model can safely handle tool-call turns in the same session;
  • provider telemetry, cost, and policy permit streaming user audio off-box.

Use a chained pipeline when:

  • the workflow is approval-heavy or destructive;
  • deterministic transcript capture is mandatory before reasoning;
  • ASR and TTS need to be local for privacy;
  • the agent runner needs to inspect, redact, or transform text before model inference;
  • the session is anonymous or guest and broker policy forbids remote live audio.

For web-shell voice, direct native audio is a better interactive experience, but the chained path is the safer fallback and the better first local proof.

Tool Calls Remain Proposals

Realtime providers can emit tool calls while producing or pausing audio. capOS must still treat those calls exactly like text-agent tool calls:

  1. The model emits a structured call name and arguments.
  2. The agent runner validates the call against advertised tool descriptors.
  3. Broker policy decides auto, consent, stepUp, or forbidden.
  4. The runner invokes the underlying typed capability if allowed.
  5. The runner sends a tool result back into the realtime session.
  6. Audit records bind model id, session id, tool descriptor revision, typed arguments, permission decision, outcome, and any spoken/user confirmation.

The model must not hold the tool caps. The provider session must not receive raw TerminalSession, Launcher, ProcessSpawner, tokens, credentials, or session bundle authority.

Audio Is Not Terminal Text

Voice input should not be encoded as TerminalSession.readLine, and output audio should not be TerminalSession.writeLine. The terminal stream remains a presentation channel. Voice is a sibling media channel bound to the same authenticated session id.

This separation matters because realtime audio has properties terminal text does not:

  • frame timestamps;
  • playback positions;
  • output truncation;
  • VAD and barge-in events;
  • partial transcripts;
  • deadline and stale-frame handling;
  • binary frame formats;
  • provider-specific session ids and event ids.

Media Graph Substrate

Provider-native realtime sessions do not eliminate the need for a local media graph. The graph becomes the local routing and policy layer, with the explicit goal of minimizing and guaranteeing the portion of stack latency capOS controls inside admitted realtime islands:

flowchart LR
    Mic[BrowserMic / DeviceMic] --> Capture[capture buffer]
    Capture --> Gate[VAD or push-to-talk gate]
    Gate --> Adapter[provider adapter or local ASR]
    Adapter --> Session[RealtimeModelSession]
    Session --> Runner[tool-call gate in agent runner]
    Runner --> Output[model audio output / local TTS]
    Output --> Playback[playback buffer]
    Playback --> Speaker[BrowserSpeaker / DeviceSpeaker]

On native capOS, device-facing audio eventually needs DeviceMmio, DMAPool, and Interrupt authority. On WebShellGateway, browser WebAudio/WebRTC handles physical microphone/speaker I/O, while capOS still owns the session authority and tool execution boundary. The graph should follow the multimedia latency research rule: use admitted realtime islands, preallocated media rings, declared async-link latency, fail-closed overrun policy, and xrun/deadline telemetry rather than hidden buffering.

Scheduling And Deadlines

Realtime voice is soft realtime for web-shell use:

  • capture frames should be forwarded before they become stale;
  • model output audio should be played or discarded, not accumulated without bound;
  • barge-in must beat model momentum;
  • tool execution must not block media handling forever.

Per-SQE or per-media-frame deadlines are useful metadata, but not authority. CPU guarantees still belong to future scheduling contexts. The media graph and realtime provider adapter should attach absolute monotonic deadlines to frames, tool calls, and playback events so stale work can be dropped deterministically.

Browser/WebShellGateway Implications

Provider docs support two deployment shapes:

  • Browser connects directly to provider using provider-issued ephemeral credentials. This minimizes media latency but exposes provider session traffic directly to browser JavaScript.
  • Browser streams media to WebShellGateway, which connects to the provider server-side. This keeps provider credentials off the browser and lets capOS inspect/redact/rate-limit audio, but adds gateway latency.

For capOS, direct browser-to-provider media should be treated as an optimized media path, not the baseline authority model. The baseline should keep WebShellGateway and the agent runner in control of session lifecycle, tool-call gating, audit, and teardown. If direct provider media is later used, it should initially be media-only unless the provider offers a trusted server-side control channel that lets the capOS adapter receive tool calls, send tool results, and revoke the provider session without relying on browser JavaScript.

The later browser-agent UI model is a separate policy choice: browser JavaScript may receive provider tool-call events and orchestrate the provider loop, but it still receives no capOS session caps or tool authority. Every provider tool call must be forwarded as a structured ToolRequest to WebShellGateway, and the gateway must validate descriptor freshness, session state, consent/step-up, quotas, replay protection, and audit before invoking real capOS capabilities. If those gateway controls are unavailable, provider tool declarations must be disabled in the direct browser session and all tool-capable turns must use gateway-mediated provider sessions. The browser receives only short-lived, provider-scoped, model/config-locked tokens minted by a broker-controlled service.

  1. Keep LanguageModel for text and chained workflows.
  2. Add RealtimeModel / RealtimeModelSession for native realtime multimodal sessions.
  3. Model provider adapters should be ordinary services:
    • OpenAIRealtimeProvider
    • GeminiLiveProvider
    • VertexLiveProvider
    • LocalRealtimeProvider
  4. A capOS-side agent runner or WebShellGateway’s server-side tool proxy remains the only holder of session caps and the only executor of real capOS tools.
  5. WebShellGateway owns browser transport, media channels, and browser-agent tool proxy enforcement, but browser JavaScript owns no tool authority.
  6. Media graph primitives should use MemoryObject, notifications, futexes, and scheduling contexts as they land.
  7. Direct browser-to-provider connections require broker-minted ephemeral credentials and explicit audit of what bypasses gateway media inspection.

Open Design Questions

  • Should RealtimeModelSession expose provider event ids verbatim, or should it normalize them to capOS-generated ids and retain provider ids only in audit metadata?
  • Should direct provider WebRTC be allowed for operator sessions, or should all production web-shell voice flow through WebShellGateway?
  • How much partial transcript text is trusted enough to render before the provider marks it final?
  • Can a provider-generated audio response be spoken before pending consent or stepUp decisions are resolved, or must speech pause at tool-call gates?
  • How should local wake-word/VAD models be sandboxed so they can improve UX without becoming an authorization factor?
  • Should media-frame deadlines be added to the existing SQE reserved field, or kept in media-ring metadata until the scheduler has scheduling contexts?

References

Research: Robotics Realtime Control

Survey of robotics realtime-control practice and the consequences for using capOS as a robot brain for industrial robots, vacuum/mobile robots, RC cars, drones, and autonomous vehicles.

Scope

This note is about the operating-system and middleware boundary, not robot kinematics or control theory. The capOS question is whether a capability OS can be a credible robot brain without pretending that every perception, planning, networking, and actuator path has the same timing or safety requirements.

The answer is conditional:

  • capOS is a plausible high-level robot brain and isolation substrate.
  • capOS should eventually host bounded realtime control islands.
  • capOS should not claim certified hard-realtime safety-controller status until scheduling contexts, driver isolation, timing analysis, fault containment, and certification evidence exist.
  • For early physical robots, capOS should supervise and coordinate while microcontrollers, PLCs, motor controllers, or flight controllers close the tightest safety loops.

Source Snapshot

External source observations below were checked on 2026-04-25.

Related local grounding:

External Findings

ROS 2 Realtime Direction

ROS 2 documentation frames realtime computing as central to autonomous vehicles, spacecraft, and industrial manufacturing. Its realtime programming guide emphasizes periodic loops, bounded jitter, and avoiding page faults, dynamic allocation, and indefinitely blocking synchronization on the realtime path.

The ROS 2 design background makes a sharper point: an OS can provide deterministic services, but application code must still avoid nondeterministic behavior. It recommends separating startup/preallocation, realtime-safe loop, and teardown phases. This maps directly to capOS admission: graph setup may use ordinary capability calls, but the admitted realtime cycle must run over preallocated buffers and pre-authorized work.

ros2_control

The ros2_control controller manager is a useful concrete precedent. It owns a periodic hardware-control loop whose shape is read state, update controllers, and write commands. Its documentation attempts to run the main controller thread under SCHED_FIFO, reports controller/hardware periodicity and execution-time diagnostics, and warns that normal Linux is throughput-oriented rather than ideal for hardware control.

Consequences for capOS:

  • The robot-control API should make the cyclic read/update/write loop explicit.
  • Controller activation, hardware claiming, fallback, and limits are safety policy, not incidental plugin mechanics.
  • Periodicity, execution time, overruns, and command-limit enforcement need to be first-class telemetry.
  • A controller state query or lifecycle transition that is not realtime-safe must be prohibited inside the admitted control loop.

micro-ROS Executor

micro-ROS documents why the default ROS 2 executor is problematic for deterministic robotic control: timer precedence, non-preemptive round-robin callback execution, no explicit callback priority, and only one input per handle can all create priority inversion and weak latency bounds. Its rclc Executor adds static sequential execution, trigger conditions, optional multi-thread scheduling configuration, and Logical Execution Time semantics. It also allocates callbacks during configuration, not during runtime.

Consequences for capOS:

  • A robot graph should have an explicit execution plan, not generic event-loop fairness.
  • Sense-plan-act phases should be expressible as a timed DAG with trigger conditions.
  • LET-style input/output boundaries are useful for sensor fusion and multi-rate control where lower jitter is worth one controlled period of latency.
  • Runtime graph mutation belongs outside the realtime cycle.

Current Research Trend

A 2026 ROS 2 realtime survey reports that recent work focuses on executor analysis, DDS communication delays, response time, reaction time, data age, message filters, profiling tools, and micro-ROS. That confirms that the hard part is not merely “use ROS 2”; it is making callback scheduling, data age, and communication delays analyzable.

ReDAG-RT, submitted in March 2026, is a recent example of the same pressure. It adds a user-space global scheduler for ROS 2 callback DAGs using rate-priority ordering and per-DAG concurrency bounds. The result is relevant even if capOS does not run ROS 2 unchanged: robot workloads want graph-level scheduling policy with bounded interference, not only thread priorities.

A UAV PREEMPT_RT paper submitted in April 2026 studies a 250 Hz flight-control loop on Raspberry Pi 5 and isolates timing effects from deferred Linux activation paths versus direct realtime activation. The useful warning for capOS is that multicore SoC shared-resource contention can dominate nominal loop frequency. Capability isolation is not sufficient without temporal and cache/bus interference accounting.

seL4 MCS And Timing Work

seL4 MCS exposes scheduling contexts as kernel-managed objects, including periodic threads and passive servers. The Trustworthy Systems timing work emphasizes deadline guarantees, temporal isolation, and WCET analysis for kernel paths.

Consequences for capOS:

  • Processor time should become explicit authority. A process that can command a motor still needs budget authority to do so at a period.
  • Passive-server and scheduling-context donation semantics fit robot services: a controller can run on the caller’s admitted budget when that is the intended timing contract.
  • Hard realtime claims require bounded kernel paths and timing evidence, not only a priority scheduler.

Linux PREEMPT_RT And Xenomai

The Linux kernel now documents PREEMPT_RT internals, including priority inheritance, threaded interrupts, and differences from non-RT kernels. Xenomai remains a strong precedent for systems that split stringent realtime work into a co-kernel or companion core while keeping Linux services available for ordinary work.

Consequences for capOS:

  • There is a practical ladder: normal scheduling, soft realtime with telemetry, admitted realtime islands, and hard device deadlines.
  • If capOS cannot yet provide hard bounds, it should make that status visible instead of hiding it behind a “realtime” label.
  • A future capOS robotics platform may still delegate the smallest motor or flight-control loop to an MCU/RTOS while capOS owns capability isolation, planning, perception, logging, updates, and operator control.

Orocos

Orocos is a long-running robotics control precedent: portable C++ libraries for advanced machine and robot control, with the Real-Time Toolkit as a component framework for realtime components.

Consequence for capOS: robotics developers need component lifecycle, deployment, ports, and runtime introspection. capOS should not expose only raw actuator writes; it needs a component/graph model where a realtime component can be admitted, activated, monitored, and deactivated without granting broad device authority.

Mobile Robots, Drones, And Cars

Nav2 presents a production-grade ROS 2 navigation framework for mobile and surface robots, with perception, planning, control, localization, behaviors, collision monitoring, docking, and teleoperation. It is the right class of software for vacuum cleaners, warehouse robots, rovers, and small RC-car autonomy, but it is not itself a hard-safety controller.

PX4 recommends ROS 2 for companion-computer integration when low latency and Linux libraries matter, while the autopilot remains the flight controller. ArduPilot documents the same split: companion computers consume MAVLink telemetry and make higher-level decisions while the autopilot owns the hard vehicle-control loop.

Autoware is the comparable open-source autonomous-driving stack. It is built on ROS and presents perception, localization, planning, control, and vehicle interface modules for autonomous driving. That is the right architectural shape for a capOS self-driving-car prototype: capOS can isolate and supervise modules, but a safety-certified vehicle interface and independent safety controller remain mandatory.

Manufacturing Interoperability

OPC UA Companion Specifications exist to define industry/device-specific information models and environment profiles. OPC UA is designed to scale from field-level devices to enterprise management. For manufacturing robots, this matters because the robot brain rarely talks only to motors; it must also exchange state, jobs, alarms, and audit data with PLCs, MES/SCADA systems, and vendor controllers.

Consequence for capOS: industrial integration should use typed gateway services. A capOS robot brain should expose and consume narrow manufacturing capabilities such as RobotCellStatus, JobQueue, SafetyState, ProgramSelector, and AlarmLog, not ambient network sockets or filesystem paths.

Timing Classes

Robots mix several timing classes:

ClassTypical loopExamplescapOS stance
Hard safetymicroseconds to millisecondse-stop chain, torque disable, flight stabilizationexternal certified controller first; future capOS only with evidence
Cyclic motion control250 Hz to 4 kHz or higherjoint servo, wheel velocity, PWM/ESC updates, EtherCAT cyclefuture admitted realtime island; early offload to MCU/PLC
Local autonomy10 Hz to 100 Hzobstacle avoidance, local planner, odometry fusionplausible early capOS target with deadline/drop telemetry
Perception and mapping1 Hz to 60 Hzcamera/lidar processing, SLAM, object detectioncapOS service graph, GPU/NPU caps later
Mission behaviorevent-driven to 10 Hzroute plan, behavior tree, job dispatch, teleop modestrong capOS fit
Fleet/cloud integrationseconds and slowerlogs, updates, digital twin, MES/SCADAstrong capOS fit

The mistake would be to put all of these on one generic executor and call it a robot brain. The capOS advantage is that each row can have different authority, budget, telemetry, and failure policy.

Domain Consequences

Manufacturing Robots

capOS can plausibly supervise a robot cell:

  • isolate vendor robot gateways, PLC gateways, camera/lidar services, planning services, operator UI, audit, and update agents;
  • hold explicit capabilities for cell state, job selection, robot program invocation, fixtures, safety-state observation, and logs;
  • run non-safety planning and perception near the robot;
  • bridge OPC UA, fieldbus, and vendor APIs through narrow service caps.

capOS should not initially replace:

  • certified safety PLCs;
  • e-stop and guarding;
  • servo drives’ inner control loops;
  • vendor-certified robot-controller safety functions.

Vacuum Cleaners And Indoor Mobile Robots

capOS is a better early fit here:

  • high-level mapping, route planning, room segmentation, cleaning policy, docking, telemetry, and operator control are natural services;
  • wheel PID, bumper debounce, cliff sensors, battery protection, and motor current cutoffs can stay on a small MCU;
  • Nav2-like navigation concepts can map to capOS graph services and typed actuator/sensor caps.

The first useful physical demo could be a small differential-drive base with capOS running on an SBC and an MCU exposing a typed BaseDrive cap.

RC Cars And Rovers

An RC-car class platform is a good capOS autonomy test because it is simple enough to instrument and unsafe enough to require strict boundaries:

  • capOS can run teleop, camera perception, local planning, logging, and a geofenced mission controller;
  • PWM/ESC steering and throttle should be mediated by a microcontroller or device service with a watchdog;
  • command caps should carry speed, steering, freshness deadline, and mode;
  • stale or revoked command authority should force neutral throttle and safe steering.

Drones

capOS should be a companion computer first:

  • consume MAVLink/uORB-like telemetry through a typed autopilot bridge;
  • run perception, mapping, object tracking, mission planning, and logging;
  • send high-level setpoints only through a FlightSetpoint cap with mode, envelope, rate, and geofence limits;
  • never bypass the flight controller’s arming, failsafe, and stabilization logic in early stages.

Self-Driving Cars

capOS is a research host for autonomous-driving software, not a near-term safety-certified vehicle OS:

  • isolate perception, localization, prediction, planning, map, and vehicle interface modules;
  • make every actuator-affecting path explicit and auditable;
  • use a safety gateway that clamps commands to an envelope and can degrade to minimal-risk behavior;
  • keep independent safety monitors and hardware controls outside the model or planner process.

The useful capOS contribution is not “the LLM drives the car.” It is a capability and timing architecture that prevents perception, model, network, UI, or update components from accidentally gaining actuator or safety authority.

capOS Design Consequences

Robot Brain Means Authority Router, Not Monolith

The robot brain should be a composed service graph:

flowchart LR
    Sensors[Sensor services] --> Perception[Perception]
    Perception --> World[World model]
    World --> Planner[Planner / behavior]
    Planner --> Control[Controller island]
    Control --> Actuators[Actuator gateway]

    Safety[Safety monitor] --> Control
    Safety --> Actuators
    Operator[Operator UI / teleop] --> Planner
    Audit[Audit / telemetry] --- Sensors
    Audit --- Control

The security boundary is the capability graph. The timing boundary is the admitted realtime island. Both must be visible in documentation and telemetry.

Control-Loop Admission

A future ControlLoopManager should admit a loop only after it has:

  • fixed period and deadline;
  • declared worst-case execution budget;
  • preallocated command/state buffers;
  • reserved scheduling context;
  • pinned or registered memory for device I/O;
  • bounded input data age policy;
  • actuator command clamp policy;
  • overrun policy;
  • watchdog/freshness behavior;
  • audit/telemetry route outside the realtime path.

No Cap’n Proto allocation, service discovery, logging, credential lookup, model inference, network fetch, filesystem access, or policy prompt belongs in the admitted loop.

Capability Shapes

Likely future interfaces:

interface SensorStream {
  describe @0 () -> (info :SensorInfo);
  openRing @1 (config :StreamConfig) -> (ring :MemoryObject);
  readStatus @2 () -> (status :StreamStatus);
}

interface ActuatorCommand {
  describe @0 () -> (info :ActuatorInfo);
  submit @1 (command :CommandFrame) -> (accepted :Bool);
  neutral @2 (reason :Text) -> ();
}

interface ControlLoop {
  describe @0 () -> (info :LoopInfo);
  start @1 () -> ();
  stop @2 (reason :Text) -> ();
  readTelemetry @3 () -> (telemetry :LoopTelemetry);
}

interface SafetyState {
  read @0 () -> (state :SafetySnapshot);
  subscribe @1 () -> (events :SensorStream);
}

CommandFrame should include sequence, monotonic timestamp, deadline, coordinate frame, mode, limit profile, and typed payload. A stale command is a failed command.

Robot Description And Frames

capOS needs a typed robot description model rather than an ambient URDF file path. A robot description service should expose:

  • kinematic tree;
  • named frames and transforms;
  • joint limits and command interfaces;
  • sensors, actuators, and calibration;
  • safety envelopes and operating modes;
  • firmware/controller identity;
  • simulation twins.

The description is read-only to most services. Mutating calibration or limits requires a separate authority and should produce audit records.

ROS 2 Compatibility

capOS should not try to replace the robotics ecosystem in the first pass. It should host compatibility bridges:

  • ROS 2 graph bridge for topics/actions/services;
  • micro-ROS/MCU bridge for embedded controllers;
  • MAVLink bridge for autopilots;
  • OPC UA bridge for manufacturing cells;
  • simulation bridge for Gazebo/Isaac/Webots-like tools.

Each bridge receives only the caps it needs. A ROS bridge should not become an ambient authority tunnel from the ROS graph to every actuator.

Models And Agents

Language or vision-language models can help with:

  • operator command interpretation;
  • diagnostics and log summarization;
  • task planning under human approval;
  • visual inspection;
  • code/config generation in simulation.

They must not hold actuator caps. Model output is untrusted. A planner or agent may propose a mission step, but a trusted runner must validate it against tool descriptors, safety state, geofence, mode, and command limits before any actuator-affecting capability is invoked.

Safety And Certification Gap

capOS currently has no certification story for:

  • IEC 61508 / ISO 13849 / ISO 10218 / ISO 26262 style evidence;
  • bounded interrupt latency on target hardware;
  • WCET for kernel paths;
  • IOMMU-backed driver isolation for physical devices;
  • independent safety monitor authority;
  • safe boot/update rollback for robots;
  • fault-injection and hardware-in-loop test evidence.

Therefore the honest position is:

  • research/simulation: capOS can be the main robot OS;
  • hobby mobile robot: capOS can be the SBC brain with MCU safety;
  • industrial cell: capOS can supervise and integrate, not replace safety PLCs;
  • self-driving car: capOS can host research autonomy modules behind a safety gateway, not claim road-safety control.

Implementation Path

  1. Simulation-only robot graph: fake sensors, fake actuators, behavior service, and audit, all over typed capabilities.
  2. Differential-drive demo: BaseDrive MCU bridge, encoder/IMU sensor stream, watchdog, stale-command neutral behavior, and QEMU/host simulation proof.
  3. ROS 2/Nav2 bridge: import/export selected topics/actions with explicit caps and no broad graph authority.
  4. Control-loop telemetry: deadline, data age, overrun, stale command, clamp, watchdog reset, and safety-state event counters.
  5. Realtime island prototype: fixed-period local controller over preallocated rings once scheduling contexts and notification objects exist.
  6. Device authority integration: fieldbus/CAN/EtherCAT/serial through DeviceMmio, DMAPool, Interrupt, or userspace driver caps after the DMA isolation gate.
  7. Manufacturing gateway: OPC UA/PLC bridge exposing cell status, job dispatch, alarms, and robot-program selection as typed caps.
  8. Autonomy stack: perception/planning/control services with explicit timing and safety envelopes.

Open Questions

  • Should capOS define a native robot-description schema or import URDF/SDF into a normalized capability service?
  • Should the first physical demo target a differential-drive base, RC car, or manipulator simulator?
  • What is the smallest useful scheduling-context API for a 50-100 Hz mobile robot controller?
  • How should transform-tree state be represented: service, shared snapshot ring, or both?
  • Where should command-limit enforcement live: actuator gateway, controller, safety monitor, or all three with different authority?
  • Can the same media graph ring shape support camera/lidar frames and audio, or does robot perception need a distinct sensor-stream ABI?

References